Similar Documents
20 similar documents retrieved (search time: 456 ms)
1.
We construct a weighted Euclidean distance that approximates any distance or dissimilarity measure between individuals that is based on a rectangular cases-by-variables data matrix. In contrast to regular multidimensional scaling methods for dissimilarity data, our approach leads to biplots of individuals and variables while preserving all the good properties of dimension-reduction methods that are based on the singular-value decomposition. The main benefits are the decomposition of variance into components along principal axes, which provide the numerical diagnostics known as contributions, and the estimation of nonnegative weights for each variable. The idea is inspired by the distance functions used in correspondence analysis and in principal component analysis of standardized data, where the normalizations inherent in the distances can be considered as differential weighting of the variables. In weighted Euclidean biplots, we allow these weights to be unknown parameters, which are estimated from the data to maximize the fit to the chosen distances or dissimilarities. These weights are estimated using a majorization algorithm. Once this extra weight-estimation step is accomplished, the procedure follows the classical path in decomposing the matrix and displaying its rows and columns in biplots.
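A minimal Python sketch of the weight-fitting idea (not the paper's majorization algorithm): because squared weighted Euclidean distances are linear in the per-variable weights, nonnegative least squares can fit the weights to squared target dissimilarities. All function and variable names here are illustrative.

```python
import numpy as np
from scipy.optimize import nnls

def fit_variable_weights(X, Delta):
    """Fit nonnegative per-variable weights w so that the weighted
    Euclidean distances sqrt(sum_k w_k (x_ik - x_jk)^2) approximate
    the target dissimilarities Delta (n x n, symmetric). Fitting
    squared distances to squared dissimilarities makes the problem
    linear in w, so plain NNLS applies."""
    n, p = X.shape
    rows, targets = [], []
    for i in range(n):
        for j in range(i + 1, n):
            rows.append((X[i] - X[j]) ** 2)   # one design row per pair (i, j)
            targets.append(Delta[i, j] ** 2)
    w, _ = nnls(np.array(rows), np.array(targets))
    return w

# toy usage: recover the weights that generated the dissimilarities
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
w_true = np.array([2.0, 0.5, 1.0, 0.0])
Delta = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2 * w_true).sum(-1))
print(fit_variable_weights(X, Delta))   # close to w_true
```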

2.
An approach is presented for analyzing a heterogeneous set of categorical variables assumed to form a limited number of homogeneous subsets. The variables generate a particular set of proximities between the objects in the data matrix, and the objective of the analysis is to represent the objects in low-dimensional Euclidean spaces, where the distances approximate these proximities. A least squares loss function is minimized that involves three major components: (a) the partitioning of the heterogeneous variables into homogeneous subsets; (b) the optimal quantification of the categories of the variables; and (c) the representation of the objects through multiple multidimensional scaling tasks performed simultaneously. An important aspect from an algorithmic point of view is the use of majorization. The use of the procedure is demonstrated by a typical example of possible application, i.e., the analysis of categorical data obtained in a free-sort task. The results of points of view analysis are contrasted with a standard homogeneity analysis, and the stability is studied through a Jackknife analysis.

3.
Two algorithms for pyramidal classification — a generalization of hierarchical classification — are presented that can work with incomplete dissimilarity data. These approaches — a modification of the pyramidal ascending classification algorithm and a least squares based penalty method — are described and compared using two different types of complete dissimilarity data in which randomly chosen dissimilarities are assumed missing and the non-missing ones are subjected to random error. We also consider relationships between hierarchical classification and pyramidal classification solutions when both are based on incomplete dissimilarity data.

4.
Clustering techniques are based upon a dissimilarity or distance measure between objects and clusters. This paper focuses on the simplex space, whose elements, compositions, are subject to non-negativity and constant-sum constraints. Any data analysis involving compositions should fulfill two main principles: scale invariance and subcompositional coherence. Among fuzzy clustering methods, the FCM algorithm is broadly applied in a variety of fields, but it is not well-behaved when dealing with compositions. Here, the adequacy of different dissimilarities in the simplex, together with the behavior of the common log-ratio transformations, is discussed on the basis of compositional principles. As a result, a well-founded strategy for FCM clustering of compositions is suggested. Theoretical findings are accompanied by numerical evidence, and a detailed account of our proposal is provided. Finally, a case study is illustrated using a nutritional data set known in the clustering literature.
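The log-ratio machinery this entry relies on can be sketched briefly, assuming the standard centered log-ratio (clr) transform and the Aitchison distance; the paper's specific FCM strategy is not reproduced here.

```python
import numpy as np

def clr(X):
    """Centered log-ratio transform: maps compositions (rows of strictly
    positive parts summing to a constant) into unconstrained real space.
    Euclidean distance on clr coordinates is the Aitchison distance,
    which is scale-invariant and subcompositionally coherent."""
    L = np.log(X)
    return L - L.mean(axis=1, keepdims=True)

def aitchison_distance(x, y):
    return np.linalg.norm(clr(x[None, :])[0] - clr(y[None, :])[0])

# two compositions (proportions of 3 parts)
x = np.array([0.6, 0.3, 0.1])
y = np.array([0.4, 0.4, 0.2])
print(aitchison_distance(x, y))
# scale invariance: rescaling a composition leaves the distance unchanged
print(aitchison_distance(10 * x, y))
```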

5.
NP-hard Approximation Problems in Overlapping Clustering
Lp-norm (p < ∞). These problems also correspond to the approximation by a strongly Robinson dissimilarity or by a dissimilarity fulfilling the four-point inequality (Bandelt 1992; Diatta and Fichet 1994). The results are extended to circular strongly Robinson dissimilarities, indexed k-hierarchies (Jardine and Sibson 1971, pp. 65-71), and to proper dissimilarities satisfying the Bertrand and Janowitz (k + 2)-point inequality (Bertrand and Janowitz 1999). Unidimensional scaling (linear or circular) is reinterpreted as a clustering problem and its hardness is established, but only for the L1 norm.

6.
Analytic procedures for classifying objects are commonly based on the product-moment correlation as a measure of object similarity. This statistic, however, generally does not represent an invariant index of similarity between two objects if they are measured along different bipolar variables where the direction of measurement for each variable is arbitrary. A computer simulation study compared Cohen's (1969) proposed solution to the problem, the invariant similarity coefficient r_c, with the mean product-moment correlation based on all possible changes in the measurement direction of individual variables within a profile of scores. The empirical observation that r_c approaches the mean product-moment correlation with increases in the number of scores in the profiles was interpreted as encouragement for the use of r_c in classification research. Some cautions regarding its application were noted. This research was supported by the Social Sciences and Humanities Research Council of Canada, Grant no. 410-83-0633, and by the University of Toronto.
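The comparison baseline described above, the mean product-moment correlation over all possible reorientations of the variables, can be computed by brute force for small profiles. A sketch follows; Cohen's r_c itself is not implemented here.

```python
import numpy as np
from itertools import product

def mean_correlation_over_reflections(a, b):
    """Brute-force mean of the product-moment correlation between two
    profiles over all 2^k reorientations of the k bipolar variables.
    Flipping variable v reverses its measurement direction, which
    negates the v-th score in *both* profiles."""
    k = len(a)
    rs = []
    for signs in product([1.0, -1.0], repeat=k):
        s = np.array(signs)
        rs.append(np.corrcoef(a * s, b * s)[0, 1])
    return np.mean(rs)

rng = np.random.default_rng(1)
a, b = rng.normal(size=8), rng.normal(size=8)
print(mean_correlation_over_reflections(a, b))
```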

7.
In various data settings, it is necessary to compare observations from disparate data sources. We assume the data is in the dissimilarity representation (Pękalska and Duin, 2005) and investigate a joint embedding method (Priebe et al., 2013) that results in a commensurate representation of disparate dissimilarities. We further assume that there are "matched" observations from different conditions which can be considered to be highly similar, for the sake of inference. The joint embedding results in the joint optimization of fidelity (preservation of within-condition dissimilarities) and commensurability (preservation of between-condition dissimilarities between matched observations). We show that the tradeoff between these two criteria can be made explicit using weighted raw stress as the objective function for multidimensional scaling. In our investigations, we use a weight parameter, w, to control the tradeoff, and choose match detection as the inference task. Our results show that the weights that are optimal with respect to the inference task differ from equal weighting of commensurability and fidelity, and that the proposed weighted embedding scheme provides significant improvements in statistical power.
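A rough Python sketch of a weighted raw stress objective of this kind, with weight (1 - w) on fidelity terms and w on commensurability terms; the paper's exact weighting and optimization scheme may differ, and all names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.optimize import minimize

def joint_stress(Z_flat, D1, D2, w, n, dim):
    """Weighted raw stress for a joint embedding of two conditions.
    Rows 0..n-1 of Z embed condition 1, rows n..2n-1 embed condition 2;
    row i and row n+i are 'matched'. Fidelity terms (within-condition)
    get weight (1 - w); commensurability terms (matched pairs, with
    target dissimilarity 0) get weight w."""
    Z = Z_flat.reshape(2 * n, dim)
    D = squareform(pdist(Z))
    fid = ((D[:n, :n] - D1) ** 2).sum() + ((D[n:, n:] - D2) ** 2).sum()
    com = sum(D[i, n + i] ** 2 for i in range(n))
    return (1 - w) * fid + w * com

rng = np.random.default_rng(2)
n, dim = 10, 2
X = rng.normal(size=(n, dim))
D1 = squareform(pdist(X))
D2 = squareform(pdist(X + 0.1 * rng.normal(size=X.shape)))  # noisy second condition
res = minimize(joint_stress, rng.normal(size=2 * n * dim),
               args=(D1, D2, 0.5, n, dim), method="L-BFGS-B")
print(res.fun)
```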

8.
The paper presents a methodology for classifying three-way dissimilarity data, which are reconstructed by a small number of consensus classifications of the objects, each defined by the sum of two order-constrained distance matrices, so as to identify both a partition and an indexed hierarchy. Specifically, the dissimilarity matrices are partitioned into homogeneous classes and, within each class, a partition and an indexed hierarchy are simultaneously fitted. The model proposed is mathematically formalized as a constrained mixed-integer quadratic problem to be fitted in the least-squares sense, and a computationally efficient alternating least-squares algorithm is proposed. Two applications of the methodology are also described, together with an extensive simulation to investigate the performance of the algorithm.

9.
The nearest neighbor interchange (nni) metric is a distance measure providing a quantitative measure of dissimilarity between two unrooted binary trees with labeled leaves. The metric has a transparent definition in terms of a simple transformation of binary trees, but its use in nontrivial problems is usually prevented by the absence of a computationally efficient algorithm. Since recent attempts to discover such an algorithm continue to be unsuccessful, we address the complementary problem of designing an approximation to the nni metric. Such an approximation should be well-defined, efficient to compute, comprehensible to users, relevant to applications, and a close fit to the nni metric; the challenge, of course, is to compromise these objectives in such a way that the final design is acceptable to users with practical and theoretical orientations. We describe an approximation algorithm that appears to satisfy these objectives adequately. The algorithm requires O(n) space to compute dissimilarity between binary trees with n labeled leaves; it requires O(n log n) time for rooted trees and O(n² log n) time for unrooted trees. To help the user interpret the dissimilarity measures based on this algorithm, we describe empirical distributions of dissimilarities between pairs of randomly selected trees for both rooted and unrooted cases. The Natural Sciences and Engineering Research Council of Canada partially supported this work with Grant A-4142.

10.
This paper studies the random indexed dendrograms produced by agglomerative hierarchical algorithms under the non-classifiability hypothesis of independent identically distributed (i.i.d.) dissimilarities. New tests for classifiability are deduced. The corresponding test statistics are random variables attached to the indexed dendrograms, such as the indices, the survival time of singletons, the value of the ultrametric between two given points, or the size of classes at the different levels of the dendrogram. For an indexed dendrogram produced by the Single Link method on i.i.d. dissimilarities, the distribution of these random variables is computed, thus leading to explicit tests. For the Average and Complete Link methods, some asymptotic results are presented. The proofs rely essentially on the theory of random graphs.

11.
Given a set of objects and a symmetric matrix of dissimilarities between them, Unidimensional Scaling is the problem of finding a representation by locating points on a continuum. Approximating dissimilarities by the absolute value of the difference between coordinates on a line constitutes a serious computational problem. This paper presents an algorithm that implements Simulated Annealing in a new way, via a strategy based on a weighted alternating process that uses permutations and point-wise translations to locate the optimal configuration. Explicit implementation details are given for least squares loss functions and for least absolute deviations. The weighted, alternating process is shown to outperform earlier implementations of Simulated Annealing and other optimization strategies for Unidimensional Scaling in run time efficiency, in solution quality, or in both.
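For illustration, here is a generic simulated-annealing sketch for least-squares unidimensional scaling that mixes permutation (swap) moves with point-wise translations; this is a simplified stand-in, not the paper's weighted alternating process.

```python
import numpy as np

def uds_annealing(D, iters=20000, t0=1.0, cool=0.9995, seed=0):
    """Simulated-annealing sketch for unidimensional scaling: place n
    points on a line so that |x_i - x_j| approximates D[i, j] in the
    least-squares sense. Moves alternate between swapping two points
    (permutation move) and translating one point (local move)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    x = rng.normal(size=n)

    def stress(x):
        return ((np.abs(x[:, None] - x[None, :]) - D) ** 2).sum() / 2

    cur, t = stress(x), t0
    for _ in range(iters):
        y = x.copy()
        if rng.random() < 0.5:                       # swap two coordinates
            i, j = rng.choice(n, 2, replace=False)
            y[i], y[j] = y[j], y[i]
        else:                                        # translate one point
            y[rng.integers(n)] += t * rng.normal()
        new = stress(y)
        if new < cur or rng.random() < np.exp((cur - new) / t):
            x, cur = y, new                          # accept move
        t *= cool
    return x, cur

D = np.abs(np.subtract.outer([0, 1, 3, 7], [0, 1, 3, 7])).astype(float)
x, s = uds_annealing(D)
print(np.sort(x), s)   # coordinates recovered up to shift/reflection
```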

12.
Two fundamental approaches to the comparison of classifications (e.g., partitions on the same finite set of objects) can be distinguished. One approach is based upon measures of metric dissimilarity, while the other is based upon measures of similarity, or consensus. These approaches are not necessarily simple complements of each other. Instead, each captures different, limited views of the comparison of two classifications. The properties of these measures are clarified by their relationships to Day's complexity models and to association measures of numerical taxonomy. The two approaches to comparison are equated with the use of separation-sensitive and minimum-value-sensitive measures, suggesting the potential application of an intermediate-sensitive measure to the problem of comparison of classifications. Such a measure is a linear combination of separation-sensitive and minimum-value-sensitive components. The application of these intermediate measures is contrasted with the two extremes. The intermediate measure for the comparison of classifications is applied to a problem of character weighting arising in the analysis of Australian stream basins. We thank Bill Day, Mike Austin, Peter Minchin and two anonymous referees for many helpful comments. We also thank P. Arabie for useful discussion of consensus methods and character weighting.

13.
One key point in cluster analysis is to determine a similarity or dissimilarity measure between data objects. When working with time series, the concept of similarity can be established in different ways. In this paper, several non-parametric statistics originally designed to test the equality of the log-spectra of two stochastic processes are proposed as dissimilarity measures between time series data. Their behavior in time series clustering is analyzed through a simulation study and compared with the performance of several model-free and model-based dissimilarity measures. Three different classification settings were considered: (i) distinguishing between stationary and non-stationary time series, (ii) classifying different ARMA processes, and (iii) classifying several non-linear time series models. As expected, the performance of a particular dissimilarity metric strongly depended on the type of processes subjected to clustering. Among all the measures studied, the nonparametric distances showed the most robust behavior.
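One simple spectral dissimilarity in this spirit, assuming a plain Euclidean distance between log-periodograms rather than any of the paper's specific test statistics:

```python
import numpy as np

def log_spectral_distance(x, y):
    """Simple nonparametric dissimilarity between two series: Euclidean
    distance between their log-periodograms (a stand-in for the
    log-spectra test statistics discussed in the paper)."""
    n = min(len(x), len(y))

    def logper(z):
        z = z[:n] - z[:n].mean()
        p = np.abs(np.fft.rfft(z)) ** 2 / n
        return np.log(p[1:])                 # drop the zero frequency

    return np.linalg.norm(logper(x) - logper(y))

rng = np.random.default_rng(3)
white = rng.normal(size=512)
ar = np.zeros(512)                           # AR(1) series with phi = 0.8
for t in range(1, 512):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()
print(log_spectral_distance(white, ar))
print(log_spectral_distance(white, rng.normal(size=512)))  # typically smaller
```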

14.
A classification of presence/absence based dissimilarity coefficients
Several desirable order properties for dissimilarity coefficients based on presence/absence of attributes are given and several popular dissimilarity coefficients are examined with respect to these properties. A characterization for rational functions with linear numerator and linear denominator satisfying all of the desirable properties is given.
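Two of the popular coefficients examined in such studies, written as rational functions of the 2×2 presence/absence counts (an illustrative sketch; the paper's axiomatic treatment is not reproduced):

```python
def presence_absence_dissimilarities(u, v):
    """Classical presence/absence dissimilarities from the 2x2 counts:
    a = both present, b and c = mismatches, d = both absent. Jaccard and
    simple matching are rational functions with linear numerator and
    linear denominator in (a, b, c, d), the family characterized in the
    paper."""
    a = sum(x and y for x, y in zip(u, v))
    b = sum(x and not y for x, y in zip(u, v))
    c = sum((not x) and y for x, y in zip(u, v))
    d = sum((not x) and (not y) for x, y in zip(u, v))
    jaccard = (b + c) / (a + b + c) if a + b + c else 0.0
    simple_matching = (b + c) / (a + b + c + d)
    return {"jaccard": jaccard, "simple_matching": simple_matching}

print(presence_absence_dissimilarities([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```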

15.
Based on a comprehensive survey of records of stellar brightness and brightness variation in ancient Chinese texts, the records of stellar brightness grades are analyzed in detail, showing that ancient China also had a brightness classification scheme similar to the six-magnitude system. All asterisms whose descriptions mention stars "disappearing" are cross-identified against modern variable stars, showing that these records describe atmospheric extinction rather than ancient observers noticing variable stars within the asterisms. A full analysis of the ancient descriptions of the three great variable stars of the whole sky, Algol (大陵五), δ Cephei (造父一), and Mira (蒭藁增二), shows that ancient China left no clear record of brightness variation for any of these three most famous variable stars. A comprehensive analysis of the ancient records leads to the conclusion that the earliest Chinese variable-star record is the brightness variation of 井宿七 in the 29th year of the Hongwu reign (1396), recorded in the Astronomy Treatise of the History of Ming (明史·天文志); although this date is relatively late, it still predates the earliest Western variable-star record by 200 years.

16.
Efficient algorithms for agglomerative hierarchical clustering methods
Whenever n objects are characterized by a matrix of pairwise dissimilarities, they may be clustered by any of a number of sequential, agglomerative, hierarchical, nonoverlapping (SAHN) clustering methods. These SAHN clustering methods are defined by a paradigmatic algorithm that usually requires O(n³) time, in the worst case, to cluster the objects. An improved algorithm (Anderberg 1973), while still requiring O(n³) worst-case time, can reasonably be expected to exhibit O(n²) expected behavior. By contrast, we describe a SAHN clustering algorithm that requires O(n² log n) time in the worst case. When SAHN clustering methods exhibit reasonable space distortion properties, further improvements are possible. We adapt a SAHN clustering algorithm, based on the efficient construction of nearest neighbor chains, to obtain a reasonably general SAHN clustering algorithm that requires O(n²) time and space in the worst case. Whenever n objects are characterized by k-tuples of real numbers, they may be clustered by any of a family of centroid SAHN clustering methods. These methods are based on a geometric model in which clusters are represented by points in k-dimensional real space and points being agglomerated are replaced by a single (centroid) point. For this model, we have solved a class of special packing problems involving point-symmetric convex objects and have exploited it to design an efficient centroid clustering algorithm. Specifically, we describe a centroid SAHN clustering algorithm that requires O(n²) time, in the worst case, for fixed k and for a family of dissimilarity measures including the Manhattan, Euclidean, Chebychev and all other Minkowski metrics. This work was partially supported by the Natural Sciences and Engineering Research Council of Canada and by the Austrian Fonds zur Förderung der wissenschaftlichen Forschung.
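A compact Python sketch of the nearest-neighbor-chain idea for a reducible SAHN method (average linkage via the Lance-Williams update); this is a simplified illustration, not the paper's full algorithm.

```python
import numpy as np

def nn_chain_average_linkage(D):
    """Nearest-neighbor-chain sketch of a SAHN method (average linkage),
    which runs in O(n^2) time for 'reducible' update formulas. A chain
    of nearest neighbors is grown until a reciprocal pair is found, and
    that pair is merged via the Lance-Williams update."""
    D = D.astype(float).copy()
    n = D.shape[0]
    size = {i: 1 for i in range(n)}
    active = set(range(n))
    np.fill_diagonal(D, np.inf)
    merges, chain = [], []
    while len(active) > 1:
        if not chain:
            chain.append(next(iter(active)))
        while True:
            a = chain[-1]
            b = min((j for j in active if j != a), key=lambda j: D[a, j])
            if len(chain) > 1 and b == chain[-2]:    # reciprocal pair found
                break
            chain.append(b)
        a, b = chain.pop(), chain.pop()
        merges.append((a, b, D[a, b]))
        # Lance-Williams update for average linkage; merged cluster kept in slot a
        for j in active - {a, b}:
            D[a, j] = D[j, a] = (size[a] * D[a, j] + size[b] * D[b, j]) / (size[a] + size[b])
        size[a] += size[b]
        active.discard(b)
        D[b, :] = D[:, b] = np.inf
    return merges

rng = np.random.default_rng(4)
pts = rng.normal(size=(6, 2))
D0 = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(nn_chain_average_linkage(D0))   # list of (cluster, cluster, height) merges
```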

17.
A dissimilarity D on a finite set S is said to be Robinsonian if S can be totally ordered in such a way that, for every i < j < k, D(i, j) ≤ D(i, k) and D(j, k) ≤ D(i, k). Intuitively, D is Robinsonian if S can be represented by points on a line. Recognizing Robinsonian dissimilarities has many applications in seriation and classification. In this paper, we present an optimal O(n²) algorithm to recognize Robinsonian dissimilarities, where n is the cardinality of S. Our result improves on the previously known algorithms.
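For reference, the Robinson condition itself can be verified for a given order by brute force; the paper's contribution is finding such an order in optimal O(n²) time, whereas this direct check of the definition is O(n³) per candidate order.

```python
import numpy as np

def is_robinsonian_in_order(D, order):
    """Check the Robinson condition for a *given* total order: for every
    i < j < k in the order, D(i, j) <= D(i, k) and D(j, k) <= D(i, k)."""
    P = D[np.ix_(order, order)]
    n = len(order)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if P[i, j] > P[i, k] or P[j, k] > P[i, k]:
                    return False
    return True

# points on a line are Robinsonian in their sorted order
x = np.array([0.0, 1.0, 3.0, 7.0])
D = np.abs(x[:, None] - x[None, :])
print(is_robinsonian_in_order(D, [0, 1, 2, 3]))  # True
print(is_robinsonian_in_order(D, [2, 0, 3, 1]))  # False
```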

18.
Assume that a dissimilarity measure between elements and subsets of the set being clustered is given. We define the transformation of the set of subsets under which each subset is transformed into the set of all elements whose dissimilarity to it is not greater than a given threshold. A cluster is then defined as a fixed point of this transformation. Three well-known clustering strategies are considered from this point of view: hierarchical clustering, graph-theoretic methods, and conceptual clustering. For hierarchical clustering, generalizations are obtained that allow for overlapping clusters and/or clusters not forming a cover. Three properties of dissimilarity are introduced which guarantee the existence of fixed points for each threshold. We develop the relation to the theory of quasi-concave set functions, which gives an additional interpretation of clusters.
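A small sketch of the fixed-point view, assuming the point-to-set dissimilarity is the minimum over the set (a single-linkage flavor; the paper allows other choices):

```python
import numpy as np

def fixed_point_cluster(D, seed_set, t, max_iter=100):
    """Iterate the transformation T(A) = {x : d(x, A) <= t}, where the
    point-to-set dissimilarity d(x, A) is taken here as the minimum over
    A. A cluster is a set with T(A) = A, i.e., a fixed point."""
    A = set(seed_set)
    for _ in range(max_iter):
        B = {x for x in range(D.shape[0]) if min(D[x, a] for a in A) <= t}
        if B == A:
            return A            # fixed point reached: A is a cluster
        A = B
    return A

x = np.array([0.0, 0.5, 1.0, 5.0, 5.4])
D = np.abs(x[:, None] - x[None, :])
print(fixed_point_cluster(D, {0}, t=0.6))   # {0, 1, 2}
print(fixed_point_cluster(D, {3}, t=0.6))   # {3, 4}
```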

19.
Consider N entities to be classified (e.g., geographical areas), a matrix of dissimilarities between pairs of entities, and a graph H with vertices associated with these entities such that the edges join the vertices corresponding to contiguous entities. The split of a cluster is the smallest dissimilarity between an entity of this cluster and an entity outside of it. The single-linkage algorithm (ignoring contiguity between entities) provides partitions into M clusters for which the smallest split of the clusters, called the split of the partition, is maximum. We study here the partitioning of the set of entities into M connected clusters for all M between N - 1 and 2 (i.e., clusters such that the subgraphs of H induced by their corresponding sets of entities are connected) with maximum split subject to that condition. We first provide an exact algorithm with O(N²) complexity for the particular case in which H is a tree. This algorithm suggests in turn a first heuristic algorithm for the general problem. Several variants of this heuristic are also explored. We then present an exact algorithm for the general case based on iterative determination of cocycles of subtrees and on the solution of auxiliary set covering problems. As solution of the latter problems is time-consuming for large instances, we provide another heuristic in which the auxiliary set covering problems are solved approximately. Computational results obtained with the exact and heuristic algorithms are presented on test problems from the literature.
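The quantity being maximized here, the split of a partition, is easy to state in code; a short illustrative sketch:

```python
import numpy as np

def partition_split(D, labels):
    """Split of a partition: the smallest dissimilarity between two
    entities lying in different clusters (the quantity the algorithms
    in the paper maximize, subject to connectivity constraints)."""
    labels = np.asarray(labels)
    between = labels[:, None] != labels[None, :]   # mask of between-cluster pairs
    return D[between].min()

x = np.array([0.0, 0.2, 5.0, 5.3])
D = np.abs(x[:, None] - x[None, :])
print(partition_split(D, [0, 0, 1, 1]))   # 4.8, the smallest between-cluster gap
```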

20.
The location model is a useful tool in the parametric analysis of mixed continuous and categorical variables. In this model, the continuous variables are assumed to follow different multivariate normal distributions for each possible combination of categorical variable values. Using this model, a distance between two populations involving mixed variables can be defined. To date, however, no distributional results have been available against which to assess the outcomes of practical applications of this distance. The null distribution of the estimated distance is therefore considered in this paper, for a range of possible situations. No explicit analytical expressions are derived for this distribution, but easily implementable Monte Carlo schemes are described. These are then applied to previously cited examples.
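A generic Monte Carlo scheme of the kind described, using an estimated Mahalanobis distance between two continuous samples as a simple stand-in for the location-model distance; all names and modeling choices here are illustrative assumptions.

```python
import numpy as np

def null_distribution(statistic, simulate_null, n_sims=1000, seed=0):
    """Monte Carlo scheme: build the null distribution of an estimated
    distance by repeatedly simulating data under the null hypothesis of
    identical populations and re-estimating the distance."""
    rng = np.random.default_rng(seed)
    return np.array([statistic(*simulate_null(rng)) for _ in range(n_sims)])

def mahalanobis_sq(X, Y):
    """Estimated squared Mahalanobis distance between two samples, with
    a pooled covariance estimate (continuous-only stand-in for the
    location-model distance)."""
    S = ((len(X) - 1) * np.cov(X.T) + (len(Y) - 1) * np.cov(Y.T)) / (len(X) + len(Y) - 2)
    diff = X.mean(0) - Y.mean(0)
    return float(diff @ np.linalg.solve(S, diff))

sim = lambda rng: (rng.normal(size=(50, 3)), rng.normal(size=(50, 3)))
null = null_distribution(mahalanobis_sq, sim)
print(np.quantile(null, 0.95))   # e.g., a 5% critical value for the distance
```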
