首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 562 毫秒
1.
In agglomerative hierarchical clustering, pair-group methods suffer from a problem of non-uniqueness when two or more distances between different clusters coincide during the amalgamation process. The traditional approach for solving this drawback has been to take any arbitrary criterion in order to break ties between distances, which results in different hierarchical classifications depending on the criterion followed. In this article we propose a variable-group algorithm that consists in grouping more than two clusters at the same time when ties occur. We give a tree representation for the results of the algorithm, which we call a multidendrogram, as well as a generalization of the Lance andWilliams’ formula which enables the implementation of the algorithm in a recursive way. The authors thank A. Arenas for discussion and helpful comments. This work was partially supported by DGES of the Spanish Government Project No. FIS2006–13321–C02–02 and by a grant of Universitat Rovira i Virgili.  相似文献   

2.
In this paper, we consider several generalizations of the popular Ward’s method for agglomerative hierarchical clustering. Our work was motivated by clustering software, such as the R function hclust, which accepts a distance matrix as input and applies Ward’s definition of inter-cluster distance to produce a clustering. The standard version of Ward’s method uses squared Euclidean distance to form the distance matrix. We explore the effect on the clustering of using other definitions of distance, such as the Minkowski distance.  相似文献   

3.
Proportional link linkage (PLL) clustering methods are a parametric family of monotone invariant agglomerative hierarchical clustering methods. This family includes the single, minimedian, and complete linkage clustering methods as special cases; its members are used in psychological and ecological applications. Since the literature on clustering space distortion is oriented to quantitative input data, we adapt its basic concepts to input data with only ordinal significance and analyze the space distortion properties of PLL methods. To enable PLL methods to be used when the numbern of objects being clustered is large, we describe an efficient PLL algorithm that operates inO(n 2 logn) time andO(n 2) space.This work was partially supported by the Natural Sciences and Engineering Research Council of Canada and by the Austrian Fonds zur Förderung der wissenschaftlichen Forschung.  相似文献   

4.
Efficient algorithms for agglomerative hierarchical clustering methods   总被引:11,自引:4,他引:7  
Whenevern objects are characterized by a matrix of pairwise dissimilarities, they may be clustered by any of a number of sequential, agglomerative, hierarchical, nonoverlapping (SAHN) clustering methods. These SAHN clustering methods are defined by a paradigmatic algorithm that usually requires 0(n 3) time, in the worst case, to cluster the objects. An improved algorithm (Anderberg 1973), while still requiring 0(n 3) worst-case time, can reasonably be expected to exhibit 0(n 2) expected behavior. By contrast, we describe a SAHN clustering algorithm that requires 0(n 2 logn) time in the worst case. When SAHN clustering methods exhibit reasonable space distortion properties, further improvements are possible. We adapt a SAHN clustering algorithm, based on the efficient construction of nearest neighbor chains, to obtain a reasonably general SAHN clustering algorithm that requires in the worst case 0(n 2) time and space.Whenevern objects are characterized byk-tuples of real numbers, they may be clustered by any of a family of centroid SAHN clustering methods. These methods are based on a geometric model in which clusters are represented by points ink-dimensional real space and points being agglomerated are replaced by a single (centroid) point. For this model, we have solved a class of special packing problems involving point-symmetric convex objects and have exploited it to design an efficient centroid clustering algorithm. Specifically, we describe a centroid SAHN clustering algorithm that requires 0(n 2) time, in the worst case, for fixedk and for a family of dissimilarity measures including the Manhattan, Euclidean, Chebychev and all other Minkowski metrics.This work was partially supported by the Natural Sciences and Engineering Research Council of Canada and by the Austrian Fonds zur Förderung der wissenschaftlichen Forschung.  相似文献   

5.
This paper presents asymmetric agglomerative hierarchical clustering algorithms in an extensive view point. First, we develop a new updating formula for these algorithms, proposing a general framework to incorporate many algorithms. Next we propose measures to evaluate the fit of asymmetric clustering results to data. Then we demonstrate numerical examples with real data, using the new updating formula and the indices of fit. Discussing empirical findings, through the demonstrative examples, we show new insights into the asymmetric clustering.  相似文献   

6.
k consisting of k clusters, with k > 2. Bottom-up agglomerative approaches are also commonly used to construct partitions, and we discuss these in terms of worst-case performance for metric data sets. Our main contribution derives from a new restricted partition formulation that requires each cluster to be an interval of a given ordering of the objects being clustered. Dynamic programming can optimally split such an ordering into a partition Pk for a large class of objectives that includes min-diameter. We explore a variety of ordering heuristics and show that our algorithm, when combined with an appropriate ordering heuristic, outperforms traditional algorithms on both random and non-random data sets.  相似文献   

7.
An error variance approach to two-mode hierarchical clustering   总被引:2,自引:2,他引:0  
A new agglomerative method is proposed for the simultaneous hierarchical clustering of row and column elements of a two-mode data matrix. The procedure yields a nested sequence of partitions of the union of two sets of entities (modes). A two-mode cluster is defined as the union of subsets of the respective modes. At each step of the agglomerative process, the algorithm merges those clusters whose fusion results in the smallest possible increase in an internal heterogeneity measure. This measure takes into account both the variance within the respective cluster and its centroid effect defined as the squared deviation of its mean from the maximum entry in the input matrix. The procedure optionally yields an overlapping cluster solution by assigning further row and/or column elements to clusters existing at a preselected hierarchical level. Applications to real data sets drawn from consumer research concerning brand-switching behavior and from personality research concerning the interaction of behaviors and situations demonstrate the efficacy of the method at revealing the underlying two-mode similarity structure.  相似文献   

8.
The nearest neighbor interchange (nni) metric is a distance measure providing a quantitative measure of dissimilarity between two unrooted binary trees with labeled leaves. The metric has a transparent definition in terms of a simple transformation of binary trees, but its use in nontrivial problems is usually prevented by the absence of a computationally efficient algorithm. Since recent attempts to discover such an algorithm continue to be unsuccessful, we address the complementary problem of designing an approximation to the nni metric. Such an approximation should be well-defined, efficient to compute, comprehensible to users, relevant to applications, and a close fit to the nni metric; the challenge, of course, is to compromise these objectives in such a way that the final design is acceptable to users with practical and theoretical orientations. We describe an approximation algorithm that appears to satisfy adequately these objectives. The algorithm requires O(n) space to compute dissimilarity between binary trees withn labeled leaves; it requires O(n logn) time for rooted trees and O(n 2 logn) time for unrooted trees. To help the user interpret the dissimilarity measures based on this algorithm, we describe empirical distributions of dissimilarities between pairs of randomly selected trees for both rooted and unrooted cases.The Natural Sciences and Engineering Research Council of Canada partially supported this work with Grant A-4142.  相似文献   

9.
10.
Direct multicriteria clustering algorithms   总被引:1,自引:0,他引:1  
In a multicriteria clustering problem, optimization over more than one criterion is required. The problem can be treated in different ways: by reduction to a clustering problem with the single criterion obtained as a combination of the given criteria; by constrained clustering algorithms where a selected critetion is considered as the clustering criterion and all others determine the constraints; or by direct algorithms. In this paper two types of direct algorithms for solving multicriteria clustering problem are proposed: the modified relocation algorithm, and the modified agglomerative algorithm. Different elaborations of these two types of algorithms are discussed and compared. Finally, two applications of the proposed algorithms are presented. Elaborated version of the talks presented at the First Conference of the International Federation of Classification Societies, Aachen, 1987, at the International Conference on Social Science Methodology, Dubrovnik, 1988, and at the Second Conference of the International Federation of Classification Societies, Charlottesville, 1989. This work was supported in part by the Research Council of Slovenia.  相似文献   

11.
A new approach to isotonic agglomerative hierarchical clustering   总被引:1,自引:1,他引:0  
Hierarchical clustering methods must be isotonic for the construction of ultrametric. We present a general strategy to widen the class of isotonic methods implemented by agglomerative algorithms. At each step of the agglomeration we allow one of several admissible pairs to be chosen. Then under mild assumptions an appropriate definition of admissibility guarantees isotony. Moreover we consider the use of the new methods to compute locally optimal ultrametrics. Two examples demonstrate the ability to define new agglomerative methods superior to their traditional competitors.  相似文献   

12.
In the election of a hierarchical clustering method, theoretic properties may give some insight to determine which method is the most suitable to treat a clustering problem. Herein, we study some basic properties of two hierarchical clustering methods: α-unchaining single linkage or SL(α) and a modified version of this one, SL?(α). We compare the results with the properties satisfied by the classical linkage-based hierarchical clustering methods.  相似文献   

13.
Clique optimization (CLOPT) is a family of graph clustering procedures that construct parsimonious ultrametrics by executing a sequence of divisive and agglomerative operations. Every CLOPT procedure is associated with a distinct graph-partitioning heuristic. Seven HCS methods, a mathematical programming algorithm, and two CLOPT heuristics were evaluated on simulated data. These data were obtained by distorting ultrametric partitions and hierarchies. In general, internally optimal models yielded externally optimal models. By recovering near-optimal solutions more consistently, CLOPT2 emerged as the most robust technique.  相似文献   

14.
The main aim of this work is the study of clustering dependent data by means of copula functions. Copulas are popular multivariate tools whose importance within clustering methods has not been investigated yet in detail. We propose a new algorithm (CoClust in brief) that allows to cluster dependent data according to the multivariate structure of the generating process without any assumption on the margins. Moreover, the approach does not require either to choose a starting classification or to set a priori the number of clusters; in fact, the CoClust selects them by using a criterion based on the log–likelihood of a copula fit. We test our proposal on simulated data for different dependence scenarios and compare it with a model–based clustering technique. Finally, we show applications of the CoClust to real microarray data of breast-cancer patients.  相似文献   

15.
The Self-Organizing Feature Maps (SOFM; Kohonen 1984) algorithm is a well-known example of unsupervised learning in connectionism and is a clustering method closely related to the k-means. Generally the data set is available before running the algorithm and the clustering problem can be approached by an inertia criterion optimization. In this paper we consider the probabilistic approach to this problem. We propose a new algorithm based on the Expectation Maximization principle (EM; Dempster, Laird, and Rubin 1977). The new method can be viewed as a Kohonen type of EM and gives a better insight into the SOFM according to constrained clustering. We perform numerical experiments and compare our results with the standard Kohonen approach.  相似文献   

16.
MCLUST is a software package for model-based clustering, density estimation and discriminant analysis interfaced to the S-PLUS commercial software and the R language. It implements parameterized Gaussian hierarchical clustering algorithms and the EM algorithm for parameterized Gaussian mixture models with the possible addition of a Poisson noise term. Also included are functions that combine hierarchical clustering, EM and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering, density estimation, and discriminant analysis. MCLUST provides functionality for displaying and visualizing clustering and classification results. A web page with related links can be found at .  相似文献   

17.
One of the most important problems in classification is that of quantitative comparison of hierarchical trees. In this note we answer an open problem of Culík and Wood (1982) concerning the nearest neighbor interchange metric by proving that its underlying decision problem is NP-complete.  相似文献   

18.
Let G = (V,E,w) be a graph with vertex and edge sets V and E, respectively, and w:E → R + a function which assigns a positive weight or length to each edge of G. G is called a realization of a finite metric space (M,d), with M = { 1,...,n} if and only if { 1,...,n} ⫅ V and d(i,j) is equal to the length of the shortest chain linking i and j in G ∀ i,j = 1,...,n. A realization G of (M,d), is said optimal if the sum of its weights is minimal among all the realizations of (M,d). Consider a partition of M into two nonempty subsets K and L, and let e be an edge in a realization G of (M,d); we say that e is a bridge linking K with L if e belongs to all chains in G linking a vertex of K with a vertex of L. The Metric Bridge Partition Problem is to determine if the elements of a finite metric space (M,d) can be partitioned into two nonempty subsets K and L such that all optimal realizations of (M,d) contain a bridge linking K with L. We prove in this paper that this problem is polynomially solvable. We also describe an algorithm that constructs an optimal realization of (M,d) from optimal realizations of (K,d|K) and (L,d|L).  相似文献   

19.
A validation study of a variable weighting algorithm for cluster analysis   总被引:1,自引:0,他引:1  
De Soete (1986, 1988) proposed a variable weighting procedure when Euclidean distance is used as the dissimilarity measure with an ultrametric hierarchical clustering method. The algorithm produces weighted distances which approximate ultrametric distances as closely as possible in a least squares sense. The present simulation study examined the effectiveness of the De Soete procedure for an applications problem for which it was not originally intended. That is, to determine whether or not the algorithm can be used to reduce the influence of variables which are irrelevant to the clustering present in the data. The simulation study examined the ability of the procedure to recover a variety of known underlying cluster structures. The results indicate that the algorithm is effective in identifying extraneous variables which do not contribute information about the true cluster structure. Weights near 0.0 were typically assigned to such extraneous variables. Furthermore, the variable weighting procedure was not adversely effected by the presence of other forms of error in the data. In general, it is recommended that the variable weighting procedure be used for applied analyses when Euclidean distance is employed with ultrametric hierarchical clustering methods.  相似文献   

20.
This paper studies the random indexed dendograms produced by agglomerative hierarchical algorithms under the non-classifiability hypothesis of independent identically distributed (i.i.d.) dissimilarities. New tests for classifiability are deduced. The corresponding test statistics are random variables attached to the indexed dendrograms, such as the indices, the survival time of singletons, the value of the ultrametric between two given points, or the size of classes in the different levels of the dendogram. For an indexed dendogram produced by the Single Link method on i.i.d. dissimilarities, the distribution of these random variables is computed, thus leading to explicit tests. For the case of the Average and Complete Link methods, some asymptotic results are presented. The proofs rely essentially on the theory of random graphs.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号