首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 140 毫秒
1.
Single linkage clusters on a set of points are the maximal connected sets in a graph constructed by connecting all points closer than a given threshold distance. The complete set of single linkage clusters is obtained from all the graphs constructed using different threshold distances. The set of clusters forms a hierarchical tree, in which each non-singleton cluster divides into two or more subclusters; the runt size for each single linkage cluster is the number of points in its smallest subcluster. The maximum runt size over all single linkage clusters is our proposed test statistic for assessing multimodality. We give significance levels of the test for two null hypotheses, and consider its power against some bimodal alternatives. Research partially supported by NSF Grant No. DMS-8617919.  相似文献   

2.
A validation study of a variable weighting algorithm for cluster analysis   总被引:1,自引:0,他引:1  
De Soete (1986, 1988) proposed a variable weighting procedure when Euclidean distance is used as the dissimilarity measure with an ultrametric hierarchical clustering method. The algorithm produces weighted distances which approximate ultrametric distances as closely as possible in a least squares sense. The present simulation study examined the effectiveness of the De Soete procedure for an applications problem for which it was not originally intended. That is, to determine whether or not the algorithm can be used to reduce the influence of variables which are irrelevant to the clustering present in the data. The simulation study examined the ability of the procedure to recover a variety of known underlying cluster structures. The results indicate that the algorithm is effective in identifying extraneous variables which do not contribute information about the true cluster structure. Weights near 0.0 were typically assigned to such extraneous variables. Furthermore, the variable weighting procedure was not adversely effected by the presence of other forms of error in the data. In general, it is recommended that the variable weighting procedure be used for applied analyses when Euclidean distance is employed with ultrametric hierarchical clustering methods.  相似文献   

3.
Efficient algorithms for agglomerative hierarchical clustering methods   总被引:11,自引:4,他引:7  
Whenevern objects are characterized by a matrix of pairwise dissimilarities, they may be clustered by any of a number of sequential, agglomerative, hierarchical, nonoverlapping (SAHN) clustering methods. These SAHN clustering methods are defined by a paradigmatic algorithm that usually requires 0(n 3) time, in the worst case, to cluster the objects. An improved algorithm (Anderberg 1973), while still requiring 0(n 3) worst-case time, can reasonably be expected to exhibit 0(n 2) expected behavior. By contrast, we describe a SAHN clustering algorithm that requires 0(n 2 logn) time in the worst case. When SAHN clustering methods exhibit reasonable space distortion properties, further improvements are possible. We adapt a SAHN clustering algorithm, based on the efficient construction of nearest neighbor chains, to obtain a reasonably general SAHN clustering algorithm that requires in the worst case 0(n 2) time and space.Whenevern objects are characterized byk-tuples of real numbers, they may be clustered by any of a family of centroid SAHN clustering methods. These methods are based on a geometric model in which clusters are represented by points ink-dimensional real space and points being agglomerated are replaced by a single (centroid) point. For this model, we have solved a class of special packing problems involving point-symmetric convex objects and have exploited it to design an efficient centroid clustering algorithm. Specifically, we describe a centroid SAHN clustering algorithm that requires 0(n 2) time, in the worst case, for fixedk and for a family of dissimilarity measures including the Manhattan, Euclidean, Chebychev and all other Minkowski metrics.This work was partially supported by the Natural Sciences and Engineering Research Council of Canada and by the Austrian Fonds zur Förderung der wissenschaftlichen Forschung.  相似文献   

4.
An error variance approach to two-mode hierarchical clustering   总被引:2,自引:2,他引:0  
A new agglomerative method is proposed for the simultaneous hierarchical clustering of row and column elements of a two-mode data matrix. The procedure yields a nested sequence of partitions of the union of two sets of entities (modes). A two-mode cluster is defined as the union of subsets of the respective modes. At each step of the agglomerative process, the algorithm merges those clusters whose fusion results in the smallest possible increase in an internal heterogeneity measure. This measure takes into account both the variance within the respective cluster and its centroid effect defined as the squared deviation of its mean from the maximum entry in the input matrix. The procedure optionally yields an overlapping cluster solution by assigning further row and/or column elements to clusters existing at a preselected hierarchical level. Applications to real data sets drawn from consumer research concerning brand-switching behavior and from personality research concerning the interaction of behaviors and situations demonstrate the efficacy of the method at revealing the underlying two-mode similarity structure.  相似文献   

5.
Clustering techniques are based upon a dissimilarity or distance measure between objects and clusters. This paper focuses on the simplex space, whose elements??compositions??are subject to non-negativity and constant-sum constraints. Any data analysis involving compositions should fulfill two main principles: scale invariance and subcompositional coherence. Among fuzzy clustering methods, the FCM algorithm is broadly applied in a variety of fields, but it is not well-behaved when dealing with compositions. Here, the adequacy of different dissimilarities in the simplex, together with the behavior of the common log-ratio transformations, is discussed in the basis of compositional principles. As a result, a well-founded strategy for FCM clustering of compositions is suggested. Theoretical findings are accompanied by numerical evidence, and a detailed account of our proposal is provided. Finally, a case study is illustrated using a nutritional data set known in the clustering literature.  相似文献   

6.
A clustering that consists of a nested set of clusters may be represented graphically by a tree. In contrast, a clustering that includes non-nested overlapping clusters (sometimes termed a “nonhierarchical” clustering) cannot be represented by a tree. Graphical representations of such non-nested overlapping clusterings are usually complex and difficult to interpret. Carroll and Pruzansky (1975, 1980) suggested representing non-nested clusterings with multiple ultrametric or additive trees. Corter and Tversky (1986) introduced the extended tree (EXTREE) model, which represents a non-nested structure as a tree plus overlapping clusters that are represented by marked segments in the tree. We show here that the problem of finding a nested (i.e., tree-structured) set of clusters in an overlapping clustering can be reformulated as the problem of finding a clique in a graph. Thus, clique-finding algorithms can be used to identify sets of clusters in the solution that can be represented by trees. This formulation provides a means of automatically constructing a multiple tree or extended tree representation of any non-nested clustering. The method, called “clustrees”, is applied to several non-nested overlapping clusterings derived using the MAPCLUS program (Arabie and Carroll 1980).  相似文献   

7.
ConsiderN entities to be classified, with given weights, and a matrix of dissimilarities between pairs of them. The split of a cluster is the smallest dissimilarity between an entity in that cluster and an entity outside it. The single-linkage algorithm provides partitions intoM clusters for which the smallest split is maximum. We consider the problems of finding maximum split partitions with exactlyM clusters and with at mostM clusters subject to the additional constraint that the sum of the weights of the entities in each cluster never exceeds a given bound. These two problems are shown to be NP-hard and reducible to a sequence of bin-packing problems. A (N 2) algorithm for the particular caseM =N of the second problem is also presented. Computational experience is reported.Acknowledgments: Work of the first author was supported in part by AFOSR grants 0271 and 0066 to Rutgers University and was done in part during a visit to GERAD, Ecole Polytechnique de Montréal, whose support is gratefully acknowledged. Work of the second and third authors was supported by NSERC grant GP0036426 and by FCAR grant 89EQ4144. We are grateful to Silvano Martello and Paolo Toth for making available to us their program MTP for the bin-paking problem and to three anonymous referees for comments which helped to improve the presentation of the paper.  相似文献   

8.
X is the automatic hierarchical classification of one mode (units or variables or occasions) of X on the basis of the other two. In this paper the case of OMC of units according to variables and occasions is discussed. OMC is the synthesis of a set of hierarchical classifications Delta obtained from X; e.g., the OMC of units is the consensus (synthesis) among the set of dendograms individually defined by clustering units on the basis of variables, separately for each given occasion of X. However, because Delta is often formed by a large number of classifications, it may be unrealistic that a single synthesis is representative of the entire set. In this case, subsets of similar (homegeneous) dendograms may be found in Delta so that a consensus representative of each subset may be identified. This paper proposes, PARtition and Least Squares Consensus cLassifications Analysis (PARLSCLA) of a set of r hierarchical classifications Delta. PARLSCLA identifies the best least-squares partition of Delta into m (1 <= m <= r) subsets of homogeneous dendograms and simultaneously detects the closest consensus classification (a median classification called Least Squares Consensus Dendogram (LSCD) for each subset. PARLSCLA is a generalization of the problem to find a least-squares consensus dendogram for Delta. PARLSCLA is formalized as a mixed-integer programming problem and solved with an iterative, two-step algorithm. The method proposed is applied to an empirical data set.  相似文献   

9.
Piecewise hierarchical clustering   总被引:1,自引:0,他引:1  
We consider two or more ultrametric distance matrices defined over different, possibly overlapping, subsets. These matrices are merged into one ultrametric matrix defined over the whole set. Necessary and sufficient conditions for uniqueness of the merging are established. when these conditions are not satisfied, consistent algorithms are given.The author thanks B. Monjardet and two anonymous referees for their helpful comments.  相似文献   

10.
Two algorithms for pyramidal classification — a generalization of hierarchical classification — are presented that can work with incomplete dissimilarity data. These approaches — a modification of the pyramidal ascending classification algorithm and a least squares based penalty method — are described and compared using two different types of complete dissimilarity data in which randomly chosen dissimilarities are assumed missing and the non-missing ones are subjected to random error. We also consider relationships between hierarchical classification and pyramidal classification solutions when both are based on incomplete dissimilarity data.  相似文献   

11.
When clustering asymmetric proximity data, only the average amounts are often considered by assuming that the asymmetry is due to noise. But when the asymmetry is structural, as typically may happen for exchange flows, migration data or confusion data, this may strongly affect the search for the groups because the directions of the exchanges are ignored and not integrated in the clustering process. The clustering model proposed here relies on the decomposition of the asymmetric dissimilarity matrix into symmetric and skew-symmetric effects both decomposed in within and between cluster effects. The classification structures used here are generally based on two different partitions of the objects fitted to the symmetric and the skew-symmetric part of the data, respectively; the restricted case is also presented where the partition fits jointly both of them allowing for clusters of objects similar with respect to the average amounts and directions of the data. Parsimonious models are presented which allow for effective and simple graphical representations of the results.  相似文献   

12.
This paper proposes a maximum clustering similarity (MCS) method for determining the number of clusters in a data set by studying the behavior of similarity indices comparing two (of several) clustering methods. The similarity between the two clusterings is calculated at the same number of clusters, using the indices of Rand (R), Fowlkes and Mallows (FM), and Kulczynski (K) each corrected for chance agreement. The number of clusters at which the index attains its maximum is a candidate for the optimal number of clusters. The proposed method is applied to simulated bivariate normal data, and further extended for use in circular data. Its performance is compared to the criteria discussed in Tibshirani, Walther, and Hastie (2001). The proposed method is not based on any distributional or data assumption which makes it widely applicable to any type of data that can be clustered using at least two clustering algorithms.  相似文献   

13.
Block-Relaxation Approaches for Fitting the INDCLUS Model   总被引:1,自引:1,他引:0  
A well-known clustering model to represent I?×?I?×?J data blocks, the J frontal slices of which consist of I?×?I object by object similarity matrices, is the INDCLUS model. This model implies a grouping of the I objects into a prespecified number of overlapping clusters, with each cluster having a slice-specific positive weight. An INDCLUS model is fitted to a given data set by means of minimizing a least squares loss function. The minimization of this loss function has appeared to be a difficult problem for which several algorithmic strategies have been proposed. At present, the best available option seems to be the SYMPRES algorithm, which minimizes the loss function by means of a block-relaxation algorithm. Yet, SYMPRES is conjectured to suffer from a severe local optima problem. As a way out, based on theoretical results with respect to optimally designing block-relaxation algorithms, five alternative block-relaxation algorithms are proposed. In a simulation study it appears that the alternative algorithms with overlapping parameter subsets perform best and clearly outperform SYMPRES in terms of optimization performance and cluster recovery.  相似文献   

14.
In this paper, we propose a bicriterion objective function for clustering a given set ofN entities, which minimizes [d–(1–)s], where 01, andd ands are the diameter and the split of the clustering, respectively. When =1, the problem reduces to minimum diameter clustering, and when =0, maximum split clustering. We show that this objective provides an effective way to compromise between the two often conflicting criteria. While the problem is NP-hard in general, a polynomial algorithm with the worst-case time complexityO(N 2) is devised to solve the bipartition version. This algorithm actually gives all the Pareto optimal bipartitions with respect to diameter and split, and it can be extended to yield an efficient divisive hierarchical scheme. An extension of the approach to the objective [(d 1+d 2)–2(1–)s] is also proposed, whered 1 andd 2 are diameters of the two clusters of a bipartition.This research was supported in part by the National Science and Engineering Research Council of Canada (Grant OGP 0104900). The authors wish to thank two anonymous referees, whose detailed comments on earlier drafts improved the paper.  相似文献   

15.
One key point in cluster analysis is to determine a similarity or dissimilarity measure between data objects. When working with time series, the concept of similarity can be established in different ways. In this paper, several non-parametric statistics originally designed to test the equality of the log-spectra of two stochastic processes are proposed as dissimilarity measures between time series data. Their behavior in time series clustering is analyzed throughout a simulation study, and compared with the performance of several model-free and model-based dissimilarity measures. Up to three different classification settings were considered: (i) to distinguish between stationary and non-stationary time series, (ii) to classify different ARMA processes and (iii) to classify several non-linear time series models. As it was expected, the performance of a particular dissimilarity metric strongly depended on the type of processes subjected to clustering. Among all the measures studied, the nonparametric distances showed the most robust behavior.  相似文献   

16.
Consider N entities to be classified (e.g., geographical areas), a matrix of dissimilarities between pairs of entities, a graph H with vertices associated with these entities such that the edges join the vertices corresponding to contiguous entities. The split of a cluster is the smallest dissimilarity between an entity of this cluster and an entity outside of it. The single-linkage algorithm (ignoring contiguity between entities) provides partitions into M clusters for which the smallest split of the clusters, called split of the partition, is maximum. We study here the partitioning of the set of entities into M connected clusters for all M between N - 1 and 2 (i.e., clusters such that the subgraphs of H induced by their corresponding sets of entities are connected) with maximum split subject to that condition. We first provide an exact algorithm with a (N2) complexity for the particular case in which H is a tree. This algorithm suggests in turn a first heuristic algorithm for the general problem. Several variants of this heuristic are Also explored. We then present an exact algorithm for the general case based on iterative determination of cocycles of subtrees and on the solution of auxiliary set covering problems. As solution of the latter problems is time-consuming for large instances, we provide another heuristic in which the auxiliary set covering problems are solved approximately. Computational results obtained with the exact and heuristic algorithms are presented on test problems from the literature.  相似文献   

17.
术语定义的领域聚类是一项较新的研究课题。本文采用自下而上的层级聚类的方法,基于知网进行语义相似度计算,并根据不同词类对领域区分的贡献度以及构建领域聚类特有的停用词表来进行聚类的特征项选取,实现了术语定义的领域聚类。实验取得了较好的聚类结果。  相似文献   

18.
Minimum sum of diameters clustering   总被引:1,自引:1,他引:0  
The problem of determining a partition of a given set ofN entities intoM clusters such that the sum of the diameters of these clusters is minimum has been studied by Brucker (1978). He proved that it is NP-complete forM3 and mentioned that its complexity was unknown forM=2. We provide anO(N 3 logN) algorithm for this latter case. Moreover, we show that determining a partition into two clusters which minimizes any given function of the diameters can be done inO(N 5) time.Acknowledgments: This research was supported by the Air Force Office of Scientific Research Grant AFOSR 0271 to Rutgers University. We are grateful to Yves Crama for several insightful remarks and to an anonymous referee for detailed comments.  相似文献   

19.
The problem of measuring the impact of individual data points in a cluster analysis is examined. The purpose is to identify those data points that have an influence on the resulting cluster partitions. Influence of a single data point is considered present when different cluster partitions result from the removal of the element from the data set. The Hubert and Arabie (1985) corrected Rand index was used to provide numerical measures of influence of a data point. Simulated data sets consisting of a variety of cluster structures and error conditions were generated to validate the influence measures. The results showed that the measure of internal influence was 100% accurate in identifying those data elements exhibiting an influential effect. The nature of the influence, whether beneficial or detrimental to the clustering, can be evaluated with the use of the gamma and point-biserial statistics.  相似文献   

20.
This paper proposes a measure of spatial homogeneity for sets of d-dimensional points based on nearest neighbor distances. Tests for spatial uniformity are examined which assess the tendency of the entire data set to aggregate and evaluate the character of individual clusters. The sizes and powers of three statistical tests of uniformity against aggregation, regularity, and unimodality are studied to determine robustness. The paper also studies the effects of normalization and incorrect prior information. A percentile frame sampling procedure is proposed that does not require a sampling window but is superior to a toroidal frame and to buffer zone sampling in particular situations. Examples test two data sets for homogeneity and search the results of a hierarchical clustering for homogeneous clusters.This work was partially supported by NSF Grant ECS-8300204.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号