Similar documents
Found 20 similar documents (search time: 59 ms)
1.
Minimum sum of diameters clustering   Total citations: 1 (self-citations: 1, citations by others: 0)
The problem of determining a partition of a given set of $N$ entities into $M$ clusters such that the sum of the diameters of these clusters is minimum has been studied by Brucker (1978). He proved that it is NP-complete for $M \geq 3$ and mentioned that its complexity was unknown for $M = 2$. We provide an $O(N^3 \log N)$ algorithm for this latter case. Moreover, we show that determining a partition into two clusters which minimizes any given function of the diameters can be done in $O(N^5)$ time. Acknowledgments: This research was supported by the Air Force Office of Scientific Research Grant AFOSR 0271 to Rutgers University. We are grateful to Yves Crama for several insightful remarks and to an anonymous referee for detailed comments.
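To make the objective concrete, here is a minimal brute-force sketch for the case $M = 2$. It is not the $O(N^3 \log N)$ algorithm of the abstract: it simply enumerates all bipartitions (exponential time) and is only meant to illustrate what "sum of diameters" measures. The function names and the dissimilarity matrix layout (a list of lists `d`) are assumptions made for this illustration.

```python
from itertools import combinations

def diameter(cluster, d):
    """Largest dissimilarity between two entities of a cluster (0 for a singleton)."""
    return max((d[i][j] for i, j in combinations(cluster, 2)), default=0)

def min_sum_of_diameters_bipartition(d):
    """Exhaustive search over all bipartitions; exponential, illustration only."""
    n = len(d)
    best = None
    # Entity 0 always stays in the first cluster so each bipartition is seen once.
    for mask in range(1, 2 ** (n - 1)):
        second = [i for i in range(1, n) if (mask >> (i - 1)) & 1]
        first = [i for i in range(n) if i not in second]
        total = diameter(first, d) + diameter(second, d)
        if best is None or total < best[0]:
            best = (total, first, second)
    return best
```

For four points on a line at positions 0, 1, 10 and 11, this search separates the two close pairs.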

2.
Consider $N$ entities to be classified, with given weights, and a matrix of dissimilarities between pairs of them. The split of a cluster is the smallest dissimilarity between an entity in that cluster and an entity outside it. The single-linkage algorithm provides partitions into $M$ clusters for which the smallest split is maximum. We consider the problems of finding maximum split partitions with exactly $M$ clusters and with at most $M$ clusters subject to the additional constraint that the sum of the weights of the entities in each cluster never exceeds a given bound. These two problems are shown to be NP-hard and reducible to a sequence of bin-packing problems. A $\Theta(N^2)$ algorithm for the particular case $M = N$ of the second problem is also presented. Computational experience is reported. Acknowledgments: Work of the first author was supported in part by AFOSR grants 0271 and 0066 to Rutgers University and was done in part during a visit to GERAD, Ecole Polytechnique de Montréal, whose support is gratefully acknowledged. Work of the second and third authors was supported by NSERC grant GP0036426 and by FCAR grant 89EQ4144. We are grateful to Silvano Martello and Paolo Toth for making available to us their program MTP for the bin-packing problem and to three anonymous referees for comments which helped to improve the presentation of the paper.

3.
Maximum sum-of-splits clustering   Total citations: 1 (self-citations: 1, citations by others: 0)
Consider $N$ entities to be classified, and a matrix of dissimilarities between pairs of them. The split of a cluster is the smallest dissimilarity between an entity of this cluster and an entity outside it. The single-linkage algorithm provides partitions into $M$ clusters for which the smallest split is maximum. We study here the average split of the clusters or, equivalently, the sum of splits. A $\Theta(N^2)$ algorithm is provided to determine maximum sum-of-splits partitions into $M$ clusters for all $M$ between $N - 1$ and 2, using the dual graph of the single-linkage dendrogram.
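The split and sum-of-splits quantities defined in this abstract can be sketched directly from their definitions. This is only an evaluation of the criterion for a given partition, not the $\Theta(N^2)$ algorithm based on the single-linkage dendrogram; the function names and the list-of-lists matrix `d` are assumptions made for this illustration.

```python
def split(cluster, d):
    """Smallest dissimilarity between an entity inside the cluster and one outside it."""
    inside = set(cluster)
    outside = [j for j in range(len(d)) if j not in inside]
    return min(d[i][j] for i in inside for j in outside)

def sum_of_splits(partition, d):
    """Sum of the splits of all clusters of a partition (at least two clusters)."""
    return sum(split(cluster, d) for cluster in partition)
```

On four points at positions 0, 1, 10 and 11, the partition into the two close pairs has a larger sum of splits than a partition that isolates the points 0 and 1 individually.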

4.
Divisive hierarchical clustering algorithms with the diameter criterion proceed by recursively selecting the cluster with largest diameter and partitioning it into two clusters whose largest diameter is the smallest possible. We provide two such algorithms with complexities $O(\bar{M} N^2)$ and $O(N^2 \log N)$ respectively, where $\bar{M}$ denotes the maximum number of clusters in a partition and $N$ the number of entities to be clustered. The former algorithm, an efficient implementation of an algorithm of Hubert, makes it possible to find all partitions into at most $\bar{M}$ clusters and is in $O(N^2)$ for fixed $\bar{M}$. Moreover, if in each partitioning the size of the largest cluster is bounded by $p$ times the number of entities in the set to be partitioned, with $1/2 \leq p < 1$, it provides a complete hierarchy of partitions in $O(N^2 \log N)$ time. The latter algorithm, a refinement of an algorithm of Rao, makes it possible to build a complete hierarchy of partitions in $O(N^2 \log N)$ time without any restriction. Comparative computational experiments with both algorithms and with an agglomerative hierarchical algorithm of Benzécri are reported.

5.
On some significance tests in cluster analysis   Total citations: 1 (self-citations: 1, citations by others: 0)
We investigate the properties of several significance tests for distinguishing between the hypothesis $H$ of a homogeneous population and an alternative $A$ involving clustering or heterogeneity, with emphasis on the case of multidimensional observations $x_1, \ldots, x_n \in \mathbb{R}^p$. Four types of test statistics are considered: the ($s$-th) largest gap between observations, their mean distance (or similarity), the minimum within-cluster sum of squares resulting from a k-means algorithm, and the resulting maximum F statistic. The asymptotic distributions under $H$ are given for $n \to \infty$ and the asymptotic power of the tests is derived for neighboring alternatives.

6.
A class of (multiple) consensus methods for n-trees (dendroids, hierarchical classifications) is studied. This class constitutes an extension of the so-called median consensus in the sense that we get two numbers $m$ and $m'$ such that: if a cluster $X$ occurs in $k$ n-trees of a profile $P$, with $k \geq m'$, then it occurs in every consensus n-tree of $P$; if $X$ occurs in $k$ n-trees of $P$, with $m \leq k < m'$, then it may, or may not, belong to a consensus n-tree of $P$; if $X$ occurs in $k$ n-trees of $P$, with $k < m$, then it cannot occur in any consensus n-tree of $P$. If these conditions are satisfied, the multiconsensus function is said to be thresholded by the pair $(m, m')$. Two results are obtained. The first one characterizes the pairs of numbers that can be viewed as thresholds for some consensus function. The second one provides a characterization of thresholded consensus methods. As an application, a characterization of the quota rules is provided.

7.
In this paper, we propose a bicriterion objective function for clustering a given set of $N$ entities, which minimizes $[\alpha d - (1 - \alpha) s]$, where $0 \leq \alpha \leq 1$, and $d$ and $s$ are the diameter and the split of the clustering, respectively. When $\alpha = 1$, the problem reduces to minimum diameter clustering, and when $\alpha = 0$, to maximum split clustering. We show that this objective provides an effective way to compromise between the two often conflicting criteria. While the problem is NP-hard in general, a polynomial algorithm with worst-case time complexity $O(N^2)$ is devised to solve the bipartition version. This algorithm actually gives all the Pareto optimal bipartitions with respect to diameter and split, and it can be extended to yield an efficient divisive hierarchical scheme. An extension of the approach to the objective $[\alpha (d_1 + d_2) - 2 (1 - \alpha) s]$ is also proposed, where $d_1$ and $d_2$ are the diameters of the two clusters of a bipartition. This research was supported in part by the National Science and Engineering Research Council of Canada (Grant OGP 0104900). The authors wish to thank two anonymous referees, whose detailed comments on earlier drafts improved the paper.
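As a rough illustration of the bicriterion objective, the sketch below evaluates $\alpha d - (1 - \alpha) s$ for a given partition. It assumes that the diameter of a clustering is the largest cluster diameter and that its split is the smallest cluster split; it does not implement the $O(N^2)$ bipartition algorithm. All function names are hypothetical.

```python
from itertools import combinations

def diameter(cluster, d):
    """Largest dissimilarity inside a cluster (0 for a singleton)."""
    return max((d[i][j] for i, j in combinations(cluster, 2)), default=0)

def split(cluster, d):
    """Smallest dissimilarity between the cluster and the remaining entities."""
    inside = set(cluster)
    return min(d[i][j] for i in inside for j in range(len(d)) if j not in inside)

def bicriterion(partition, d, alpha):
    """alpha * (largest diameter) - (1 - alpha) * (smallest split); lower is better."""
    largest_diameter = max(diameter(c, d) for c in partition)
    smallest_split = min(split(c, d) for c in partition)
    return alpha * largest_diameter - (1 - alpha) * smallest_split
```

Setting `alpha` to 1 leaves only the diameter term and setting it to 0 leaves only the (negated) split term, which mirrors the two limiting cases named in the abstract.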

8.
9.
The median procedure for n-trees   Total citations: 2 (self-citations: 2, citations by others: 0)
Let $(X, d)$ be a metric space. The function $M: X^k \to 2^X$ defined by $M(x_1, \ldots, x_k) = \{x \in X : \sum_{i=1}^{k} d(x, x_i) \text{ is minimum}\}$ is called the median procedure and has been found useful in various applications involving the notion of consensus. Here we present axioms that characterize $M$ when $X$ is a certain class of trees (hierarchical classifications), and $d$ is the symmetric difference metric. We would like to thank the referees and Editor for helpful comments.
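The median procedure can be sketched for a finite candidate set as follows. Here each tree is represented only by its set of clusters, and the distance is the size of the symmetric difference of two cluster sets. The candidate list is supplied by the caller and is not checked to consist of valid n-trees; that simplification, and all names below, are assumptions of this sketch.

```python
def symmetric_difference_distance(a, b):
    """Number of clusters that belong to exactly one of the two cluster sets."""
    return len(a ^ b)

def median_procedure(profile, candidates, dist=symmetric_difference_distance):
    """Return every candidate whose total distance to the profile is minimum."""
    totals = [sum(dist(x, p) for p in profile) for x in candidates]
    best = min(totals)
    return [x for x, total in zip(candidates, totals) if total == best]
```

The result is a list because the minimum may be attained by several candidates, which is why the procedure maps into $2^X$ rather than into $X$.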

10.
Consider $N$ entities to be classified (e.g., geographical areas), a matrix of dissimilarities between pairs of entities, and a graph $H$ with vertices associated with these entities such that the edges join the vertices corresponding to contiguous entities. The split of a cluster is the smallest dissimilarity between an entity of this cluster and an entity outside of it. The single-linkage algorithm (ignoring contiguity between entities) provides partitions into $M$ clusters for which the smallest split of the clusters, called the split of the partition, is maximum. We study here the partitioning of the set of entities into $M$ connected clusters for all $M$ between $N - 1$ and 2 (i.e., clusters such that the subgraphs of $H$ induced by their corresponding sets of entities are connected) with maximum split subject to that condition. We first provide an exact algorithm with a $\Theta(N^2)$ complexity for the particular case in which $H$ is a tree. This algorithm suggests in turn a first heuristic algorithm for the general problem. Several variants of this heuristic are also explored. We then present an exact algorithm for the general case based on iterative determination of cocycles of subtrees and on the solution of auxiliary set covering problems. As solution of the latter problems is time-consuming for large instances, we provide another heuristic in which the auxiliary set covering problems are solved approximately. Computational results obtained with the exact and heuristic algorithms are presented on test problems from the literature.

11.
Clustering with a criterion which minimizes the sum of squared distances to cluster centroids is usually done in a heuristic way. An exact polynomial algorithm, with a complexity in $O(N^{p+1} \log N)$, is proposed for minimum sum of squares hierarchical divisive clustering of points in a $p$-dimensional space with small $p$. Empirical complexity is one order of magnitude lower. Data sets with $N = 20000$ for $p = 2$, $N = 1000$ for $p = 3$, and $N = 200$ for $p = 4$ are clustered in a reasonable computing time.
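The criterion itself (sum of squared distances to cluster centroids) can be evaluated with a few lines. This sketch only computes the objective for a given partition of points; it is not the exact $O(N^{p+1} \log N)$ divisive algorithm, and the function names are hypothetical.

```python
def within_sum_of_squares(points):
    """Sum of squared Euclidean distances from the points to their centroid."""
    p = len(points[0])
    centroid = [sum(x[k] for x in points) / len(points) for k in range(p)]
    return sum((x[k] - centroid[k]) ** 2 for x in points for k in range(p))

def partition_sum_of_squares(clusters):
    """Total criterion value of a partition given as a list of point lists."""
    return sum(within_sum_of_squares(cluster) for cluster in clusters)
```

A divisive scheme would compare this value across candidate bipartitions of the cluster being divided and keep the smallest.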

12.
13.
Optimal algorithms for comparing trees with labeled leaves   Total citations: 2 (self-citations: 1, citations by others: 1)
Let $R_n$ denote the set of rooted trees with $n$ leaves in which: the leaves are labeled by the integers in $\{1, \ldots, n\}$; and among interior vertices only the root may have degree two. Associated with each interior vertex $v$ in such a tree is the subset, or cluster, of leaf labels in the subtree rooted at $v$. Cluster $\{1, \ldots, n\}$ is called trivial. Clusters are used in quantitative measures of similarity, dissimilarity and consensus among trees. For any $k$ trees in $R_n$, the strict consensus tree $C(T_1, \ldots, T_k)$ is that tree in $R_n$ containing exactly those clusters common to every one of the $k$ trees. Similarity between trees $T_1$ and $T_2$ in $R_n$ is measured by the number $S(T_1, T_2)$ of nontrivial clusters in both $T_1$ and $T_2$; dissimilarity, by the number $D(T_1, T_2)$ of clusters in $T_1$ or $T_2$ but not in both. Algorithms are known to compute $C(T_1, \ldots, T_k)$ in $O(kn^2)$ time, and $S(T_1, T_2)$ and $D(T_1, T_2)$ in $O(n^2)$ time. I propose a special representation of the clusters of any tree $T \in R_n$, one that permits testing in constant time whether a given cluster exists in $T$. I describe algorithms that exploit this representation to compute $C(T_1, \ldots, T_k)$ in $O(kn)$ time, and $S(T_1, T_2)$ and $D(T_1, T_2)$ in $O(n)$ time. These algorithms are optimal in a technical sense. They enable well-known indices of consensus between two trees to be computed in $O(n)$ time. All these results apply as well to comparable problems involving unrooted trees with labeled leaves. The Natural Sciences and Engineering Research Council of Canada partially supported this work with grant A-4142.
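The three quantities defined above can be written down directly if each tree is represented by its set of clusters. The sketch below uses Python sets of frozensets, so membership tests are hash lookups. This is not the special cluster representation proposed in the abstract and does not achieve its $O(n)$ bounds, since hashing a cluster already costs time proportional to its size. The function names are assumptions of this illustration.

```python
def strict_consensus(trees):
    """Clusters common to every tree; each tree is a set of frozenset clusters."""
    return set.intersection(*trees)

def similarity(t1, t2, n):
    """S(T1, T2): number of nontrivial clusters present in both trees."""
    trivial = frozenset(range(1, n + 1))
    return len((t1 & t2) - {trivial})

def dissimilarity(t1, t2):
    """D(T1, T2): number of clusters present in exactly one of the two trees."""
    return len(t1 ^ t2)
```

For two trees on leaves 1 to 4 that share only the trivial cluster and the cluster {1, 2}, the strict consensus consists of exactly those two clusters.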

14.
Optimization Strategies for Two-Mode Partitioning   Total citations: 2 (self-citations: 2, citations by others: 0)
Two-mode partitioning is a relatively new form of clustering that clusters both rows and columns of a data matrix. In this paper, we consider deterministic two-mode partitioning methods in which a criterion similar to k-means is optimized. A variety of optimization methods have been proposed for this type of problem. However, it is still unclear which method should be used, as various methods may lead to non-global optima. This paper reviews and compares several optimization methods for two-mode partitioning. Several known methods are discussed, and a new fuzzy steps method is introduced. The fuzzy steps method is based on the fuzzy c-means algorithm of Bezdek (1981) and the fuzzy steps approach of Heiser and Groenen (1997) and Groenen and Jajuga (2001). The performances of all methods are compared in a large simulation study. In our simulations, a two-mode k-means optimization method most often gives the best results. Finally, an empirical data set is used to give a practical example of two-mode partitioning. We would like to thank two anonymous referees whose comments have improved the quality of this paper. We are also grateful to Peter Verhoef for providing the data set used in this paper.
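One common reading of "a criterion similar to k-means" for two-mode partitioning is the sum of squared deviations of each matrix entry from the mean of the block (row cluster, column cluster) it falls in. The sketch below evaluates that criterion for given row and column labels. Whether this is exactly the criterion used in the paper is an assumption, as are the function and parameter names.

```python
def two_mode_criterion(x, row_labels, col_labels):
    """Sum of squared deviations of entries from the mean of their block."""
    blocks = {}
    for i, r in enumerate(row_labels):
        for j, c in enumerate(col_labels):
            blocks.setdefault((r, c), []).append(x[i][j])
    total = 0.0
    for values in blocks.values():
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total
```

An optimization method for two-mode partitioning would search over `row_labels` and `col_labels` to make this value as small as possible.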

15.
This paper proposes a measure of spatial homogeneity for sets of $d$-dimensional points based on nearest neighbor distances. Tests for spatial uniformity are examined which assess the tendency of the entire data set to aggregate and evaluate the character of individual clusters. The sizes and powers of three statistical tests of uniformity against aggregation, regularity, and unimodality are studied to determine robustness. The paper also studies the effects of normalization and incorrect prior information. A percentile frame sampling procedure is proposed that does not require a sampling window but is superior to a toroidal frame and to buffer zone sampling in particular situations. Examples test two data sets for homogeneity and search the results of a hierarchical clustering for homogeneous clusters. This work was partially supported by NSF Grant ECS-8300204.

16.
A mathematical programming algorithm is developed for fitting ultrametric or additive trees to proximity data where external constraints are imposed on the topology of the tree. The two procedures minimize a least squares loss function. The method is illustrated on both synthetic and real data. A constrained ultrametric tree analysis was performed on similarities between 32 subjects based on preferences for ten odors, while a constrained additive tree analysis was carried out on some proximity data between kinship terms. Finally, some extensions of the methodology to other tree fitting procedures are mentioned. The first author is supported as Bevoegdverklaard Navorser of the Belgian Nationaal Fonds voor Wetenschappelijk Onderzoek.

17.
Ultrametric tree representations of incomplete dissimilarity data   Total citations: 2 (self-citations: 2, citations by others: 0)
The least squares algorithm for fitting ultrametric trees to proximity data originally proposed by Carroll and Pruzansky and further elaborated by De Soete is extended to handle missing data. A Monte Carlo evaluation reveals that the algorithm is capable of recovering an ultrametric tree underlying an incomplete set of error-perturbed dissimilarities quite well. Geert De Soete is Aangesteld Navorser of the Belgian National Fonds voor Wetenschappelijk Onderzoek.
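For readers unfamiliar with the term, a dissimilarity matrix is ultrametric when $d(i, j) \leq \max(d(i, k), d(k, j))$ holds for every triple of objects. The sketch below checks that condition on a complete matrix. It is only a check of the defining inequality, not the least squares fitting algorithm of the abstract, and it does not handle missing entries; the function name and tolerance parameter are hypothetical.

```python
from itertools import permutations

def is_ultrametric(d, tol=1e-12):
    """True if d[i][j] <= max(d[i][k], d[k][j]) for all distinct i, j, k."""
    n = len(d)
    return all(
        d[i][j] <= max(d[i][k], d[k][j]) + tol
        for i, j, k in permutations(range(n), 3)
    )
```

Equivalently, in every triple the two largest dissimilarities must be equal, which is the property a fitted ultrametric tree enforces.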

18.
The “DNA is a program” metaphor is still widely used in Molecular Biology and its popularization. There are good historical reasons for the use of such a metaphor or theoretical model. Yet we argue that both the metaphor and the model are essentially inadequate also from the point of view of Physics and Computer Science. Relevant work has already been done, in Biology, criticizing the programming paradigm. We will refer to empirical evidence and theoretical writings in Biology, although our arguments will be mostly based on a comparison with the use of differential methods (in Molecular Biology: a mutation or the like is observed or induced and its phenotypic consequences are observed) as applied in Computer Science and in Physics, where this fundamental tool for empirical investigation originated and acquired a well-justified status. In particular, as we will argue, the programming paradigm is not theoretically sound as a causal (as in Physics) or deductive (as in Programming) framework for relating the genome to the phenotype, in contrast to the physicalist and computational grounds that this paradigm claims to propose.
Giuseppe Longo. URL: http://www.di.ens.fr/users/longo

19.
k-Adic formulations (for groups of objects of size k) of a variety of 2-adic similarity coefficients (for pairs of objects) for binary (presence/absence) data are presented. The formulations are not functions of 2-adic similarity coefficients. Instead, the main objective of the paper is to present k-adic formulations that reflect certain basic characteristics of, and have a similar interpretation as, their 2-adic versions. Two major classes are distinguished. The first class is referred to as Bennani-Heiser similarity coefficients, which contains all coefficients that can be defined using just the matches, the number of attributes that are present and that are absent in k objects, and the total number of attributes. The coefficients in the second class can be formulated as functions of Dice’s association indices. The author thanks Willem Heiser and three anonymous reviewers for their helpful comments and valuable suggestions on earlier versions of this article.
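As an illustration of a coefficient built only from the quantities named for the first class (attributes present in all k objects, attributes absent in all k objects, and the total number of attributes), the sketch below computes one natural k-adic generalization of the Jaccard coefficient. Whether it coincides with the paper's exact definition is an assumption; the function name is hypothetical, and the input is a list of equal-length 0/1 vectors.

```python
def k_adic_jaccard(objects):
    """Attributes present in all k objects divided by attributes present in at least one."""
    columns = list(zip(*objects))
    present_in_all = sum(all(col) for col in columns)
    absent_in_all = sum(not any(col) for col in columns)
    total = len(columns)
    # Undefined (division by zero) when every attribute is absent in all objects.
    return present_in_all / (total - absent_in_all)
```

For k = 2 this reduces to the usual Jaccard coefficient of two binary vectors.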

20.
Interpreting a taxonomic tree as a set of objects leads to natural measures of complexity and similarity, and sets natural lower bounds on a consensus tree. Interpretations differing as to the kind of objects constituting a tree lead to different measures and consensus. Subset nesting is preferred over the clusters (strict consensus) and even the triads interpretations because of its superior expression of shared structure. Algorithms for computing the complexity and similarity of trees, as well as a consensus index onto $[0, 1]$, are presented for this interpretation. The full consensus is defined as the only tree which includes all the nestings shared in a profile of rival trees and whose clusters reflect only nestings shared in the profile. The full consensus is proved to exist uniquely for each profile, and to equal the Adams consensus. The author is grateful for the many helpful comments on presentation from Frances McA. Adams, William H. E. Day, and Christopher A. Meacham.
