Similar literature
Found 20 similar documents (search time: 816 ms)
1.
The median procedure for n-trees   Cited by: 2 (self-citations: 2, citations by others: 0)
Let $(X, d)$ be a metric space. The function $M: X^k \to 2^X$ defined by $M(x_1, \ldots, x_k) = \{x \in X : \sum_{i=1}^{k} d(x, x_i) \text{ is minimum}\}$ is called the median procedure and has been found useful in various applications involving the notion of consensus. Here we present axioms that characterize $M$ when $X$ is a certain class of trees (hierarchical classifications) and $d$ is the symmetric difference metric. Acknowledgments: We would like to thank the referees and Editor for helpful comments.
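
As a rough illustration of the median procedure, the sketch below evaluates it by brute force over a small, explicitly listed candidate set. It assumes that each tree is represented by its set of nontrivial clusters and that the distance is the size of the symmetric difference of those sets; the names `median_procedure` and `symmetric_difference` are illustrative and do not come from the paper.

```python
def symmetric_difference(a, b):
    """Symmetric difference metric between two trees given as sets of clusters."""
    return len(a ^ b)


def median_procedure(candidates, profile, d):
    """Return every candidate whose total distance to the profile is minimum."""
    totals = {x: sum(d(x, p) for p in profile) for x in candidates}
    best = min(totals.values())
    return {x for x, total in totals.items() if total == best}


# Trees on the leaves {1, 2, 3}, each encoded by its nontrivial clusters
# (an assumed encoding chosen for this example).
star = frozenset()
t12 = frozenset({frozenset({1, 2})})
t13 = frozenset({frozenset({1, 3})})
t23 = frozenset({frozenset({2, 3})})
candidates = [star, t12, t13, t23]

majority = median_procedure(candidates, [t12, t12, t13], symmetric_difference)
tie = median_procedure(candidates, [t12, t13], symmetric_difference)
print(majority == {t12})          # True: t12 is the unique median
print(tie == {star, t12, t13})    # True: the median need not be unique
```

The second profile shows why $M$ is set-valued: several trees can attain the same minimum total distance.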

2.
Maximum sum-of-splits clustering   Cited by: 1 (self-citations: 1, citations by others: 0)
Consider $N$ entities to be classified, and a matrix of dissimilarities between pairs of them. The split of a cluster is the smallest dissimilarity between an entity of this cluster and an entity outside it. The single-linkage algorithm provides partitions into $M$ clusters for which the smallest split is maximum. We study here the average split of the clusters or, equivalently, the sum of splits. A $\Theta(N^2)$ algorithm is provided to determine maximum sum-of-splits partitions into $M$ clusters for all $M$ between $N - 1$ and 2, using the dual graph of the single-linkage dendrogram.
Résumé: Consider $N$ objects to be classified and a matrix of dissimilarities between pairs of these objects. The split of a class is the smallest dissimilarity between an object of this class and an object outside it. The single-linkage algorithm yields partitions into $M$ classes whose smallest split is maximum. We study the average split of the classes or, equivalently, the sum of the splits. We propose a $\Theta(N^2)$ algorithm to determine partitions into $M$ classes with maximum sum of splits, for $M$ ranging from $N - 1$ down to 2, based on the dual graph of the single-linkage dendrogram.
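
To make the definition of a split concrete, here is a minimal sketch that evaluates the split of each cluster and the sum of splits for a given partition. It only scores a partition that is handed to it; it is not the algorithm from the paper, and the function names and the 4 x 4 dissimilarity matrix are made up for the example.

```python
def split(cluster, diss):
    """Smallest dissimilarity between an entity in the cluster and one outside it."""
    outside = [j for j in range(len(diss)) if j not in cluster]
    return min(diss[i][j] for i in cluster for j in outside)


def sum_of_splits(partition, diss):
    """Sum of the splits of all clusters of a partition."""
    return sum(split(cluster, diss) for cluster in partition)


# Hypothetical symmetric dissimilarity matrix for four entities.
D = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]

print(sum_of_splits([{0, 1}, {2, 3}], D))    # 3 + 3 = 6
print(sum_of_splits([{0}, {1}, {2, 3}], D))  # 1 + 1 + 3 = 5
```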

3.
Data holders, such as statistical institutions and financial organizations, face a serious and demanding task when producing data for official and public use: controlling the risk of identity disclosure and protecting sensitive information when they communicate data-sets among themselves, to governmental agencies, and to the public. One of the techniques applied is micro-aggregation. In a Bayesian setting, micro-aggregation can be viewed as the optimal partitioning of the original data-set based on the minimization of an appropriate measure of discrepancy, or distance, between two posterior distributions, one of which is conditional on the original data-set and the other conditional on the aggregated data-set. Assuming $d$-variate normal data-sets and using several measures of discrepancy, it is shown that the asymptotically optimal equal-probability $m$-partition of $\mathbb{R}^d$, with $m^{1/d} \in \mathbb{N}$, is the convex one provided by hypercubes whose sides are formed by hyperplanes perpendicular to the canonical axes, no matter which discrepancy measure has been used. On the basis of this result, a method that produces a sub-optimal partition at very small computational cost is presented.

4.
Minimum sum of diameters clustering   Cited by: 1 (self-citations: 1, citations by others: 0)
The problem of determining a partition of a given set of $N$ entities into $M$ clusters such that the sum of the diameters of these clusters is minimum has been studied by Brucker (1978). He proved that it is NP-complete for $M \ge 3$ and mentioned that its complexity was unknown for $M = 2$. We provide an $O(N^3 \log N)$ algorithm for this latter case. Moreover, we show that determining a partition into two clusters which minimizes any given function of the diameters can be done in $O(N^5)$ time. Acknowledgments: This research was supported by the Air Force Office of Scientific Research Grant AFOSR 0271 to Rutgers University. We are grateful to Yves Crama for several insightful remarks and to an anonymous referee for detailed comments.
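
For intuition about the two-cluster case, the sketch below finds a minimum sum-of-diameters bipartition by exhaustive enumeration. This takes exponential time and is only a reference check for tiny inputs, not the polynomial algorithm from the paper; the function names and the example matrix are assumptions made for the illustration.

```python
from itertools import combinations


def diameter(cluster, diss):
    """Largest dissimilarity between two entities of a cluster (0 for a singleton)."""
    return max((diss[i][j] for i, j in combinations(sorted(cluster), 2)), default=0)


def best_bipartition(diss):
    """Exhaustively search all bipartitions for the minimum sum of diameters."""
    n = len(diss)
    best = None
    # Bits 0..n-2 choose the first cluster; entity n-1 always stays in the
    # second cluster, so each bipartition is enumerated exactly once.
    for mask in range(1, 2 ** (n - 1)):
        a = {i for i in range(n) if (mask >> i) & 1}
        b = set(range(n)) - a
        cost = diameter(a, diss) + diameter(b, diss)
        if best is None or cost < best[0]:
            best = (cost, a, b)
    return best


# Hypothetical symmetric dissimilarity matrix for four entities.
D = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]

cost, first, second = best_bipartition(D)
print(cost, first, second)  # 3 {0, 1} {2, 3}
```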

5.
Consider $N$ entities to be classified, with given weights, and a matrix of dissimilarities between pairs of them. The split of a cluster is the smallest dissimilarity between an entity in that cluster and an entity outside it. The single-linkage algorithm provides partitions into $M$ clusters for which the smallest split is maximum. We consider the problems of finding maximum split partitions with exactly $M$ clusters and with at most $M$ clusters, subject to the additional constraint that the sum of the weights of the entities in each cluster never exceeds a given bound. These two problems are shown to be NP-hard and reducible to a sequence of bin-packing problems. A $\Theta(N^2)$ algorithm for the particular case $M = N$ of the second problem is also presented. Computational experience is reported. Acknowledgments: Work of the first author was supported in part by AFOSR grants 0271 and 0066 to Rutgers University and was done in part during a visit to GERAD, Ecole Polytechnique de Montréal, whose support is gratefully acknowledged. Work of the second and third authors was supported by NSERC grant GP0036426 and by FCAR grant 89EQ4144. We are grateful to Silvano Martello and Paolo Toth for making available to us their program MTP for the bin-packing problem and to three anonymous referees for comments which helped to improve the presentation of the paper.

6.
In this paper, we propose a bicriterion objective function for clustering a given set of $N$ entities, which minimizes $[\alpha d - (1 - \alpha) s]$, where $0 \le \alpha \le 1$, and $d$ and $s$ are the diameter and the split of the clustering, respectively. When $\alpha = 1$, the problem reduces to minimum diameter clustering, and when $\alpha = 0$, to maximum split clustering. We show that this objective provides an effective way to compromise between the two often conflicting criteria. While the problem is NP-hard in general, a polynomial algorithm with worst-case time complexity $O(N^2)$ is devised to solve the bipartition version. This algorithm actually gives all the Pareto optimal bipartitions with respect to diameter and split, and it can be extended to yield an efficient divisive hierarchical scheme. An extension of the approach to the objective $[\alpha (d_1 + d_2) - 2 (1 - \alpha) s]$ is also proposed, where $d_1$ and $d_2$ are the diameters of the two clusters of a bipartition. Acknowledgments: This research was supported in part by the National Science and Engineering Research Council of Canada (Grant OGP 0104900). The authors wish to thank two anonymous referees, whose detailed comments on earlier drafts improved the paper.
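
A small sketch of how such a bicriterion objective can be evaluated for a given partition follows. It assumes the objective is a weighted difference of the form weight times diameter minus (1 - weight) times split, with the weight written `alpha` here, which matches the two limiting cases stated in the abstract; the function names and the data are hypothetical, and the code scores a partition rather than optimizing over partitions.

```python
from itertools import combinations


def clustering_diameter(partition, diss):
    """Largest dissimilarity between two entities in the same cluster."""
    return max(
        (diss[i][j] for c in partition for i, j in combinations(sorted(c), 2)),
        default=0,
    )


def clustering_split(partition, diss):
    """Smallest dissimilarity between two entities in different clusters."""
    label = {i: k for k, c in enumerate(partition) for i in c}
    return min(
        diss[i][j]
        for i, j in combinations(sorted(label), 2)
        if label[i] != label[j]
    )


def bicriterion(partition, diss, alpha):
    """Weighted trade-off: small diameter is rewarded, small split is penalized."""
    d = clustering_diameter(partition, diss)
    s = clustering_split(partition, diss)
    return alpha * d - (1 - alpha) * s


# Hypothetical symmetric dissimilarity matrix for four entities.
D = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
P = [{0, 1}, {2, 3}]

print(bicriterion(P, D, 1.0))  # equals the diameter, 2
print(bicriterion(P, D, 0.0))  # equals minus the split, -3
print(bicriterion(P, D, 0.5))  # -0.5
```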

7.
A class of (multiple) consensus methods for n-trees (dendroids, hierarchical classifications) is studied. This class constitutes an extension of the so-called median consensus in the sense that we get two numbers $m$ and $m'$ such that: if a cluster $X$ occurs in $k$ n-trees of a profile $P$, with $k \ge m'$, then it occurs in every consensus n-tree of $P$; if $X$ occurs in $k$ n-trees of $P$, with $m \le k < m'$, then it may, or may not, belong to a consensus n-tree of $P$; if $X$ occurs in $k$ n-trees of $P$, with $k < m$, then it cannot occur in any consensus n-tree of $P$. If these conditions are satisfied, the multiconsensus function is said to be thresholded by the pair $(m, m')$. Two results are obtained. The first one characterizes the pairs of numbers that can be viewed as thresholds for some consensus function. The second one provides a characterization of thresholded consensus methods. As an application, a characterization of the quota rules is provided.
Résumé: This article deals with a class of (multiple) consensus methods for hierarchical classifications. This class generalizes the median consensus in that it consists of the methods $c$ for which there exist two numbers $m$ and $m'$ such that: if a class $X$ belongs to $k$ hierarchies of a profile $P$, with $k \ge m'$, then $X$ belongs to every consensus hierarchy of $P$; if $X$ belongs to $k$ hierarchies of $P$, with $m \le k < m'$, then $X$ may, or may not, belong to a consensus hierarchy of $P$; if $X$ belongs to $k$ hierarchies of $P$, with $k < m$, then $X$ belongs to no consensus hierarchy of $P$. The pair $(m, m')$ is then said to be a threshold for $c$. Two results are obtained. The first characterizes the pairs of numbers that are consensus thresholds. The second characterizes the consensus methods that admit a threshold. A characterization of the quota rule is deduced from this second result.

8.
Clustering with a criterion which minimizes the sum of squared distances to cluster centroids is usually done in a heuristic way. An exact polynomial algorithm, with a complexity in $O(N^{p+1} \log N)$, is proposed for minimum sum of squares hierarchical divisive clustering of points in a $p$-dimensional space with small $p$. Empirical complexity is one order of magnitude lower. Data sets with $N = 20000$ for $p = 2$, $N = 1000$ for $p = 3$, and $N = 200$ for $p = 4$ are clustered in a reasonable computing time.
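
The criterion itself is simple to state in code. The sketch below computes the sum of squared distances to cluster centroids for a given partition; it illustrates only the objective, not the exact divisive algorithm mentioned in the abstract, and the function names and points are invented for the example.

```python
def within_cluster_ss(points):
    """Sum of squared Euclidean distances from each point to the cluster centroid."""
    n = len(points)
    dim = len(points[0])
    centroid = [sum(p[k] for p in points) / n for k in range(dim)]
    return sum((p[k] - centroid[k]) ** 2 for p in points for k in range(dim))


def partition_ss(clusters):
    """Total within-cluster sum of squares of a partition."""
    return sum(within_cluster_ss(c) for c in clusters)


# Two hypothetical clusters of points in the plane.
clusters = [[(0, 0), (2, 0)], [(5, 5)]]
print(partition_ss(clusters))  # 2.0
```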

9.
10.
On some significance tests in cluster analysis   Cited by: 1 (self-citations: 1, citations by others: 0)
We investigate the properties of several significance tests for distinguishing between the hypothesis $H$ of a homogeneous population and an alternative $A$ involving clustering or heterogeneity, with emphasis on the case of multidimensional observations $x_1, \ldots, x_n \in \mathbb{R}^p$. Four types of test statistics are considered: the ($s$-th) largest gap between observations, their mean distance (or similarity), the minimum within-cluster sum of squares resulting from a k-means algorithm, and the resulting maximum F statistic. The asymptotic distributions under $H$ are given for $n \to \infty$, and the asymptotic power of the tests is derived for neighboring alternatives.

11.
A random sample of size $N$ is divided into $k$ clusters that minimize the within-cluster sum of squares locally. Some large sample properties of this k-means clustering method (as $k$ approaches $\infty$ with $N$) are obtained. In one dimension, it is established that the sample k-means clusters are such that the within-cluster sums of squares are asymptotically equal, and that the sizes of the cluster intervals are inversely proportional to the one-third power of the underlying density at the midpoints of the intervals. The difficulty involved in generalizing the results to the multivariate case is mentioned. Acknowledgments: This research was supported in part by the National Science Foundation under Grant MCS75-08374. The author would like to thank John Hartigan and David Pollard for helpful discussions and comments.

12.
Data in an experimental array where a nominal dependent variable has $m > 2$ outcomes may be accounted for by one of a number of possible schemes consisting of $J$ successive and/or parallel independent $m_i$-nomial experiments, where $\sum_i m_i = m + J - 1$. Each such scheme can be represented by a tree diagram which is presumed to be valid everywhere in the array. A criterion based on likelihood is defined to assess the different schemes. The set of outcome probabilities of a scheme is shown to differ from that of all other schemes almost everywhere in the space of parameters. As sample size increases, the probability of correctly inferring the true tree tends to 1. Using Monte-Carlo simulation of the four-outcome case, we illustrate, for small sample sizes, how this probability depends on the parameters.
Résumé: A family of models is proposed for analyzing a data set whose observations are made on a discrete response variable and on an explanatory vector. Each model consists of a series of multinomial experiments whose outcomes are groupings of categories of the response variable. The probabilities of observing these groupings depend on the explanatory vector through logistic-linear equations. It is easily shown that every model in this family contains the same number of parameters. Moreover, each model corresponds to a tree structure that hierarchically classifies the categories of the response variable: a nonterminal node of the tree represents one of these multinomial experiments, and a terminal node represents a category.

Thus the probability of observing one of these categories is computed by following the path from the root to the terminal node representing that category, and the choice of model is based on a likelihood criterion computed as the product of the likelihoods evaluated from the data set at each nonterminal node of the tree. It is shown that the ability to predict the categories differs for each tree and that only the true tree can exhibit the true probabilities over almost all of the parameter space. Asymptotic properties of the criterion are also established, ensuring that the true model is chosen by this criterion with probability 1. A Monte-Carlo simulation study illustrates, for small samples, how the probability that the true model is chosen depends on the parameter values.

13.
The Metric Cutpoint Partition Problem   Cited by: 1 (self-citations: 1, citations by others: 0)
Let $G = (V, E, w)$ be a graph with vertex and edge sets $V$ and $E$, respectively, and $w: E \to \mathbb{R}^{+}$ a function which assigns a positive weight or length to each edge of $G$. $G$ is called a realization of a finite metric space $(M, d)$, with $M = \{1, \ldots, n\}$, if and only if $\{1, \ldots, n\} \subseteq V$ and $d(i, j)$ is equal to the length of the shortest chain linking $i$ and $j$ in $G$, for all $i, j = 1, \ldots, n$. A realization $G$ of $(M, d)$ is called optimal if the sum of its weights is minimal among all the realizations of $(M, d)$. A cutpoint in a graph $G$ is a vertex whose removal strictly increases the number of connected components of $G$. The Metric Cutpoint Partition Problem is to determine if a finite metric space $(M, d)$ has an optimal realization containing a cutpoint. We prove in this paper that this problem is polynomially solvable. We also describe an algorithm that constructs an optimal realization of $(M, d)$ from optimal realizations of subspaces that do not contain any cutpoint. Supported by grant PA002-104974/2 from the Swiss National Science Foundation.

14.
The theorem of the paper Aggregation of Equivalence Relations, by Fishburn and Rubinstein, states a result already known. This theorem improves a result from Mirkin (1975) and appears as a corollary occurring in Leclerc (1984).
Résumé: The only theorem of the paper Aggregation of Equivalence Relations, by Fishburn and Rubinstein, is already known. It in fact improves a result of Mirkin (1975) and appears as a corollary in Leclerc (1984).

15.
This reply to Gash’s (Found Sci 2013) commentary on Nescolarde-Selva and Usó-Doménech (Found Sci 2013) answers the three questions raised and at the same time opens up new questions.

16.
Optimal algorithms for comparing trees with labeled leaves   Cited by: 2 (self-citations: 1, citations by others: 1)
Let $R_n$ denote the set of rooted trees with $n$ leaves in which the leaves are labeled by the integers in $\{1, \ldots, n\}$ and, among interior vertices, only the root may have degree two. Associated with each interior vertex $v$ in such a tree is the subset, or cluster, of leaf labels in the subtree rooted at $v$. Cluster $\{1, \ldots, n\}$ is called trivial. Clusters are used in quantitative measures of similarity, dissimilarity and consensus among trees. For any $k$ trees in $R_n$, the strict consensus tree $C(T_1, \ldots, T_k)$ is that tree in $R_n$ containing exactly those clusters common to every one of the $k$ trees. Similarity between trees $T_1$ and $T_2$ in $R_n$ is measured by the number $S(T_1, T_2)$ of nontrivial clusters in both $T_1$ and $T_2$; dissimilarity, by the number $D(T_1, T_2)$ of clusters in $T_1$ or $T_2$ but not in both. Algorithms are known to compute $C(T_1, \ldots, T_k)$ in $O(kn^2)$ time, and $S(T_1, T_2)$ and $D(T_1, T_2)$ in $O(n^2)$ time. I propose a special representation of the clusters of any tree $T \in R_n$, one that permits testing in constant time whether a given cluster exists in $T$. I describe algorithms that exploit this representation to compute $C(T_1, \ldots, T_k)$ in $O(kn)$ time, and $S(T_1, T_2)$ and $D(T_1, T_2)$ in $O(n)$ time. These algorithms are optimal in a technical sense. They enable well-known indices of consensus between two trees to be computed in $O(n)$ time. All these results apply as well to comparable problems involving unrooted trees with labeled leaves. Acknowledgments: The Natural Sciences and Engineering Research Council of Canada partially supported this work with grant A-4142.
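
The definitions of strict consensus, similarity and dissimilarity translate directly into set operations when each tree is stored as a set of clusters. The sketch below takes that naive route with Python hash sets; it does not implement the special constant-time cluster representation proposed in the paper, and the function names and example trees are assumptions made for illustration.

```python
def strict_consensus(trees):
    """Clusters common to every tree in the profile."""
    return set.intersection(*(set(t) for t in trees))


def similarity(t1, t2, leaves):
    """Number of nontrivial clusters present in both trees."""
    return len({c for c in set(t1) & set(t2) if c != leaves})


def dissimilarity(t1, t2):
    """Number of clusters present in exactly one of the two trees."""
    return len(set(t1) ^ set(t2))


# Two hypothetical rooted trees on the leaves {1, 2, 3, 4}, each encoded by
# the clusters of its interior vertices.
leaves = frozenset({1, 2, 3, 4})
T1 = {frozenset({1, 2}), frozenset({1, 2, 3}), leaves}
T2 = {frozenset({1, 2}), frozenset({3, 4}), leaves}

print(strict_consensus([T1, T2]) == {frozenset({1, 2}), leaves})  # True
print(similarity(T1, T2, leaves))   # 1
print(dissimilarity(T1, T2))        # 2
```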

17.
Consider $N$ entities to be classified (e.g., geographical areas), a matrix of dissimilarities between pairs of entities, and a graph $H$ with vertices associated with these entities such that the edges join the vertices corresponding to contiguous entities. The split of a cluster is the smallest dissimilarity between an entity of this cluster and an entity outside of it. The single-linkage algorithm (ignoring contiguity between entities) provides partitions into $M$ clusters for which the smallest split of the clusters, called the split of the partition, is maximum. We study here the partitioning of the set of entities into $M$ connected clusters for all $M$ between $N - 1$ and 2 (i.e., clusters such that the subgraphs of $H$ induced by their corresponding sets of entities are connected) with maximum split subject to that condition. We first provide an exact algorithm with a $\Theta(N^2)$ complexity for the particular case in which $H$ is a tree. This algorithm suggests in turn a first heuristic algorithm for the general problem. Several variants of this heuristic are also explored. We then present an exact algorithm for the general case based on iterative determination of cocycles of subtrees and on the solution of auxiliary set covering problems. As solution of the latter problems is time-consuming for large instances, we provide another heuristic in which the auxiliary set covering problems are solved approximately. Computational results obtained with the exact and heuristic algorithms are presented on test problems from the literature.

18.
The “DNA is a program” metaphor is still widely used in Molecular Biology and its popularization. There are good historical reasons for the use of such a metaphor or theoretical model. Yet we argue that both the metaphor and the model are essentially inadequate also from the point of view of Physics and Computer Science. Relevant work has already been done, in Biology, criticizing the programming paradigm. We will refer to empirical evidence and theoretical writings in Biology, although our arguments will be mostly based on a comparison with the use of differential methods (in Molecular Biology: a mutation or the like is observed or induced and its phenotypic consequences are observed) as applied in Computer Science and in Physics, where this fundamental tool for empirical investigation originated and acquired a well-justified status. In particular, as we will argue, the programming paradigm is not theoretically sound as a causal (as in Physics) or deductive (as in Programming) framework for relating the genome to the phenotype, in contrast to the physicalist and computational grounds that this paradigm claims to propose.
Giuseppe Longo. URL: http://www.di.ens.fr/users/longo

19.

Reviewers

Guest Reviewers, Journal of Classification Volume 26, 2009

20.

Reviewers

Guest Reviewers, Journal of Classification Volume 28(1) 2011, Special Issue
