Similar Documents
20 similar documents found (search time: 304 ms)
1.
A cluster diagram is a rooted planar tree that depicts the hierarchical agglomeration of objects into groups of increasing size. On the null hypothesis that at each stage of the clustering procedure all possible joins are equally probable, we derive the probability distributions for two properties of these diagrams: (1) S, the number of single objects previously ungrouped that are joined in the final stages of clustering, and (2) m_k, the number of groups of k+1 objects that are formed during the process. Ecological applications of statistical tests for these properties are described and illustrated with data from weed communities of Saskatchewan fields. This work was supported by the Natural Sciences and Engineering Research Council of Canada.
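The null model above is easy to simulate: repeatedly merge a uniformly random pair of current clusters and record the size of each group formed, from which the counts m_k can be tabulated. A minimal sketch (function names are mine, not from the paper):

```python
import random

def random_agglomeration(n, rng=random.Random(0)):
    """Merge n singletons pairwise; at every stage each possible join is
    equally probable (the paper's null hypothesis).  Only cluster sizes
    matter, so clusters are represented by their sizes alone.
    Returns the size of the group formed at each of the n-1 merges."""
    clusters = [1] * n
    sizes_formed = []
    while len(clusters) > 1:
        i, j = rng.sample(range(len(clusters)), 2)  # uniform random join
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
        sizes_formed.append(merged)
    return sizes_formed

# m_k = number of groups of k+1 objects formed during one run
sizes = random_agglomeration(10)
m = {k: sum(1 for s in sizes if s == k + 1) for k in range(1, 10)}
```

Running this many times gives an empirical approximation to the distribution of m_k against which observed dendrograms could be compared.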

2.
Dendrograms used in data analysis are ultrametric spaces, hence objects of nonarchimedean geometry. It is known that there exist p-adic representations of dendrograms. Completed by a point at infinity, they can be viewed as subtrees of the Bruhat-Tits tree associated to the p-adic projective line. The implications are that certain moduli spaces known in algebraic geometry are in fact p-adic parameter spaces of dendrograms, and that stochastic classification can also be handled within this framework. Finally, we calculate the topology of the hidden part of a dendrogram.

3.
Efficient algorithms for agglomerative hierarchical clustering methods (cited by 11: 4 self-citations, 7 by others)
Whenever n objects are characterized by a matrix of pairwise dissimilarities, they may be clustered by any of a number of sequential, agglomerative, hierarchical, nonoverlapping (SAHN) clustering methods. These SAHN clustering methods are defined by a paradigmatic algorithm that usually requires O(n³) time, in the worst case, to cluster the objects. An improved algorithm (Anderberg 1973), while still requiring O(n³) worst-case time, can reasonably be expected to exhibit O(n²) expected behavior. By contrast, we describe a SAHN clustering algorithm that requires O(n² log n) time in the worst case. When SAHN clustering methods exhibit reasonable space distortion properties, further improvements are possible. We adapt a SAHN clustering algorithm, based on the efficient construction of nearest-neighbor chains, to obtain a reasonably general SAHN clustering algorithm that requires O(n²) time and space in the worst case.

Whenever n objects are characterized by k-tuples of real numbers, they may be clustered by any of a family of centroid SAHN clustering methods. These methods are based on a geometric model in which clusters are represented by points in k-dimensional real space and points being agglomerated are replaced by a single (centroid) point. For this model, we have solved a class of special packing problems involving point-symmetric convex objects and have exploited it to design an efficient centroid clustering algorithm. Specifically, we describe a centroid SAHN clustering algorithm that requires O(n²) time, in the worst case, for fixed k and for a family of dissimilarity measures including the Manhattan, Euclidean, Chebychev and all other Minkowski metrics.

This work was partially supported by the Natural Sciences and Engineering Research Council of Canada and by the Austrian Fonds zur Förderung der wissenschaftlichen Forschung.
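The nearest-neighbor-chain device mentioned above can be sketched compactly: grow a chain of successive nearest neighbors until a reciprocal pair is found, then merge that pair. A minimal illustration for complete linkage (a reducible method, for which the chain approach is valid); all names are mine, and this simple version does not reproduce the paper's O(n²) bookkeeping:

```python
def nn_chain_complete(d):
    """Agglomerative clustering by nearest-neighbor chains, shown with
    complete linkage.  d is a symmetric dissimilarity matrix (list of
    lists).  Returns merges as (i, j, height); the merged cluster reuses
    the smaller index as its label."""
    n = len(d)
    d = [row[:] for row in d]                     # working copy
    active = set(range(n))
    chain, merges = [], []
    while len(active) > 1:
        if not chain:
            chain.append(min(active))
        a = chain[-1]
        prev = chain[-2] if len(chain) > 1 else None
        # nearest active neighbor of a; ties broken in favor of prev
        b = min((x for x in active if x != a),
                key=lambda x: (d[a][x], x != prev))
        if b == prev:                             # reciprocal nearest neighbors
            chain.pop(); chain.pop()
            i, j = min(a, b), max(a, b)
            merges.append((i, j, d[a][b]))
            active.discard(j)
            for x in active - {i}:                # complete-linkage update
                d[i][x] = d[x][i] = max(d[i][x], d[j][x])
        else:
            chain.append(b)
    return merges
```

For points 0, 1, 10 on a line, the method first merges the close pair at height 1 and then joins the remainder at height 10.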

4.
Parameters are derived of distributions of three coefficients of similarity between pairs (dyads) of operational taxonomic units for multivariate binary data (presence/absence of attributes) under statistical independence. These are applied to test independence for dyadic data. Association among attributes within operational taxonomic units is allowed. It is also permissible for the two units in the dyad to be drawn from different populations having different presence probabilities of attributes. The variance of the distribution of the similarity coefficients under statistical independence is shown to be relatively large in many empirical situations. This result implies that the practical interpretation of these coefficients requires much care. An application using the Jaccard index is given for the assessment of consensus between psychotherapists and their clients.
The distribution of similarity coefficients for binary data with associated attributes
Abstract (translated from the French) Parameters of the distribution of three similarity coefficients between pairs of operational taxonomic units for multivariate binary data (presence/absence) are derived under the hypothesis of statistical independence. These parameters are used in a test of independence for dyadic data. Association among attributes within the population of units is allowed. The two units of the dyad may also be drawn from two different populations with different presence probabilities for the attributes. In many empirical situations the variance of the similarity coefficients can be relatively large under statistical independence; consequently, these coefficients must be interpreted with care. An example is given for the Jaccard coefficient, which was used in a study of agreement between psychotherapists and their clients.
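The Jaccard index used in the application, together with a Monte Carlo stand-in for its null distribution under independence, can be sketched as follows (the closed-form parameters the paper derives are not reproduced; function names are mine):

```python
import random

def jaccard(x, y):
    """Jaccard similarity between two binary presence/absence vectors:
    joint presences over positions where at least one profile has a 1."""
    a = sum(1 for u, v in zip(x, y) if u and v)       # joint presences
    bc = sum(1 for u, v in zip(x, y) if u != v)       # mismatches
    return a / (a + bc) if a + bc else float("nan")   # undefined if no 1s

def null_jaccard(x, y, n_perm=2000, rng=random.Random(0)):
    """Empirical null distribution of the Jaccard index under statistical
    independence of the two profiles, by permuting one profile."""
    y = list(y)
    sims = []
    for _ in range(n_perm):
        rng.shuffle(y)
        sims.append(jaccard(x, y))
    return sims
```

The spread of the permutation distribution illustrates the paper's point that the variance under independence can be large, so observed coefficients need careful interpretation.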

5.
Analytic procedures for classifying objects are commonly based on the product-moment correlation as a measure of object similarity. This statistic, however, generally does not represent an invariant index of similarity between two objects if they are measured along different bipolar variables where the direction of measurement for each variable is arbitrary. A computer simulation study compared Cohen's (1969) proposed solution to the problem, the invariant similarity coefficient r_c, with the mean product-moment correlation based on all possible changes in the measurement direction of individual variables within a profile of scores. The empirical observation that r_c approaches the mean product-moment correlation with increases in the number of scores in the profiles was interpreted as encouragement for the use of r_c in classification research. Some cautions regarding its application were noted. This research was supported by the Social Sciences and Humanities Research Council of Canada, Grant no. 410-83-0633, and by the University of Toronto.
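The comparison benchmark above, the mean product-moment correlation over all possible measurement-direction reversals, can be computed by brute force for small profiles (exponential in the number of variables). A sketch; Cohen's r_c itself is not reproduced here, and all names are mine:

```python
from itertools import product
from statistics import mean

def pearson(x, y):
    """Product-moment correlation between two score profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def mean_flip_correlation(x, y):
    """Mean correlation over all 2^k reversals of the measurement
    direction of the k bipolar variables; reversing variable i negates
    the i-th score in both profiles."""
    k = len(x)
    corrs = []
    for signs in product((1, -1), repeat=k):
        xs = [s * a for s, a in zip(signs, x)]
        ys = [s * b for s, b in zip(signs, y)]
        corrs.append(pearson(xs, ys))
    return mean(corrs)
```

The paper's simulation finding is that r_c approaches this mean as the number of scores grows.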

6.
Given k rooted binary trees A_1, A_2, ..., A_k, with labeled leaves, we generate C, a unique system of lineage constraints on common ancestors. We then present an algorithm for constructing the set of rooted binary trees B compatible with all of A_1, A_2, ..., A_k. The running time to obtain one such supertree is O(k²n²), where n is the number of distinct leaves in all of the trees A_1, A_2, ..., A_k.

7.
Many similarity coefficients for binary data are defined as fractions. For certain resemblance measures the denominator may become zero; in that case the value of the coefficient is indeterminate. It is shown that the seriousness of the indeterminacy problem differs across resemblance measures. Following Batagelj and Bren (1995), we remove the indeterminacies by defining appropriate values in the critical cases. The author would like to thank three anonymous reviewers for their helpful comments and valuable suggestions on earlier versions of this article.
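The repair strategy can be illustrated directly: compute the four match counts, and when a coefficient's denominator vanishes, substitute a conventionally chosen value. A sketch (the fill-in values below follow the convention of treating two all-zero vectors as identical; they are illustrative, not necessarily the paper's choices):

```python
def safe(num, den, value_if_zero):
    """num/den, or a chosen value in the critical case den == 0."""
    return num / den if den else value_if_zero

def coefficients(x, y):
    """Two fraction-form similarity coefficients for binary vectors,
    with indeterminacies removed by explicit critical-case values."""
    a = sum(1 for u, v in zip(x, y) if u and v)          # 1/1 matches
    b = sum(1 for u, v in zip(x, y) if u and not v)      # 1/0
    c = sum(1 for u, v in zip(x, y) if not u and v)      # 0/1
    return {
        # Jaccard: undefined when a+b+c == 0 (two all-zero vectors);
        # defined as 1 there, since the vectors are then identical
        "jaccard": safe(a, a + b + c, 1.0),
        # Dice has the same critical case; same convention applied
        "dice": safe(2 * a, 2 * a + b + c, 1.0),
    }
```

Coefficients whose denominators vanish only in degenerate inputs (like these two) have a mild indeterminacy problem; for other measures the critical cases arise more often, which is the paper's point.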

8.
9.
We consider the construction of a partition P_k consisting of k clusters, with k > 2. Bottom-up agglomerative approaches are also commonly used to construct partitions, and we discuss these in terms of worst-case performance for metric data sets. Our main contribution derives from a new restricted partition formulation that requires each cluster to be an interval of a given ordering of the objects being clustered. Dynamic programming can optimally split such an ordering into a partition P_k for a large class of objectives that includes min-diameter. We explore a variety of ordering heuristics and show that our algorithm, when combined with an appropriate ordering heuristic, outperforms traditional algorithms on both random and non-random data sets.
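The dynamic program over a fixed ordering can be sketched for the min-diameter objective: precompute interval diameters, then split the ordering into k contiguous intervals minimizing the maximum diameter. A minimal illustration (names are mine; this simple version runs in O(n²k) after O(n²) diameter precomputation, with no claim to match the paper's implementation):

```python
def min_diameter_partition(d, order, k):
    """Split `order` (a permutation of object indices) into k contiguous
    intervals minimizing the maximum within-cluster dissimilarity.
    d is a symmetric dissimilarity matrix.  Returns the optimal value."""
    n = len(order)
    INF = float("inf")
    # diam[i][j] = diameter of the interval order[i..j]
    diam = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diam[i][j] = max(diam[i][j - 1],
                             max(d[order[j]][order[t]] for t in range(i, j)))
    # best[m][j] = optimal split of order[0..j] into m intervals
    best = [[INF] * n for _ in range(k + 1)]
    for j in range(n):
        best[1][j] = diam[0][j]
    for m in range(2, k + 1):
        for j in range(n):
            for s in range(m - 1, j + 1):         # last interval order[s..j]
                best[m][j] = min(best[m][j],
                                 max(best[m - 1][s - 1], diam[s][j]))
    return best[k][n - 1]
```

For points 0, 1, 10, 11 on a line with k = 2, the optimal restricted partition is {0, 1} and {10, 11}, with max diameter 1.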

10.
Probabilistic feature models (PFMs) can be used to explain binary rater judgements about the associations between two types of elements (e.g., objects and attributes) on the basis of binary latent features. In particular, to explain observed object-attribute associations PFMs assume that respondents classify both objects and attributes with respect to a, usually small, number of binary latent features, and that the observed object-attribute association is derived as a specific mapping of these classifications. Standard PFMs assume that the object-attribute association probability is the same for all respondents, and that all observations are statistically independent. As both assumptions may be unrealistic, a multilevel latent class extension of PFMs is proposed which allows object and/or attribute parameters to differ across latent rater classes, and which allows modeling dependencies between associations with a common object (attribute) by assuming that the link between features and objects (attributes) is fixed across judgements. Formal relationships with existing multilevel latent class models for binary three-way data are described. As an illustration, the models are used to study rater differences in product perception and to investigate individual differences in the situational determinants of anger-related behavior.

11.
Aggregation of equivalence relations (cited by 1: 1 self-citation, 0 by others)
Each of n attributes partitions a set of items into equivalence classes. A consistent aggregator of the n partitions is defined as an aggregate partition that satisfies an independence condition and a unanimity condition. It is shown that the class of consistent aggregators is precisely the class of conjunctive aggregators. That is, for each consistent aggregator there is a nonempty subset N of the attributes such that two items are equivalent in the aggregate partition if and only if they are equivalent with respect to each attribute in N.
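A conjunctive aggregator is just the common refinement of the partitions for the attributes in N: two items land in the same aggregate class exactly when every attribute in N agrees on them. A minimal sketch, representing each partition as a dict mapping item to class label (names are mine):

```python
def conjunctive_aggregate(partitions, N):
    """Aggregate partitions conjunctively over the attribute subset N:
    items are equivalent iff equivalent under every attribute in N,
    i.e. iff their label tuples over N coincide."""
    def signature(item):
        return tuple(partitions[i][item] for i in N)
    classes = {}
    for item in partitions[0]:
        classes.setdefault(signature(item), set()).add(item)
    return list(classes.values())

# Two attributes partitioning {a, b, c}
parts = [{"a": 0, "b": 0, "c": 1},   # attribute 0: {a,b} | {c}
         {"a": 0, "b": 1, "c": 1}]   # attribute 1: {a} | {b,c}
```

With N = {0, 1} the aggregate is the three singletons; with N = {0} it reproduces the first partition, illustrating how the choice of N determines the (unique, per the theorem) consistent aggregator.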

12.
Optimal algorithms for comparing trees with labeled leaves (cited by 2: 1 self-citation, 1 by others)
Let R_n denote the set of rooted trees with n leaves in which: the leaves are labeled by the integers in {1, ..., n}; and among interior vertices only the root may have degree two. Associated with each interior vertex v in such a tree is the subset, or cluster, of leaf labels in the subtree rooted at v. Cluster {1, ..., n} is called trivial. Clusters are used in quantitative measures of similarity, dissimilarity and consensus among trees. For any k trees in R_n, the strict consensus tree C(T_1, ..., T_k) is that tree in R_n containing exactly those clusters common to every one of the k trees. Similarity between trees T_1 and T_2 in R_n is measured by the number S(T_1, T_2) of nontrivial clusters in both T_1 and T_2; dissimilarity, by the number D(T_1, T_2) of clusters in T_1 or T_2 but not in both. Algorithms are known to compute C(T_1, ..., T_k) in O(kn²) time, and S(T_1, T_2) and D(T_1, T_2) in O(n²) time. I propose a special representation of the clusters of any tree T in R_n, one that permits testing in constant time whether a given cluster exists in T. I describe algorithms that exploit this representation to compute C(T_1, ..., T_k) in O(kn) time, and S(T_1, T_2) and D(T_1, T_2) in O(n) time. These algorithms are optimal in a technical sense. They enable well-known indices of consensus between two trees to be computed in O(n) time. All these results apply as well to comparable problems involving unrooted trees with labeled leaves. The Natural Sciences and Engineering Research Council of Canada partially supported this work with grant A-4142.
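The cluster-set view of these measures is easy to sketch with plain set intersection; this naive version costs O(kn²) rather than the paper's O(kn), since it does not use the special constant-time cluster representation. Trees are written as nested tuples with integer leaves (a representation I choose for illustration):

```python
def clusters(tree):
    """All clusters (leaf sets of subtrees rooted at interior vertices)
    of a rooted tree given in nested-tuple form, e.g. ((1, 2), (3, 4))."""
    out = set()
    def walk(node):
        if isinstance(node, int):            # leaf: not a cluster itself
            return frozenset([node])
        s = frozenset().union(*(walk(child) for child in node))
        out.add(s)
        return s
    walk(tree)
    return out

def strict_consensus_clusters(trees):
    """Clusters common to every input tree -- the cluster set of the
    strict consensus tree C(T_1, ..., T_k)."""
    common = clusters(trees[0])
    for t in trees[1:]:
        common &= clusters(t)
    return common
```

From the common cluster set, S(T_1, T_2) is its size minus one (excluding the trivial cluster {1, ..., n}).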

13.
The method of nearest-neighbor interchange effects local improvements in a binary tree by replacing a 4-subtree by one of its two alternatives if this improves the objective function. We extend this to k-subtrees to reduce the number of local optima. Possible sequences of k-subtrees to be examined are produced by moving a window over the tree, incorporating one edge at a time while deactivating another. The direction of this movement is chosen according to a hill-climbing strategy. The algorithm includes a backtracking component. Series of simulations of molecular evolution data/parsimony analysis are carried out, for k = 4, ..., 8, contrasting the hill-climbing strategy with one based on a random choice of next window, and comparing two stopping rules. Increasing the window size k is found to be the most effective way of improving the local optimum, followed by the choice of hill-climbing over the random strategy. A suggestion for achieving higher values of k is based on a recursive use of the hill-climbing strategy. Acknowledgments: This work was supported in part by grants to the first author from the Natural Sciences and Engineering Research Council (Canada) and the Fonds pour la formation de chercheurs et l'aide à la recherche (Québec), and to the third author from the Danish Research Council. The first author is a fellow of the Canadian Institute for Advanced Research. Much of the research was carried out in the spring of 1991 while the first author was visiting the University of Geneva; warmest thanks are due Professor Claude Weber for this opportunity.

14.
Classification Using Class Cover Catch Digraphs (cited by 2: 0 self-citations, 2 by others)
A semiparametric classifier is constructed from class cover catch digraphs based on proximity between training observations. Performance comparisons are presented on synthetic and real examples versus k-nearest neighbors, Fisher's linear discriminant, and support vector machines. We demonstrate that the proposed semiparametric classifier has performance approaching that of the optimal parametric classifier in cases for which the optimal is available for comparison.

15.
The process of abstraction and concretisation is a label used for an explicative theory of scientific model-construction. In scientific theorising this process enters at various levels. We can identify two principal levels of abstraction that are useful to our understanding of theory-application. The first level is that of selecting a small number of variables and parameters abstracted from the universe of discourse and used to characterise the general laws of a theory. In classical mechanics, for example, we select position and momentum and establish a relation between the two variables, which we call Newton's 2nd law. The specification of the unspecified elements of scientific laws, e.g. the force function in Newton's 2nd law, is what establishes the link between the assertions of the theory and physical systems. In order to unravel how and with what conceptual resources scientific models are constructed, how they function and how they relate to theory, we need a view of theory-application that can accommodate our constructions of representation models. For this we need to expand our understanding of the process of abstraction to also explicate the process of specifying force functions etc. This is the second principal level at which abstraction enters our theorising, and the one on which I focus. In this paper, I attempt to elaborate a general analysis of the process of abstraction and concretisation involved in scientific model construction, and argue why it provides an explication of the construction of models of the nuclear structure.

16.
The set of k points that optimally represents a distribution in terms of mean squared error has been called the set of principal points (Flury 1990). Principal points are a special case of self-consistent points. Any given set of k distinct points in R^p induces a partition of R^p into Voronoi regions, or domains of attraction, according to minimal distance. A set of k points is called self-consistent for a distribution if each point equals the conditional mean of the distribution over its respective Voronoi region. For symmetric multivariate distributions, sets of self-consistent points typically form symmetric patterns. This paper investigates the optimality of different symmetric patterns of self-consistent points for symmetric multivariate distributions, and in particular for the bivariate normal distribution. These results are applied to the problem of estimating principal points.
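Self-consistency can be illustrated with a Lloyd-type fixed-point iteration on a sample: assign points to their nearest representative (the Voronoi regions) and move each representative to the mean of its region; a fixed point of this iteration is a set of self-consistent points for the empirical distribution. A one-dimensional sketch (names and the sample are mine):

```python
import random

def self_consistent_points(sample, k, iters=50, rng=random.Random(0)):
    """Lloyd-type iteration in 1-D: alternate nearest-representative
    assignment and region means.  Converged representatives are
    self-consistent points of the empirical distribution; the best such
    set in mean squared error gives the (empirical) principal points."""
    pts = sorted(rng.sample(sample, k))           # k distinct starting points
    for _ in range(iters):
        regions = [[] for _ in range(k)]
        for x in sample:
            j = min(range(k), key=lambda j: abs(x - pts[j]))
            regions[j].append(x)                  # Voronoi assignment
        # each representative moves to its region's conditional mean
        pts = [sum(r) / len(r) if r else p for r, p in zip(regions, pts)]
    return pts
```

For the sample {0, 1, 9, 10} with k = 2, every starting configuration converges to the self-consistent pair (0.5, 9.5), the means of the two natural Voronoi regions.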

17.
It is shown that one can calculate the Hubert-Arabie adjusted Rand index by first forming the fourfold contingency table counting the number of pairs of objects that were placed in the same cluster in both partitions, in the same cluster in one partition but in different clusters in the other, and in different clusters in both, and then computing Cohen's κ on this fourfold table. The author thanks Willem Heiser, Mark de Rooij, Marian Hickendorff and three anonymous reviewers for their helpful comments and valuable suggestions on earlier versions of this article.
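The recipe above translates directly into code: count the four kinds of object pairs, then apply the usual Cohen's κ formula to the resulting 2×2 table. A sketch (function names are mine; partitions are dicts mapping object to cluster label):

```python
from itertools import combinations

def pair_table(p1, p2):
    """Fourfold table over object pairs: counts of pairs placed
    (same, same), (same, diff), (diff, same), (diff, diff)
    in the two partitions p1, p2."""
    a = b = c = d = 0
    for x, y in combinations(p1, 2):
        s1, s2 = p1[x] == p1[y], p2[x] == p2[y]
        if s1 and s2:   a += 1
        elif s1:        b += 1
        elif s2:        c += 1
        else:           d += 1
    return a, b, c, d

def adjusted_rand(p1, p2):
    """Hubert-Arabie adjusted Rand index computed as Cohen's kappa
    on the fourfold pair table."""
    a, b, c, d = pair_table(p1, p2)
    n = a + b + c + d
    po = (a + d) / n                                   # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    return (po - pe) / (1 - pe)
```

Identical partitions give an index of 1; the maximally crossing pair of two-cluster partitions of four objects gives −0.5, matching the standard adjusted Rand formula.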

18.
In two-class discriminant problems, objects are allocated to one of the two classes by means of threshold rules based on discriminant functions. In this paper we propose to examine the quality of a discriminant function g in terms of its performance curve. This curve is the plot of the two misclassification probabilities as the threshold t assumes various real values. The role of such performance curves in evaluating and ordering discriminant functions and solving discriminant problems is presented. In particular, it is shown that: (i) the convexity of such a curve is a sufficient condition for optimal use of the information contained in the data reduced by g, and (ii) a g with a non-convex performance curve should be corrected by an explicitly obtained transformation.
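An empirical performance curve can be estimated from samples of g-scores under the two classes: for each threshold t, record the two misclassification probabilities of the rule "allocate to class 1 iff g(x) > t". A minimal sketch (names and the threshold rule's direction are my assumptions):

```python
def performance_curve(g_class0, g_class1, thresholds):
    """Empirical performance curve of a discriminant function g.
    g_class0, g_class1: samples of g-scores under classes 0 and 1.
    Returns, for each threshold t, the pair
    (P(class 0 allocated to 1), P(class 1 allocated to 0))."""
    curve = []
    for t in thresholds:
        p01 = sum(1 for v in g_class0 if v > t) / len(g_class0)
        p10 = sum(1 for v in g_class1 if v <= t) / len(g_class1)
        curve.append((p01, p10))
    return curve
```

Sweeping t over the real line traces the full curve; its convexity is what the paper uses as the optimality criterion.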

19.
NP-hard Approximation Problems in Overlapping Clustering (cited by 1: 1 self-citation, 0 by others)
The approximation problems are shown to be NP-hard for the L_p norm (p < ∞). These problems also correspond to the approximation by a strongly Robinson dissimilarity or by a dissimilarity fulfilling the four-point inequality (Bandelt 1992; Diatta and Fichet 1994). The results are extended to circular strongly Robinson dissimilarities, indexed k-hierarchies (Jardine and Sibson 1971, pp. 65-71), and to proper dissimilarities satisfying the Bertrand and Janowitz (k + 2)-point inequality (Bertrand and Janowitz 1999). Unidimensional scaling (linear or circular) is reinterpreted as a clustering problem and its hardness is established, but only for the L_1 norm.

20.
Suppose y, a d-dimensional (d ≥ 1) vector, is drawn from a mixture of k (k ≥ 2) populations Π_1, Π_2, ..., Π_k. We wish to identify the population that is the most likely source of the point y. Many classification rules have been proposed in the literature to solve this problem. In this study, a new nonparametric classifier based on the transvariation probabilities of data depth is proposed. We compare the performance of the newly proposed nonparametric classifier with classical and maximum-depth classifiers using some benchmark and simulated data sets. The authors thank the editor and referees for comments that led to an improvement of this paper. This work is partially supported by the National Science Foundation under Grant No. DMS-0604726.
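The maximum-depth classifier mentioned as a comparison baseline is simple to sketch in one dimension with Tukey (halfspace) depth: allocate y to the population whose sample gives it the greatest depth. (The paper's transvariation-probability classifier itself is not reproduced here; names and data are mine.)

```python
def depth1d(y, sample):
    """Tukey (halfspace) depth of a point y within a 1-D sample:
    the smaller of the two tail fractions at y."""
    n = len(sample)
    return min(sum(1 for x in sample if x <= y),
               sum(1 for x in sample if x >= y)) / n

def max_depth_classify(y, samples):
    """Allocate y to the population (index into `samples`) whose
    training sample gives y maximal depth."""
    return max(range(len(samples)), key=lambda i: depth1d(y, samples[i]))
```

A point near the center of one population's sample has high depth there and near-zero depth in the others, so the rule allocates it accordingly.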


Copyright © Beijing Qinyun Technology Development Co., Ltd. (京ICP备09084417号)