Similar articles
 20 similar articles found
1.
In agglomerative hierarchical clustering, pair-group methods suffer from a problem of non-uniqueness when two or more distances between different clusters coincide during the amalgamation process. The traditional remedy has been to break ties between distances by an arbitrary criterion, which yields different hierarchical classifications depending on the criterion followed. In this article we propose a variable-group algorithm that groups more than two clusters at the same time when ties occur. We give a tree representation for the results of the algorithm, which we call a multidendrogram, as well as a generalization of the Lance and Williams formula that enables the algorithm to be implemented recursively. The authors thank A. Arenas for discussion and helpful comments. This work was partially supported by DGES of the Spanish Government Project No. FIS2006–13321–C02–02 and by a grant of Universitat Rovira i Virgili.
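The tie-handling idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the function name `tied_groups`, the dictionary layout of the distances, and the tolerance `tol` are all hypothetical, and the sketch covers a single amalgamation step without the generalized Lance and Williams update.

```python
def tied_groups(dist, tol=1e-12):
    """Find the clusters that would be merged together in one step.

    dist maps frozenset({a, b}) -> distance between clusters a and b
    (a hypothetical layout chosen for this sketch).
    Returns the minimum distance and a list of sets of cluster labels.
    """
    d_min = min(dist.values())
    # Collect every pair whose distance ties with the minimum.
    tied = [tuple(pair) for pair, d in dist.items() if abs(d - d_min) <= tol]
    # Tied pairs that share a cluster are joined into one group
    # (connected components of the "tied" graph).
    groups = []
    for a, b in tied:
        touching = [g for g in groups if a in g or b in g]
        merged = {a, b}.union(*touching)
        groups = [g for g in groups if g not in touching] + [merged]
    return d_min, groups


# Example: d(A,B) = d(B,C) = 1.0 is a tie, so A, B and C form one group.
pairs = {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 2.0,
         ("A", "D"): 3.0, ("B", "D"): 3.0, ("C", "D"): 3.0}
dist = {frozenset(p): d for p, d in pairs.items()}
d_min, groups = tied_groups(dist)
```

A pair-group method would have to pick either (A, B) or (B, C) first; grouping all three at once is what removes the dependence on the tie-breaking rule.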

2.
3.
4.
5.
6.
The Baire metric induces an ultrametric on a dataset and is of linear computational complexity, in contrast with the standard quadratic-time agglomerative hierarchical clustering algorithm. In this work we empirically evaluate this new approach to hierarchical clustering. We compare hierarchical clustering based on the Baire metric with (i) agglomerative hierarchical clustering, in terms of algorithm properties; (ii) generalized ultrametrics, in terms of definition; and (iii) fast clustering through k-means partitioning, in terms of quality of results. For the latter, we carry out an in-depth astronomical study. We apply the Baire distance to spectrometric and photometric redshifts from the Sloan Digital Sky Survey using, in this work, about half a million astronomical objects. We want to know how well the (more costly to determine) spectrometric redshifts can predict the (more easily obtained) photometric redshifts, i.e. we seek to regress the spectrometric on the photometric redshifts, and we use clusterwise regression for this.
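A longest-common-prefix distance of the kind described here might look like the sketch below. The abstract does not state the details, so several things are assumptions: values are represented as digit strings, the base is 10, and the names `baire_distance` and `prefix_clusters` are hypothetical. Note that two identical finite strings get a small positive distance here, whereas the distance between identical infinite sequences would be zero.

```python
from collections import defaultdict


def baire_distance(x: str, y: str, base: float = 10.0) -> float:
    """Return base**(-k), where k is the length of the common prefix."""
    k = 0
    for a, b in zip(x, y):
        if a != b:
            break
        k += 1
    return base ** (-k)


def prefix_clusters(values, k):
    """Group digit strings by their first k digits in a single pass.

    Strings sharing a k-digit prefix are within base**(-k) of each other,
    so each bucket is one cluster at that level of the hierarchy.
    """
    clusters = defaultdict(list)
    for v in values:
        clusters[v[:k]].append(v)
    return dict(clusters)


# Example: digits after the decimal point of three measurements.
values = ["4253", "4267", "5110"]
level2 = prefix_clusters(values, 2)
```

Because each value is placed in a bucket by reading its prefix once, one level of the hierarchy costs time linear in the number of objects, which matches the complexity claim in the abstract.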

7.
8.
9.
As data sets continue to grow in size and complexity, effective and efficient techniques are needed to target important features in the variable space. Many of the variable selection techniques that are commonly used alongside clustering algorithms are based upon determining the best variable subspace according to model fitting in a stepwise manner. These techniques are often computationally intensive and can require extended periods of time to run; in fact, some are prohibitively computationally expensive for high-dimensional data. In this paper, a novel variable selection technique is introduced for use in clustering and classification analyses that is both intuitive and computationally efficient. We focus largely on applications in mixture model-based learning, but the technique could be adapted for use with various other clustering/classification methods. Our approach is illustrated on both simulated and real data, highlighted by contrasting its performance with that of other comparable variable selection techniques on the real data sets.

10.
11.
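Because living matter is homochiral, it is increasingly recognized that the chirality of drugs, pesticides, fragrances, and similar compounds strongly influences their physiological effects; using chiral substances scientifically will improve quality of life and the living environment. Apart from natural sources, artificial synthesis is the main route to optically pure chiral substances. Resolution of racemates, substrate-induced asymmetric synthesis, and catalytic asymmetric synthesis are the three ways of obtaining optically pure chiral substances. Of these, catalytic asymmetric synthesis is the most effective because it readily achieves chirality multiplication: one efficient catalyst can produce hundreds, thousands, or even millions of optically active product molecules while minimizing the formation of unwanted isomers. Therefore, catalytic asymmetric reactions not only provide for the preparation of optically pure chiral substances…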

12.
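China's level of sharing of large-scale scientific instruments lags far behind that of developed countries. One important reason is the lack of an effective sharing model to guide the construction of a complete sharing system. Based on the three-level model of science and technology resource sharing, this paper constructs a three-level sharing model for large-scale scientific instruments with the levels visible, accessible, and usable. Drawing on examples of large-scale instrument sharing at research and experimental facilities in developed countries, it analyzes in detail the sharing content, sharing strategies, reservation process, real-time monitoring, and training at the three levels.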

13.
This paper proposes a new way of overcoming the existing limitations. It generalizes the model used in the previous approaches by introducing a more comprehensive portfolio of covariance matrix structures. Further, this paper proposes a Bayesian solution for clustering problems in the presence of noise. The performance of the proposed method is first studied by simulation; the procedure is also applied to the analysis of data concerning species of butterflies and diabetes patients.

14.
Clustering criteria for discrete data and latent class models   Cited by: 1 (self-citations: 0, citations by others: 1)
We show that a well-known clustering criterion for discrete data, the information criterion, is closely related to the classification maximum likelihood criterion for the latent class model. This relation can be derived from the Bryant-Windham construction. Emphasis is placed on binary clustering criteria which are analyzed under the maximum likelihood approach for different multivariate Bernoulli mixtures. This alternative form of criterion reveals non-apparent aspects of clustering techniques. All the criteria discussed can be optimized with the alternating optimization algorithm. Some illustrative applications are included.
Résumé Nous montrons que le critère de classification de l'information, souvent utilisé pour les données discrètes, est très lié au critère du maximum de vraisemblance classifiante appliqué au modèle des classes latentes. Ce lien peut être analysé sous l'approche de la paramétrisation de Bryant-Windham. L'accent est mis sur le cas des données binaires qui sont analysées sous l'approche du maximum de vraisemblance pour les mélanges de distributions multivariées de Bernoulli. Cette forme de critère permet de mettre en évidence des aspects cachés des méthodes de classification de données binaires. Tous les critères envisagés ici peuvent être optimisés avec l'algorithme d'optimisation alternée. Des exemples concluent cet article.

15.
We examine the problem of aggregating several partitions of a finite set into a single consensus partition. We note that the dual concepts of clustering and isolation are especially significant in this connection. The hypothesis that a consensus partition should respect unanimity with respect to either concept leads us to stress a consensus interval rather than a single partition. The extremes of this interval are characterized axiomatically. If a sufficient totality of traits has been measured, and if measurement errors are independent, then a true classifying partition can be expected to lie in the consensus interval. The structure of the partitions in the interval lends itself to partial solutions of the consensus problem. Conditional entropy may be used to quantify the uncertainty inherent in the interval as a whole.
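The abstract does not say exactly how conditional entropy is applied to the consensus interval, so the following is only a sketch of the standard quantity H(P | Q) for two partitions of the same set. The representation of a partition as a list of block labels and the name `conditional_entropy` are assumptions made for this illustration.

```python
from collections import Counter
from math import log2


def conditional_entropy(p, q):
    """Return H(P | Q) in bits for two partitions given as label lists.

    p[i] and q[i] are the block labels of element i in partitions P and Q.
    The value is 0 when Q determines P, i.e. when Q refines P.
    """
    n = len(p)
    joint = Counter(zip(p, q))   # sizes of the intersections of blocks
    q_counts = Counter(q)        # sizes of the blocks of Q
    return -sum(c / n * log2(c / q_counts[b]) for (_, b), c in joint.items())


# P splits four elements into two blocks; Q lumps them into one block.
p = [0, 0, 1, 1]
q = [0, 0, 0, 0]
h_p_given_q = conditional_entropy(p, q)   # one bit remains unknown about P
h_q_given_p = conditional_entropy(q, p)   # knowing P fixes Q completely
```

Applied to the coarsest and finest partitions of an interval, a quantity like this measures how much of the finer partition remains undetermined once the coarser one is known.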

16.
Normal mixture models are widely used for statistical modeling of data, including cluster analysis. However, maximum likelihood estimation (MLE) for normal mixtures using the EM algorithm may fail as the result of singularities or degeneracies. To avoid this, we propose replacing the MLE by a maximum a posteriori (MAP) estimator, also found by the EM algorithm. For choosing the number of components and the model parameterization, we propose a modified version of BIC, where the likelihood is evaluated at the MAP instead of the MLE. We use a highly dispersed proper conjugate prior, containing a small fraction of one observation's worth of information. The resulting method avoids degeneracies and singularities, but when these are not present it gives similar results to the standard method using MLE, EM and BIC.

17.
Clustering the rows and columns of a contingency table   Cited by: 1 (self-citations: 2, citations by others: 1)
A number of ways of investigating heterogeneity in a two-way contingency table are reviewed. In particular, we consider chi-square decompositions of the Pearson chi-square statistic with respect to the nodes of a hierarchical clustering of the rows and/or the columns of the table. A cut-off point which indicates significant clustering may be defined on the binary trees associated with the respective row and column cluster analyses. This approach provides a simple graphical procedure which is useful in interpreting a significant chi-square statistic of a contingency table. The author gratefully acknowledges the constructive comments of the referees and the editor.
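To make the chi-square decomposition concrete, the sketch below computes the Pearson statistic of a table and the amount lost when two rows are merged, which can be read as the share of the statistic attached to that node of a row clustering. It is an illustration only: the function names `pearson_chi2` and `merge_rows` are hypothetical, and the abstract does not specify the clustering criterion or the cut-off rule.

```python
def pearson_chi2(table):
    """Pearson chi-square statistic of a two-way table of counts."""
    n = sum(map(sum, table))
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2


def merge_rows(table, i, j):
    """Return a new table in which rows i and j are pooled into one row."""
    merged = [a + b for a, b in zip(table[i], table[j])]
    return [r for k, r in enumerate(table) if k not in (i, j)] + [merged]


# Rows 0 and 1 have similar profiles; row 2 differs from both.
table = [[20, 10], [18, 12], [5, 25]]
total = pearson_chi2(table)
after_merge = pearson_chi2(merge_rows(table, 0, 1))
node_share = total - after_merge   # part of chi-square lost by the merge
```

Merging rows with similar profiles removes little of the statistic, so small values of `node_share` low in the tree and large values near the root are what a significant row clustering would look like.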

18.
We study the application of simulated annealing and tabu search to the solution of the clique partitioning problem. We illustrate the effectiveness of these techniques by computational results associated not only with randomly generated problems, but also with real-life problems arising from applications concerning the optimal aggregation of binary relations into an equivalence relation. The need for these approaches is emphasized by the example of a special class of instances of the clique partitioning problem for which the most commonly used heuristics perform arbitrarily badly, while tabu search systematically obtains the optimal solution.
Résumé Nous étudions dans cet article l'application du recuit simulé et de la méthode de recherche tabou dans la résolution du problème de partitionnement de graphes en cliques. Nous illustrons l'efficacité de ces techniques par des résultats numériques associés soit à des problèmes générés au hasard, soit à des problèmes réels concernant l'agrégation de relations binaires dans une relation d'équivalence. L'intérêt de ces approches est mis en évidence à travers une classe de problèmes pour lesquels les heuristiques les plus connues ont une performance arbitrairement mauvaise, tandis que la méthode de recherche tabou obtient systématiquement des solutions optimales.

19.
The rapid increase in the size of data sets makes clustering all the more important to capture and summarize the information, at the same time making clustering more difficult to accomplish. If model-based clustering is applied directly to a large data set, it can be too slow for practical application. A simple and common approach is to first cluster a random sample of moderate size, and then use the clustering model found in this way to classify the remainder of the objects. We show that, in its simplest form, this method may lead to unstable results. Our experiments suggest that a stable method with better performance can be obtained with two straightforward modifications to the simple sampling method: several tentative models are identified from the sample instead of just one, and several EM steps are used rather than just one E step to classify the full data set. We find that there are significant gains from increasing the size of the sample up to about 2,000, but not from further increases. These conclusions are based on the application of several alternative strategies to the segmentation of three different multispectral images, and to several simulated data sets.

20.
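Under the new conditions of economic globalization in the 21st century, international trade, tourism, and transportation are developing rapidly, creating abundant opportunities for the invasion of alien species. The rapid intensification of biological invasions over the past ten to twenty years has already caused heavy economic losses and ecological disasters. Without doubt, biological invasions will continue to increase for a long time to come; their harm to economic development, ecological security, and human health will be enormous and long-lasting, and they are a major challenge that humanity must meet effectively in order to survive. Although there is broad agreement on the theoretical and practical significance of research on the mechanisms of biological invasion,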
