首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 578 毫秒
1.
The main aim of this work is the study of clustering dependent data by means of copula functions. Copulas are popular multivariate tools whose importance within clustering methods has not been investigated yet in detail. We propose a new algorithm (CoClust in brief) that allows to cluster dependent data according to the multivariate structure of the generating process without any assumption on the margins. Moreover, the approach does not require either to choose a starting classification or to set a priori the number of clusters; in fact, the CoClust selects them by using a criterion based on the log–likelihood of a copula fit. We test our proposal on simulated data for different dependence scenarios and compare it with a model–based clustering technique. Finally, we show applications of the CoClust to real microarray data of breast-cancer patients.  相似文献   

2.
This paper introduces a novel mixture model-based approach to the simultaneous clustering and optimal segmentation of functional data, which are curves presenting regime changes. The proposed model consists of a finite mixture of piecewise polynomial regression models. Each piecewise polynomial regression model is associated with a cluster, and within each cluster, each piecewise polynomial component is associated with a regime (i.e., a segment). We derive two approaches to learning the model parameters: the first is an estimation approach which maximizes the observed-data likelihood via a dedicated expectation-maximization (EM) algorithm, then yielding a fuzzy partition of the curves into K clusters obtained at convergence by maximizing the posterior cluster probabilities. The second is a classification approach and optimizes a specific classification likelihood criterion through a dedicated classification expectation-maximization (CEM) algorithm. The optimal curve segmentation is performed by using dynamic programming. In the classification approach, both the curve clustering and the optimal segmentation are performed simultaneously as the CEM learning proceeds. We show that the classification approach is a probabilistic version generalizing the deterministic K-means-like algorithm proposed in Hébrail, Hugueney, Lechevallier, and Rossi (2010). The proposed approach is evaluated using simulated curves and real-world curves. Comparisons with alternatives including regression mixture models and the K-means-like algorithm for piecewise regression demonstrate the effectiveness of the proposed approach.  相似文献   

3.
Generation of Random Clusters with Specified Degree of Separation   总被引:1,自引:1,他引:0  
We propose a random cluster generation algorithm that has the desired features: (1) the population degree of separation between clusters and the nearest neighboring clusters can be set to a specified value, based on a separation index; (2) no constraint is imposed on the isolation among clusters in each dimension; (3) the covariance matrices correspond to different shapes, diameters and orientations; (4) the full cluster structures generally could not be detected simply from pair-wise scatterplots of variables; (5) noisy variables and outliers can be imposed to make the cluster structures harder to be recovered. This algorithm is an improvement on the method used in Milligan (1985).  相似文献   

4.
An error variance approach to two-mode hierarchical clustering   总被引:2,自引:2,他引:0  
A new agglomerative method is proposed for the simultaneous hierarchical clustering of row and column elements of a two-mode data matrix. The procedure yields a nested sequence of partitions of the union of two sets of entities (modes). A two-mode cluster is defined as the union of subsets of the respective modes. At each step of the agglomerative process, the algorithm merges those clusters whose fusion results in the smallest possible increase in an internal heterogeneity measure. This measure takes into account both the variance within the respective cluster and its centroid effect defined as the squared deviation of its mean from the maximum entry in the input matrix. The procedure optionally yields an overlapping cluster solution by assigning further row and/or column elements to clusters existing at a preselected hierarchical level. Applications to real data sets drawn from consumer research concerning brand-switching behavior and from personality research concerning the interaction of behaviors and situations demonstrate the efficacy of the method at revealing the underlying two-mode similarity structure.  相似文献   

5.
T clusters, based on J distinct, contributory partitions (or, equivalently, J polytomous attributes). We describe a new model/algorithm for implementing this objective. The method's objective function incorporates a modified Rand measure, both in initial cluster selection and in subsequent refinement of the starting partition. The method is applied to both synthetic and real data. The performance of the proposed model is compared to latent class analysis of the same data set.  相似文献   

6.
When clustering asymmetric proximity data, only the average amounts are often considered by assuming that the asymmetry is due to noise. But when the asymmetry is structural, as typically may happen for exchange flows, migration data or confusion data, this may strongly affect the search for the groups because the directions of the exchanges are ignored and not integrated in the clustering process. The clustering model proposed here relies on the decomposition of the asymmetric dissimilarity matrix into symmetric and skew-symmetric effects both decomposed in within and between cluster effects. The classification structures used here are generally based on two different partitions of the objects fitted to the symmetric and the skew-symmetric part of the data, respectively; the restricted case is also presented where the partition fits jointly both of them allowing for clusters of objects similar with respect to the average amounts and directions of the data. Parsimonious models are presented which allow for effective and simple graphical representations of the results.  相似文献   

7.
Complete linkage as a multiple stopping rule for single linkage clustering   总被引:2,自引:2,他引:0  
Two commonly used clustering criteria are single linkage, which maximizes the minimum distance between clusters, and complete linkage, which minimizes the maximum distance within a cluster. By synthesizing these criteria, partitions of objects are sought which maximize a combined measure of the minimum distance between clusters and the maximum distance within a cluster. Each combined measure is shown to select a partition in the single linkage hierarchy. Therefore, in effect, complete linkage is used to provide a stopping rule for single linkage. An algorithm is outlined which uses the distance between each pair of objects twice only. To illustrate the method, an example is given using 23 Glamorganshire soil profiles.  相似文献   

8.
Incremental Classification with Generalized Eigenvalues   总被引:2,自引:0,他引:2  
Supervised learning techniques are widely accepted methods to analyze data for scientific and real world problems. Most of these problems require fast and continuous acquisition of data, which are to be used in training the learning system. Therefore, maintaining such systems updated may become cumbersome. Various techniques have been devised in the field of machine learning to solve this problem. In this study, we propose an algorithm to reduce the training data to a substantially small subset of the original training data to train a generalized eigenvalue classifier. The proposed method provides a constructive way to understand the influence of new training data on an existing classification function. We show through numerical experiments that this technique prevents the overfitting problem of the earlier generalized eigenvalue classifiers, while promising a comparable performance in classification with respect to the state-of-the-art classification methods.  相似文献   

9.
A sequential fitting procedure for linear data analysis models   总被引:1,自引:1,他引:0  
A particular factor analysis model with parameter constraints is generalized to include classification problems definable within a framework of fitting linear models. The sequential fitting (SEFIT) approach of principal component analysis is extended to include several nonstandard data analysis and classification tasks. SEFIT methods attempt to explain the variability in the initial data (commonly defined by a sum of squares) through an additive decomposition attributable to the various terms in the model. New methods are developed for both traditional and fuzzy clustering that have useful theoretic and computational properties (principal cluster analysis, additive clustering, and so on). Connections to several known classification strategies are also stated.The author is grateful to P. Arabie and L. J. Hubert for editorial assistance and reviewing going well beyond traditional levels.  相似文献   

10.
Classification and spatial methods can be used in conjunction to represent the individual information of similar preferences by means of groups. In the context of latent class models and using Simulated Annealing, the cluster-unfolding model for two-way two-mode preference rating data has been shown to be superior to a two-step approach of first deriving the clusters and then unfolding the classes. However, the high computational cost makes the procedure only suitable for small or medium-sized data sets, and the hypothesis of independent and normally distributed preference data may also be too restrictive in many practical situations. Therefore, an alternating least squares procedure is proposed, in which the individuals and the objects are partitioned into clusters, while at the same time the cluster centers are represented by unfolding. An enhanced Simulated Annealing algorithm in the least squares framework is also proposed in order to address the local optimum problem. Real and artificial data sets are analyzed to illustrate the performance of the model.  相似文献   

11.
To reveal the structure underlying two-way two-mode object by variable data, Mirkin (1987) has proposed an additive overlapping clustering model. This model implies an overlapping clustering of the objects and a reconstruction of the data, with the reconstructed variable profile of an object being a summation of the variable profiles of the clusters it belongs to. Grasping the additive (overlapping) clustering structure of object by variable data may, however, be seriously hampered in case the data include a very large number of variables. To deal with this problem, we propose a new model that simultaneously clusters the objects in overlapping clusters and reduces the variable space; as such, the model implies that the cluster profiles and, hence, the reconstructed data profiles are constrained to lie in a lowdimensional space. An alternating least squares (ALS) algorithm to fit the new model to a given data set will be presented, along with a simulation study and an illustrative example that makes use of empirical data.  相似文献   

12.
Block-Relaxation Approaches for Fitting the INDCLUS Model   总被引:1,自引:1,他引:0  
A well-known clustering model to represent I?×?I?×?J data blocks, the J frontal slices of which consist of I?×?I object by object similarity matrices, is the INDCLUS model. This model implies a grouping of the I objects into a prespecified number of overlapping clusters, with each cluster having a slice-specific positive weight. An INDCLUS model is fitted to a given data set by means of minimizing a least squares loss function. The minimization of this loss function has appeared to be a difficult problem for which several algorithmic strategies have been proposed. At present, the best available option seems to be the SYMPRES algorithm, which minimizes the loss function by means of a block-relaxation algorithm. Yet, SYMPRES is conjectured to suffer from a severe local optima problem. As a way out, based on theoretical results with respect to optimally designing block-relaxation algorithms, five alternative block-relaxation algorithms are proposed. In a simulation study it appears that the alternative algorithms with overlapping parameter subsets perform best and clearly outperform SYMPRES in terms of optimization performance and cluster recovery.  相似文献   

13.
K -means partitioning. We also describe some new features and improvements to the algorithm proposed by De Soete. Monte Carlo simulations have been conducted using different error conditions. In all cases (i.e., ultrametric or additive trees, or K-means partitioning), the simulation results indicate that the optimal weighting procedure should be used for analyzing data containing noisy variables that do not contribute relevant information to the classification structure. However, if the data involve error-perturbed variables that are relevant to the classification or outliers, it seems better to cluster or partition the entities by using variables with equal weights. A new computer program, OVW, which is available to researchers as freeware, implements improved algorithms for optimal variable weighting for ultrametric and additive tree clustering, and includes a new algorithm for optimal variable weighting for K-means partitioning.  相似文献   

14.
Consider N entities to be classified (e.g., geographical areas), a matrix of dissimilarities between pairs of entities, a graph H with vertices associated with these entities such that the edges join the vertices corresponding to contiguous entities. The split of a cluster is the smallest dissimilarity between an entity of this cluster and an entity outside of it. The single-linkage algorithm (ignoring contiguity between entities) provides partitions into M clusters for which the smallest split of the clusters, called split of the partition, is maximum. We study here the partitioning of the set of entities into M connected clusters for all M between N - 1 and 2 (i.e., clusters such that the subgraphs of H induced by their corresponding sets of entities are connected) with maximum split subject to that condition. We first provide an exact algorithm with a (N2) complexity for the particular case in which H is a tree. This algorithm suggests in turn a first heuristic algorithm for the general problem. Several variants of this heuristic are Also explored. We then present an exact algorithm for the general case based on iterative determination of cocycles of subtrees and on the solution of auxiliary set covering problems. As solution of the latter problems is time-consuming for large instances, we provide another heuristic in which the auxiliary set covering problems are solved approximately. Computational results obtained with the exact and heuristic algorithms are presented on test problems from the literature.  相似文献   

15.
This paper develops a new procedure for simultaneously performing multidimensional scaling and cluster analysis on two-way compositional data of proportions. The objective of the proposed procedure is to delineate patterns of variability in compositions across subjects by simultaneously clustering subjects into latent classes or groups and estimating a joint space of stimulus coordinates and class-specific vectors in a multidimensional space. We use a conditional mixture, maximum likelihood framework with an E-M algorithm for parameter estimation. The proposed procedure is illustrated using a compositional data set reflecting proportions of viewing time across television networks for an area sample of households.  相似文献   

16.
We describe a novel extension to the Class-Cover-Catch-Digraph (CCCD) classifier, specifically tuned to detection problems. These are two-class classification problems where the natural priors on the classes are skewed by several orders of magnitude. The emphasis of the proposed techniques is in computationally efficient classification for real-time applications. Our principal contribution consists of two boosted classi- fiers built upon the CCCD structure, one in the form of a sequential decision process and the other in the form of a tree. Both of these classifiers achieve performances comparable to that of the original CCCD classifiers, but at drastically reduced computational expense. An analysis of classification performance and computational cost is performed using data from a face detection application. Comparisons are provided with Support Vector Machines (SVM) and reduced SVMs. These comparisons show that while some SVMs may achieve higher classification performance, their computational burden can be so high as to make them unusable in real-time applications. On the other hand, the proposed classifiers combine high detection performance with extremely fast classification.  相似文献   

17.
Single linkage clusters on a set of points are the maximal connected sets in a graph constructed by connecting all points closer than a given threshold distance. The complete set of single linkage clusters is obtained from all the graphs constructed using different threshold distances. The set of clusters forms a hierarchical tree, in which each non-singleton cluster divides into two or more subclusters; the runt size for each single linkage cluster is the number of points in its smallest subcluster. The maximum runt size over all single linkage clusters is our proposed test statistic for assessing multimodality. We give significance levels of the test for two null hypotheses, and consider its power against some bimodal alternatives. Research partially supported by NSF Grant No. DMS-8617919.  相似文献   

18.
Clustering techniques are based upon a dissimilarity or distance measure between objects and clusters. This paper focuses on the simplex space, whose elements??compositions??are subject to non-negativity and constant-sum constraints. Any data analysis involving compositions should fulfill two main principles: scale invariance and subcompositional coherence. Among fuzzy clustering methods, the FCM algorithm is broadly applied in a variety of fields, but it is not well-behaved when dealing with compositions. Here, the adequacy of different dissimilarities in the simplex, together with the behavior of the common log-ratio transformations, is discussed in the basis of compositional principles. As a result, a well-founded strategy for FCM clustering of compositions is suggested. Theoretical findings are accompanied by numerical evidence, and a detailed account of our proposal is provided. Finally, a case study is illustrated using a nutritional data set known in the clustering literature.  相似文献   

19.
The issue of determining “the right number of clusters” in K-Means has attracted considerable interest, especially in the recent years. Cluster intermix appears to be a factor most affecting the clustering results. This paper proposes an experimental setting for comparison of different approaches at data generated from Gaussian clusters with the controlled parameters of between- and within-cluster spread to model cluster intermix. The setting allows for evaluating the centroid recovery on par with conventional evaluation of the cluster recovery. The subjects of our interest are two versions of the “intelligent” K-Means method, ik-Means, that find the “right” number of clusters by extracting “anomalous patterns” from the data one-by-one. We compare them with seven other methods, including Hartigan’s rule, averaged Silhouette width and Gap statistic, under different between- and within-cluster spread-shape conditions. There are several consistent patterns in the results of our experiments, such as that the right K is reproduced best by Hartigan’s rule – but not clusters or their centroids. This leads us to propose an adjusted version of iK-Means, which performs well in the current experiment setting.  相似文献   

20.
k consisting of k clusters, with k > 2. Bottom-up agglomerative approaches are also commonly used to construct partitions, and we discuss these in terms of worst-case performance for metric data sets. Our main contribution derives from a new restricted partition formulation that requires each cluster to be an interval of a given ordering of the objects being clustered. Dynamic programming can optimally split such an ordering into a partition Pk for a large class of objectives that includes min-diameter. We explore a variety of ordering heuristics and show that our algorithm, when combined with an appropriate ordering heuristic, outperforms traditional algorithms on both random and non-random data sets.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号