Similar Documents
20 similar documents found.
1.
We consider two fundamental properties in the analysis of two-way tables of positive data: the principle of distributional equivalence, one of the cornerstones of correspondence analysis of contingency tables, and the principle of subcompositional coherence, which forms the basis of compositional data analysis. For an analysis to be subcompositionally coherent, it suffices to analyze the ratios of the data values. A common approach to dimension reduction in compositional data analysis is to perform principal component analysis on the logarithms of ratios, but this method does not obey the principle of distributional equivalence. We show that by introducing weights for the rows and columns, the method achieves this desirable property and can be applied to a wider class of methods. This weighted log-ratio analysis is theoretically equivalent to “spectral mapping”, a multivariate method developed almost 30 years ago for displaying ratio-scale data from biological activity spectra. The close relationship between spectral mapping and correspondence analysis is also explained, as well as their connection with association modeling. The weighted log-ratio methodology is used here to visualize frequency data in linguistics and chemical compositional data in archeology. The first author acknowledges research support from the Fundación BBVA in Madrid as well as partial support by the Spanish Ministry of Education and Science, grant MEC-SEJ2006-14098. The constructive comments of the referees, who also brought additional relevant literature to our attention, significantly improved our article.
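As a concrete illustration of the pipeline sketched above (log-transform, weighted double-centring with the row and column masses, then a singular value decomposition), here is a minimal numpy reading of weighted log-ratio analysis; it is a sketch based on the abstract, not the authors' reference implementation, and the principal-coordinate scaling is one conventional choice.

```python
import numpy as np

def weighted_lra(N, n_dims=2):
    """Weighted log-ratio analysis sketch: SVD of the log-transformed
    matrix after double-centring with row/column masses as weights."""
    P = N / N.sum()                     # correspondence matrix (data must be > 0)
    r, c = P.sum(axis=1), P.sum(axis=0) # row and column masses
    L = np.log(P)
    # weighted double-centring: subtracting weighted row and column means
    # makes the result depend only on ratios (subcompositional coherence)
    Lc = L - (L @ c)[:, None] - (r @ L)[None, :] + r @ L @ c
    S = np.sqrt(r)[:, None] * Lc * np.sqrt(c)[None, :]
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    F = (U[:, :n_dims] * sv[:n_dims]) / np.sqrt(r)[:, None]   # row coordinates
    G = (Vt[:n_dims].T * sv[:n_dims]) / np.sqrt(c)[:, None]   # column coordinates
    return F, G

# e.g. F, G = weighted_lra(np.array([[10., 20, 30], [5, 50, 10], [40, 10, 5]]))
```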

2.
A common approach to dealing with missing values in multivariate exploratory data analysis consists of minimizing the loss function over all non-missing elements, which can be achieved by EM-type algorithms in which an iterative imputation of the missing values is performed during the estimation of the axes and components. This paper proposes such an algorithm, named iterative multiple correspondence analysis, to handle missing values in multiple correspondence analysis (MCA). The algorithm, based on an iterative PCA algorithm, is described and its properties are studied. We point out the overfitting problem and propose a regularized version of the algorithm to overcome this major issue. Finally, the performance of the regularized iterative MCA algorithm (implemented in the R package missMDA) is assessed on both simulations and a real dataset. Results are promising compared with other methods, such as the missing-data passive modified margin method, an adaptation of the missing passive method used in Gifi’s homogeneity analysis framework.
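The iterative-imputation idea is easiest to see on a plain numeric matrix with an iterative PCA, which is the engine the abstract says the MCA algorithm is built on. The sketch below is illustrative only: the actual method operates on the MCA coding of categorical data, and the regularized variant additionally shrinks the singular values to curb the overfitting mentioned above.

```python
import numpy as np

def iterative_pca_impute(X, n_comp=2, tol=1e-8, max_iter=1000):
    """EM-type imputation: alternate a rank-n_comp PCA fit with
    re-imputation of the missing cells until the fit stabilizes."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    Xhat = X.copy()
    Xhat[miss] = np.nanmean(X, axis=0)[np.where(miss)[1]]  # start from column means
    prev = np.inf
    for _ in range(max_iter):
        mu = Xhat.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xhat - mu, full_matrices=False)
        # a regularized variant would shrink s here to fight overfitting
        fit = (U[:, :n_comp] * s[:n_comp]) @ Vt[:n_comp] + mu
        Xhat[miss] = fit[miss]                             # impute from the model
        loss = ((Xhat[~miss] - fit[~miss]) ** 2).sum()     # loss on observed cells
        if abs(prev - loss) < tol:
            break
        prev = loss
    return Xhat
```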

3.
Approximate analysis of variance of spatially autocorrelated regional data
The classical method for analysis of variance of data divided in geographic regions is impaired if the data are spatially autocorrelated within regions, because the condition of independence of the observations is not met. Positive autocorrelation reduces within-group variability, thus artificially increasing the relative amount of among-group variance. Negative autocorrelation may produce the opposite effect. This difficulty can be viewed as a loss of an unknown number of degrees of freedom. Such problems can be found in population genetics, in ecology and in other branches of biology, as well as in economics, epidemiology, geography, geology, marketing, political science, and sociology. A computer-intensive method has been developed to overcome this problem in certain cases. It is based on the computation of pooled within-group sums of squares for sampled permutations of internally connected areas on a map. The paper presents the theory, the algorithms, and results obtained using this method. A computer program, written in PASCAL, is available. This work was supported by NSERC grant no. A7738 to Pierre Legendre and by grant BSR 8614384 from the National Science Foundation to Robert R. Sokal. This is contribution No. 366 of the Groupe d'Ecologie des Eaux Douces, Université de Montréal, and contribution No. 727 in Ecology and Evolution from the State University of New York at Stony Brook.
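The computational core, as we read it, is a reference distribution of the pooled within-group sum of squares over random partitions of the map into internally connected areas. The sketch below uses a naive random region-growing scheme on an adjacency graph as the permutation device; the graph, the growing rule, and the retry logic are illustrative assumptions, not the paper's algorithm.

```python
import random

def connected_partition(adj, sizes):
    """Draw one random partition into internally connected groups of the
    given sizes; adj maps each locality to the set of its neighbours.
    Naive rejection scheme: restart whenever a group gets stuck."""
    while True:
        unused, groups, stuck = set(adj), [], False
        for size in sizes:
            seed = random.choice(sorted(unused))
            grp = {seed}
            unused.discard(seed)
            while len(grp) < size:
                frontier = [v for u in grp for v in adj[u] if v in unused]
                if not frontier:
                    stuck = True
                    break
                v = random.choice(frontier)
                grp.add(v)
                unused.discard(v)
            if stuck:
                break
            groups.append(grp)
        if not stuck:
            return groups

def pooled_within_ss(x, groups):
    """Pooled within-group sum of squares for values x (dict: node -> value)."""
    ss = 0.0
    for g in groups:
        vals = [x[v] for v in g]
        mean = sum(vals) / len(vals)
        ss += sum((v - mean) ** 2 for v in vals)
    return ss

# Test idea: the p-value is the fraction of sampled connected partitions
# whose pooled within-group SS is <= the observed one, since positive
# autocorrelation deflates within-group variability.
```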

4.
A latent class vector model for preference ratings
A latent class formulation of the well-known vector model for preference data is presented. Assuming preference ratings as input data, the model simultaneously clusters the subjects into a small number of homogeneous groups (or latent classes) and constructs a joint geometric representation of the choice objects and the latent classes according to a vector model. The distributional assumptions on which the latent class approach is based are analogous to the distributional assumptions that are consistent with the common practice of fitting the vector model to preference data by least squares methods. An EM algorithm for fitting the latent class vector model is described as well as a procedure for selecting the appropriate number of classes and the appropriate number of dimensions. Some illustrative applications of the latent class vector model are presented and some possible extensions are discussed. Geert De Soete is supported as “Bevoegdverklaard Navorser” of the Belgian “Nationaal Fonds voor Wetenschappelijk Onderzoek.”

5.
A new method and a supporting theorem for designing multiple-class piecewise linear classifiers are described. The method involves cutting the straight line segments joining pairs of opposed points (i.e., points from distinct classes) in d-dimensional space. We refer to such straight line segments as links. We show how to nearly minimize the number of hyperplanes required to cut all of these links, thereby yielding a near-Bayes-optimal decision surface regardless of the number of classes, and we describe the underlying theory. This method does not require parameters to be specified by users — an improvement over earlier methods. Experiments on multiple-class data obtained from ship images show that classifiers designed by this method yield approximately the same error rate as the best k-nearest neighbor rule, while providing faster decisions. This research was supported in part by the Army Research Office under grant DAAG29-84-K-0208 and in part by the University of California MICRO Program. We thank R. W. Doucette of the U.S. Naval Weapons Center and R. D. Holben of Ford Aerospace Corporation for providing the ship images used in our experiments.

6.
In agglomerative hierarchical clustering, pair-group methods suffer from a problem of non-uniqueness when two or more distances between different clusters coincide during the amalgamation process. The traditional approach to this drawback has been to adopt some arbitrary criterion for breaking ties between distances, which results in different hierarchical classifications depending on the criterion followed. In this article we propose a variable-group algorithm that consists of grouping more than two clusters at the same time when ties occur. We give a tree representation for the results of the algorithm, which we call a multidendrogram, as well as a generalization of the Lance and Williams’ formula which enables the implementation of the algorithm in a recursive way. The authors thank A. Arenas for discussion and helpful comments. This work was partially supported by DGES of the Spanish Government Project No. FIS2006–13321–C02–02 and by a grant of Universitat Rovira i Virgili.
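A compact sketch of the variable-group idea follows: whenever several inter-cluster distances tie at the minimum, every cluster in a connected component of the tie graph is merged in a single step, so the result no longer depends on an arbitrary tie-breaking order. Average linkage recomputed from the original distances stands in here for the generalized Lance and Williams recursion.

```python
import numpy as np

def multidendrogram(D0, tol=1e-9):
    """Variable-group agglomeration sketch on a point-to-point distance
    matrix D0; returns the merge height and membership at every level."""
    clusters = [[i] for i in range(len(D0))]
    levels = []
    while len(clusters) > 1:
        m = len(clusters)
        D = np.full((m, m), np.inf)
        for i in range(m):
            for j in range(i + 1, m):       # average linkage between clusters
                D[i, j] = D0[np.ix_(clusters[i], clusters[j])].mean()
        dmin = D.min()
        ties = np.argwhere(np.isclose(D, dmin, atol=tol))
        parent = list(range(m))             # union-find over the tie graph
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i, j in ties:
            parent[find(i)] = find(j)
        merged = {}
        for i in range(m):                  # merge whole components at once
            merged.setdefault(find(i), []).extend(clusters[i])
        clusters = list(merged.values())
        levels.append((dmin, [sorted(c) for c in clusters]))
    return levels
```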

7.
Classifications are generally pictured in the form of hierarchical trees, also called dendrograms. A dendrogram is the graphical representation of an ultrametric (= cophenetic) matrix, so dendrograms can be compared to one another by comparing their cophenetic matrices. Three methods used in testing the correlation between matrices corresponding to dendrograms are evaluated. The three permutational procedures make use of different aspects of the information to compare dendrograms: the Mantel procedure permutes label positions only; the binary tree methods randomize the topology as well; the double-permutation procedure is based on all the information included in a dendrogram, that is: topology, label positions, and cluster heights. Theoretical and empirical investigations of these methods are carried out to evaluate their relative performance. Simulations show that the Mantel test is too conservative when applied to the comparison of dendrograms; the methods of binary tree comparisons do slightly better; only the double-permutation test provides an unbiased type I error.
This work was supported by NSERC grant no. A7738 to Pierre Legendre and by a NSERC scholarship to F.-J. Lapointe.
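For reference, a minimal version of the simplest of the three procedures, the Mantel test between cophenetic matrices, can be written with SciPy as below (the abstract's conclusion is that this test is too conservative; the double-permutation test, which also permutes topology and fusion levels, is not reproduced here).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

def mantel_dendrogram_test(X1, X2, n_perm=999, seed=0):
    """Correlate the cophenetic matrices of two average-linkage
    dendrograms, permuting the leaf labels of one to get a p-value."""
    C1 = squareform(cophenet(linkage(pdist(X1), method='average')))
    C2 = squareform(cophenet(linkage(pdist(X2), method='average')))
    iu = np.triu_indices_from(C1, k=1)
    r_obs = np.corrcoef(C1[iu], C2[iu])[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(C1.shape[0])           # permute label positions only
        hits += np.corrcoef(C1[np.ix_(p, p)][iu], C2[iu])[0, 1] >= r_obs
    return r_obs, (hits + 1) / (n_perm + 1)
```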

8.
Dendrograms are widely used to represent graphically the clusters and partitions obtained with hierarchical clustering schemes. Espaliers are generalized dendrograms in which the length of horizontal lines is used in addition to their level in order to display the values of two characteristics of each cluster (e.g., the split and the diameter) instead of only one. An algorithm is first presented to transform a dendrogram into an espalier without rotation of any part of the former. This is done by stretching some of the horizontal lines to obtain a diagram with vertical and horizontal lines only, then cutting off by diagonal lines the parts of the horizontal lines exceeding their prescribed length. The problem of determining whether, allowing rotations, no diagonal lines are needed is solved by an O(N²) algorithm, where N is the number of entities to be classified. This algorithm is then generalized to obtain espaliers with minimum width and, possibly, some diagonal lines. Work of the first and second authors has been supported by FCAR (Fonds pour la Formation de Chercheurs et l'Aide à la Recherche) grant 92EQ1048, and grant N00014-92-J-1194 from the Office of Naval Research. Work of the first author has also been supported by an NSERC (Natural Sciences and Engineering Research Council of Canada) grant to École des Hautes Études Commerciales, Montréal, and by NSERC grant GP0105574. Work of the second author has been supported by NSERC grant GP0036426, by FCAR grant 90NC0305, and by an NSF Professorship for Women in Science at Princeton University from September 1990 until December 1991. Work of the third author was done in part during a visit to GERAD, Montréal.

9.
The main aim of this work is the study of clustering dependent data by means of copula functions. Copulas are popular multivariate tools whose importance within clustering methods has not yet been investigated in detail. We propose a new algorithm (CoClust in brief) that allows clustering of dependent data according to the multivariate structure of the generating process without any assumption on the margins. Moreover, the approach requires neither a starting classification nor an a priori number of clusters; in fact, the CoClust selects them by using a criterion based on the log-likelihood of a copula fit. We test our proposal on simulated data for different dependence scenarios and compare it with a model-based clustering technique. Finally, we show applications of the CoClust to real microarray data of breast-cancer patients.

10.
The issue of determining “the right number of clusters” in K-Means has attracted considerable interest, especially in recent years. Cluster intermix appears to be the factor most affecting the clustering results. This paper proposes an experimental setting for comparing different approaches on data generated from Gaussian clusters with controlled parameters of between- and within-cluster spread to model cluster intermix. The setting allows for evaluating the centroid recovery on a par with the conventional evaluation of cluster recovery. The subjects of our interest are two versions of the “intelligent” K-Means method, iK-Means, that find the “right” number of clusters by extracting “anomalous patterns” from the data one by one. We compare them with seven other methods, including Hartigan’s rule, the averaged Silhouette width, and the Gap statistic, under different between- and within-cluster spread-shape conditions. There are several consistent patterns in the results of our experiments, such as the right K being reproduced best by Hartigan’s rule, though not the clusters or their centroids. This leads us to propose an adjusted version of iK-Means, which performs well in the current experimental setting.
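Our reading of the anomalous-pattern extraction at the heart of iK-Means is sketched below: seed a cluster at the point farthest from the grand mean, grow it against that fixed reference until stable, remove it, and repeat; the surviving clusters set K and seed a final k-means run. Standardization, the exact stopping rule, and the adjusted variant from the paper are all omitted.

```python
import numpy as np

def anomalous_patterns(X, min_size=2, max_iter=100):
    """Extract 'anomalous pattern' clusters one by one; the number of
    surviving centroids is the suggested K (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0)                         # fixed reference point
    remaining, centroids = X.copy(), []
    while len(remaining):
        c = remaining[((remaining - ref) ** 2).sum(1).argmax()]  # farthest point
        for _ in range(max_iter):
            in_c = (((remaining - c) ** 2).sum(1)
                    < ((remaining - ref) ** 2).sum(1))
            if not in_c.any():
                break
            new_c = remaining[in_c].mean(axis=0)
            if np.allclose(new_c, c):
                break
            c = new_c
        if not in_c.any():
            break                                # nothing left to peel off
        if in_c.sum() >= min_size:               # small patterns are discarded
            centroids.append(c)
        remaining = remaining[~in_c]
    return np.array(centroids)   # K = len(centroids); use these as k-means seeds
```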

11.
12.
I consider a new problem of classification into n (n ≥ 2) disjoint classes based on features of unclassified data. It is assumed that the data are grouped into m (m ≥ n) disjoint sets and that within each set the distribution of features is a mixture of the distributions corresponding to the particular classes. Moreover, the mixing proportions are assumed known and form a matrix of rank n. The idea of the solution is first to estimate the feature densities in all the groups, and then to solve the linear system for the component densities. The proposed classification method is asymptotically optimal, provided a consistent method of density estimation is used. For illustration, the method is applied to determining perfusion status in myocardial infarction patients, using creatine kinase measurements.
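The two-stage idea (estimate group densities, then invert the known mixing matrix) fits in a few lines; the sketch below uses a Gaussian kernel density estimate on a grid and a least-squares solve, both of which are our illustrative choices rather than the paper's prescriptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

def class_densities(group_samples, M, grid):
    """group_samples: list of m 1-D sample arrays; M: known m x n mixing
    matrix of rank n; returns the n class densities evaluated on grid."""
    F_grp = np.vstack([gaussian_kde(s)(grid) for s in group_samples])  # m x |grid|
    F_cls, *_ = np.linalg.lstsq(M, F_grp, rcond=None)  # solve M @ F_cls = F_grp
    return np.clip(F_cls, 0.0, None)                   # densities are nonnegative

# Classify a new observation x into the class with the largest density:
# label = np.argmax([np.interp(x, grid, f) for f in F_cls])
```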

13.
Probabilistic D-Clustering
We present a new iterative method for probabilistic clustering of data. Given clusters, their centers, and the distances of data points from these centers, the probability of cluster membership at any point is assumed inversely proportional to the distance from (the center of) the cluster in question. This assumption is our working principle. The method is a generalization, to several centers, of the Weiszfeld method for solving the Fermat–Weber location problem. At each iteration, the distances (Euclidean, Mahalanobis, etc.) from the cluster centers are computed for all data points, and the centers are updated as convex combinations of these points, with weights determined by the above principle. Computations stop when the centers stop moving.
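A hedged sketch of one full iteration: the membership rule p_k ∝ 1/d_k follows directly from the stated working principle, while the particular weights used to form the convex combinations are our reading of the Weiszfeld generalization, not a verbatim transcription of the paper.

```python
import numpy as np

def prob_d_clustering(X, k, n_iter=200, eps=1e-12, seed=0):
    """Probabilistic D-clustering sketch with Euclidean distances."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]   # initial centres
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + eps
        p = (1.0 / d) / (1.0 / d).sum(axis=1, keepdims=True)  # membership ∝ 1/distance
        w = p ** 2 / d                             # Weiszfeld-style weights (assumed)
        C_new = (w[:, :, None] * X[:, None, :]).sum(0) / w.sum(0)[:, None]
        if np.allclose(C_new, C):                  # centres stopped moving
            break
        C = C_new
    return C, p
```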

14.
A modified CANDECOMP algorithm is presented for fitting the metric version of the Extended INDSCAL model to three-way proximity data. The Extended INDSCAL model assumes, in addition to the common dimensions, a unique dimension for each object. The modified CANDECOMP algorithm fits the Extended INDSCAL model in a dimension-wise fashion and ensures that the subject weights for the common and the unique dimensions are nonnegative. A Monte Carlo study is reported to illustrate that the method is fairly insensitive to the choice of the initial parameter estimates. A second Monte Carlo study shows that the method is able to recover an underlying Extended INDSCAL structure if present in the data. Finally, the method is applied for illustrative purposes to some empirical data on pain relievers. In the final section, some other possible uses of the new method are discussed. Geert De Soete is supported as “Bevoegdverklaard Navorser” of the Belgian “Nationaal Fonds voor Wetenschappelijk Onderzoek”.

15.
Analysis of between-group differences using canonical variates assumes equality of population covariance matrices. Sometimes these matrices are sufficiently different for the null hypothesis of equality to be rejected, but there exist some common features which should be exploited in any analysis. The common principal component model is often suitable in such circumstances, and this model is shown to be appropriate in a practical example. Two methods for between-group analysis are proposed when this model replaces the equal dispersion matrix assumption. One method is by extension of the two-stage approach to canonical variate analysis using sequential principal component analyses as described by Campbell and Atchley (1981). The second method is by definition of a distance function between populations satisfying the common principal component model, followed by metric scaling of the resulting between-populations distance matrix. The two methods are compared with each other and with ordinary canonical variate analysis on the previously introduced data set.

16.
Graphical representation of nonsymmetric relationships data has usually proceeded via separate displays for the symmetric and the skew-symmetric parts of a data matrix. DEDICOM avoids splitting the data into symmetric and skew-symmetric parts, but lacks a graphical representation of the results. Chino's GIPSCAL combines features of both models, but may have a poor goodness-of-fit compared to DEDICOM. We simplify and generalize Chino's method in such a way that it fits the data better. We develop an alternating least squares algorithm for the resulting method, called Generalized GIPSCAL, and adjust it to handle GIPSCAL as well. In addition, we show that Generalized GIPSCAL is a constrained variant of DEDICOM and derive necessary and sufficient conditions for equivalence of the two models. Because these conditions are rather mild, we expect that in many practical cases DEDICOM and Generalized GIPSCAL are (nearly) equivalent, and hence that the graphical representation from Generalized GIPSCAL can be used to display the DEDICOM results graphically. Such a representation is given for an illustration. Finally, we show Generalized GIPSCAL to be a generalization of another method for joint representation of the symmetric and skew-symmetric parts of a data matrix. This research has been made possible by a fellowship from the Royal Netherlands Academy of Arts and Sciences to the first author, and by research grant number A6394 to the second author, from the Natural Sciences and Engineering Research Council of Canada. The authors are obliged to Jos ten Berge and Naohito Chino for stimulating comments.

17.
Bayesian classification is currently of considerable interest. It provides a strategy for eliminating the uncertainty associated with a particular choice of classifier-model parameters, and is the optimal decision-theoretic choice under certain circumstances when there is no single “true” classifier for a given data set. Modern computing capabilities can easily support the Markov chain Monte Carlo sampling that is necessary to carry out the calculations involved, but the information available in these samples is not at present being fully utilised. We show how it can be allied to known results concerning the “reject option” in order to produce an assessment of the confidence that can be ascribed to particular classifications, and how these confidence measures can be used to compare the performances of classifiers. Incorporating these confidence measures can alter the apparent ranking of classifiers as given by straightforward success or error rates. Several possible methods for obtaining confidence assessments are described, and compared on a range of data sets using the Bayesian probabilistic nearest-neighbour classifier.
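One simple way to operationalize the confidence assessment described here, assuming the MCMC samples of class-membership probabilities are already in hand, is to threshold the posterior predictive probabilities in the spirit of the reject option; the array shape conventions and the threshold value are illustrative assumptions.

```python
import numpy as np

def classify_with_reject(prob_draws, threshold=0.8):
    """prob_draws: (n_mcmc, n_points, n_classes) MCMC samples of class
    probabilities.  Average over draws, then withhold classification
    wherever the top predictive probability falls below the threshold."""
    pred = prob_draws.mean(axis=0)          # posterior predictive probabilities
    labels = pred.argmax(axis=1)
    confidence = pred.max(axis=1)
    accept = confidence >= threshold        # the rest invoke the reject option
    return labels, confidence, accept

# Accuracy on accepted points versus coverage trades off as the threshold
# moves, which is how two classifiers can swap ranks once confidence is used.
```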

18.
Optimization Strategies for Two-Mode Partitioning
Two-mode partitioning is a relatively new form of clustering that clusters both rows and columns of a data matrix. In this paper, we consider deterministic two-mode partitioning methods in which a criterion similar to k-means is optimized. A variety of optimization methods have been proposed for this type of problem. However, it is still unclear which method should be used, as various methods may lead to non-global optima. This paper reviews and compares several optimization methods for two-mode partitioning. Several known methods are discussed, and a new fuzzy steps method is introduced. The fuzzy steps method is based on the fuzzy c-means algorithm of Bezdek (1981) and the fuzzy steps approach of Heiser and Groenen (1997) and Groenen and Jajuga (2001). The performances of all methods are compared in a large simulation study. In our simulations, a two-mode k-means optimization method most often gives the best results. Finally, an empirical data set is used to give a practical example of two-mode partitioning. We would like to thank two anonymous referees whose comments have improved the quality of this paper. We are also grateful to Peter Verhoef for providing the data set used in this paper.
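The k-means-like criterion for two-mode partitioning can be sketched as an alternating reassignment of rows and columns to the blocks whose means fit them best; this is a plain deterministic version in the spirit of the paper's two-mode k-means, without the fuzzy-steps refinement.

```python
import numpy as np

def two_mode_kmeans(X, kr, kc, n_iter=100, seed=0):
    """Alternately reassign rows and columns to minimize the squared
    deviation of every cell from its block mean (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    r = rng.integers(kr, size=X.shape[0])        # row-cluster labels
    c = rng.integers(kc, size=X.shape[1])        # column-cluster labels
    for _ in range(n_iter):
        B = np.zeros((kr, kc))                   # current block means
        for i in range(kr):
            for j in range(kc):
                block = X[np.ix_(r == i, c == j)]
                B[i, j] = block.mean() if block.size else X.mean()
        r_new = np.array([min(range(kr), key=lambda i: ((row - B[i, c]) ** 2).sum())
                          for row in X])
        c_new = np.array([min(range(kc), key=lambda j: ((col - B[r_new, j]) ** 2).sum())
                          for col in X.T])
        if np.array_equal(r_new, r) and np.array_equal(c_new, c):
            break                                # partition has stabilized
        r, c = r_new, c_new
    return r, c
```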

19.
A natural extension of classical metric multidimensional scaling is proposed. The result is a new formulation of nonmetric multidimensional scaling in which the strain criterion is minimized subject to order constraints on the disparity variables. Innovative features of the new formulation include: the parametrization of the p-dimensional distance matrices by the positive semidefinite matrices of rank ≤ p; optimization of the (squared) disparity variables, rather than the configuration coordinate variables; and a new nondegeneracy constraint, which restricts the set of (squared) disparities rather than the set of distances. Solutions are obtained using an easily implemented gradient projection method for numerical optimization. The method is applied to two published data sets.

20.
A sequential fitting procedure for linear data analysis models
A particular factor analysis model with parameter constraints is generalized to include classification problems definable within a framework of fitting linear models. The sequential fitting (SEFIT) approach of principal component analysis is extended to include several nonstandard data analysis and classification tasks. SEFIT methods attempt to explain the variability in the initial data (commonly defined by a sum of squares) through an additive decomposition attributable to the various terms in the model. New methods are developed for both traditional and fuzzy clustering that have useful theoretic and computational properties (principal cluster analysis, additive clustering, and so on). Connections to several known classification strategies are also stated. The author is grateful to P. Arabie and L. J. Hubert for editorial assistance and reviewing going well beyond traditional levels.
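The sequential decomposition at the heart of SEFIT reduces, in the plain unconstrained case, to fitting one term at a time to the residuals and recording its additive share of the initial sum of squares; clustering-constrained terms (principal cluster analysis and the like) would replace the rank-one fit in this sketch.

```python
import numpy as np

def sefit(X, n_terms=3):
    """Sequential fitting sketch: extract rank-one terms from successive
    residuals, so the total sum of squares decomposes additively."""
    R = X - X.mean(axis=0)                       # centred data
    total_ss = (R ** 2).sum()
    terms, share = [], []
    for _ in range(n_terms):
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        t = s[0] * np.outer(U[:, 0], Vt[0])      # best rank-one fit to residuals
        terms.append(t)
        share.append((t ** 2).sum() / total_ss)  # additive share of variability
        R = R - t                                # deflate and continue
    return terms, share
```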
