Similar Documents
Found 20 similar documents (search time: 31 ms)
1.
The Kohonen self-organizing map method: An assessment   (cited by: 1; self-citations: 0; other citations: 1)
The “self-organizing map” method, due to Kohonen, is a well-known neural network method. It is closely related to cluster analysis (partitioning) and other methods of data analysis. In this article, we explore some of these close relationships. A number of properties of the technique are discussed. Comparisons with various methods of data analysis (principal components analysis, k-means clustering, and others) are presented. This work has been partially supported for M. Hernández-Pajares by the DGCICIT of Spain under grant No. PB90-0478 and by a CESCA-1993 computer-time grant. Fionn Murtagh is affiliated with the Astrophysics Division, Space Science Department, European Space Agency.
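To illustrate the relationship the abstract draws between the SOM and k-means, here is a minimal sketch of 1-D Kohonen training on hypothetical data (not the authors' code): when the neighborhood width shrinks toward zero, only the winning unit is updated and the rule reduces to online k-means.

```python
import numpy as np

def som_train(data, n_units=4, epochs=20, lr=0.5, sigma=1.0, seed=0):
    """Train a tiny 1-D Kohonen self-organizing map (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initialize codebook vectors from randomly chosen data points.
    units = data[rng.choice(len(data), n_units, replace=False)].astype(float).copy()
    grid = np.arange(n_units)  # positions of the units on the 1-D map
    for _ in range(epochs):
        for x in rng.permutation(data):
            bmu = np.argmin(np.linalg.norm(units - x, axis=1))   # best-matching unit
            h = np.exp(-((grid - bmu) ** 2) / (2 * sigma ** 2))  # neighborhood weights
            units += lr * h[:, None] * (x - units)
        lr *= 0.9  # decay the learning rate between epochs
    return units

# Two well-separated 1-D clusters; with a near-zero neighborhood the update
# only moves the winner, so the two units should land near the two clusters.
data = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
units = som_train(data, n_units=2, sigma=0.01)
```

With a wider `sigma`, neighboring units are dragged along with the winner, which is what gives the SOM its topology-preserving character relative to plain k-means.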

2.
Dendrograms are widely used to represent graphically the clusters and partitions obtained with hierarchical clustering schemes. Espaliers are generalized dendrograms in which the length of horizontal lines is used in addition to their level in order to display the values of two characteristics of each cluster (e.g., the split and the diameter) instead of only one. An algorithm is first presented to transform a dendrogram into an espalier without rotation of any part of the former. This is done by stretching some of the horizontal lines to obtain a diagram with vertical and horizontal lines only, then cutting off by diagonal lines the parts of the horizontal lines exceeding their prescribed length. The problem of determining whether, if rotations are allowed, no diagonal lines are needed is solved by an O(N²) algorithm, where N is the number of entities to be classified. This algorithm is then generalized to obtain espaliers with minimum width and, possibly, some diagonal lines. Work of the first and second authors has been supported by FCAR (Fonds pour la Formation de Chercheurs et l'Aide à la Recherche) grant 92EQ1048, and grant N00014-92-J-1194 from the Office of Naval Research. Work of the first author has also been supported by an NSERC (Natural Sciences and Engineering Research Council of Canada) grant to École des Hautes Études Commerciales, Montréal and by NSERC grant GP0105574. Work of the second author has been supported by NSERC grant GP0036426, by FCAR grant 90NC0305, and by an NSF Professorship for Women in Science at Princeton University from September 1990 until December 1991. Work of the third author was done in part during a visit to GERAD, Montréal.

3.
We consider two fundamental properties in the analysis of two-way tables of positive data: the principle of distributional equivalence, one of the cornerstones of correspondence analysis of contingency tables, and the principle of subcompositional coherence, which forms the basis of compositional data analysis. For an analysis to be subcompositionally coherent, it suffices to analyze the ratios of the data values. A common approach to dimension reduction in compositional data analysis is to perform principal component analysis on the logarithms of ratios, but this method does not obey the principle of distributional equivalence. We show that by introducing weights for the rows and columns, the method achieves this desirable property and can be applied to a wider class of methods. This weighted log-ratio analysis is theoretically equivalent to “spectral mapping”, a multivariate method developed almost 30 years ago for displaying ratio-scale data from biological activity spectra. The close relationship between spectral mapping and correspondence analysis is also explained, as well as their connection with association modeling. The weighted log-ratio methodology is used here to visualize frequency data in linguistics and chemical compositional data in archeology. The first author acknowledges research support from the Fundación BBVA in Madrid as well as partial support by the Spanish Ministry of Education and Science, grant MEC-SEJ2006-14098. The constructive comments of the referees, who also brought additional relevant literature to our attention, significantly improved our article.

4.
Parameters are derived for the distributions of three coefficients of similarity between pairs (dyads) of operational taxonomic units for multivariate binary data (presence/absence of attributes) under statistical independence. These are applied to test independence for dyadic data. Association among attributes within operational taxonomic units is allowed. It is also permissible for the two units in the dyad to be drawn from different populations having different presence probabilities of attributes. The variance of the distribution of the similarity coefficients under statistical independence is shown to be relatively large in many empirical situations. This result implies that the practical interpretation of these coefficients requires much care. An application using the Jaccard index is given for the assessment of consensus between psychotherapists and their clients.
The distribution of similarity coefficients for binary data with associated attributes
Summary: The parameters of the distribution of three similarity coefficients between pairs of operational taxonomic units for multivariate binary data (presence/absence) are derived under the hypothesis of statistical independence. These parameters are used in a test of independence for dyadic data. Association among several attributes within the population of units is allowed. The two units of a dyad may also be drawn from two different populations with different presence probabilities for the attributes. In many empirical situations, the variance of the similarity coefficients can be relatively large under statistical independence. Consequently, these coefficients must be interpreted with caution. An example is given for the Jaccard coefficient, which was used in a study of the agreement between psychotherapists and their clients.
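The Jaccard index used in this abstract is simple to state concretely: the proportion of jointly present attributes among attributes present in at least one member of the dyad. A minimal sketch on hypothetical therapist/client checklists (the data are invented for illustration):

```python
def jaccard(a, b):
    """Jaccard similarity for binary presence/absence vectors:
    jointly present attributes / attributes present in either unit."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

# Hypothetical attribute checklists (1 = attribute endorsed).
therapist = [1, 1, 0, 1, 0, 0, 1, 0]
client    = [1, 0, 0, 1, 0, 1, 1, 0]
print(jaccard(therapist, client))  # 3 shared / 5 endorsed by either = 0.6
```

Note that joint absences (0, 0) do not enter the coefficient at all, which is exactly why its sampling variance under independence needs the care the abstract describes.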

5.
We propose a development stemming from Roux (1988). The principle is to modify the dissimilarities progressively so that every quadruple satisfies not only the additive inequality, as in Roux's method, but also all triangle inequalities. Our method thus ensures that the results are tree distances even when the observed dissimilarities are nonmetric. The method relies on the analytic solution of the least-squares projection onto a tree distance of the dissimilarities attached to a single quadruple. This goal is achieved by geometric reasoning, which also enables an easy proof of the algorithm's convergence. This proof is simpler and more complete than that of Roux (1988) and applies to other similar reduction methods based on local least-squares projection. The method is illustrated using Case's (1978) data. Finally, we provide a comparative study with simulated data and show that our method compares favorably with that of Studier and Keppler (1988), which follows in the ADDTREE tradition (Sattath and Tversky 1977). Moreover, this study seems to indicate that our method's results are generally close to the global optimum in terms of variance accounted for. We offer sincere thanks to Gilles Caraux, Bernard Fichet, Alain Guénoche, and Maurice Roux for helpful discussions, advice, and for reading the preliminary versions of this paper. We are grateful to three anonymous referees and to the editor for many insightful comments. This research was supported in part by the GREG and the IA2 network.
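The additive inequality on quadruples that this abstract refers to is the classical four-point condition: a dissimilarity is a tree distance exactly when, for every quadruple, the two largest of the three pairwise sums are equal. A small checker on hypothetical distances (not the authors' projection algorithm, just the condition it enforces):

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """Check the four-point condition on a symmetric distance dict d[(i, j)]:
    for every quadruple {i, j, k, l}, the two largest of
    d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k) must coincide."""
    points = sorted({p for pair in d for p in pair})
    dist = lambda a, b: d[(a, b)] if (a, b) in d else d[(b, a)]
    for i, j, k, l in combinations(points, 4):
        sums = sorted([dist(i, j) + dist(k, l),
                       dist(i, k) + dist(j, l),
                       dist(i, l) + dist(j, k)])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Path metric of a 4-leaf tree: a and b hang off one internal node, c and d off the other.
tree_d = {("a", "b"): 2, ("a", "c"): 3, ("a", "d"): 3,
          ("b", "c"): 3, ("b", "d"): 3, ("c", "d"): 2}
print(is_additive(tree_d))   # True
tree_d[("a", "c")] = 3.5     # perturb one entry
print(is_additive(tree_d))   # False
```

The method in the abstract repeatedly applies a least-squares correction to each quadruple so that checks like this one eventually pass everywhere.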

6.
A new method and a supporting theorem for designing multiple-class piecewise linear classifiers are described. The method involves the cutting of straight line segments joining pairs of opposed points (i.e., points from distinct classes) in d-dimensional space. We refer to such straight line segments as links. We show how to nearly minimize the number of hyperplanes required to cut all of these links, thereby yielding a near-Bayes-optimal decision surface regardless of the number of classes, and we describe the underlying theory. This method does not require parameters to be specified by users, an improvement over earlier methods. Experiments on multiple-class data obtained from ship images show that classifiers designed by this method yield approximately the same error rate as the best k-nearest neighbor rule, while providing faster decisions. This research was supported in part by the Army Research Office under grant DAAG29-84-K-0208 and in part by the University of California MICRO Program. We thank R. W. Doucette of the U.S. Naval Weapons Center and R. D. Holben of Ford Aerospace Corporation for providing the ship images in our experiments.

7.
Graphical representation of nonsymmetric relationship data has usually proceeded via separate displays for the symmetric and the skew-symmetric parts of a data matrix. DEDICOM avoids splitting the data into symmetric and skew-symmetric parts, but lacks a graphical representation of the results. Chino's GIPSCAL combines features of both models, but may have a poor goodness-of-fit compared to DEDICOM. We simplify and generalize Chino's method in such a way that it fits the data better. We develop an alternating least squares algorithm for the resulting method, called Generalized GIPSCAL, and adjust it to handle GIPSCAL as well. In addition, we show that Generalized GIPSCAL is a constrained variant of DEDICOM and derive necessary and sufficient conditions for the equivalence of the two models. Because these conditions are rather mild, we expect that in many practical cases DEDICOM and Generalized GIPSCAL are (nearly) equivalent, and hence that the graphical representation from Generalized GIPSCAL can be used to display the DEDICOM results graphically. Such a representation is given for an illustration. Finally, we show Generalized GIPSCAL to be a generalization of another method for the joint representation of the symmetric and skew-symmetric parts of a data matrix. This research has been made possible by a fellowship from the Royal Netherlands Academy of Arts and Sciences to the first author, and by research grant number A6394 to the second author from the Natural Sciences and Engineering Research Council of Canada. The authors are obliged to Jos ten Berge and Naohito Chino for stimulating comments.

8.
Data in an experimental array where a nominal dependent variable has m > 2 outcomes may be accounted for by one of a number of possible schemes consisting of J successive and/or parallel independent mᵢ-nomial experiments, where Σmᵢ = m + J − 1. Each such scheme can be represented by a tree diagram which is presumed to be valid everywhere in the array. A criterion based on likelihood is defined to assess the different schemes. The set of outcome probabilities of a scheme is shown to differ from that of all other schemes almost everywhere in the space of parameters. As sample size increases, the probability of correctly inferring the true tree tends to 1. Using Monte Carlo simulation of the four-outcome case, we illustrate, for small sample sizes, how this probability depends on the parameters.
Summary: A family of models is proposed for analyzing a data set in which observations are made on a discrete response variable and an explanatory vector. Each model consists of a series of multinomial experiments whose outcomes are groupings of the categories of the response variable. The probabilities of observing these groupings depend on the explanatory vector through logistic-linear equations. It is easily shown that every model in this family contains the same number of parameters. Moreover, each model corresponds to a tree structure that hierarchically classifies the categories of the response variable: a non-terminal node of the tree represents one of these multinomial experiments and a terminal node represents a category.

Thus the probability of observing a given category is computed by following the path from the root to the terminal node representing that category, and model selection is based on a likelihood criterion computed as the product of the likelihoods evaluated from the data set at each non-terminal node of the tree. It is shown that the predictive capacity for the categories differs from tree to tree, and only the true tree can exhibit the true probabilities almost everywhere in the parameter space. Asymptotic properties of the criterion are also established, ensuring that the true model is selected with probability tending to 1. A Monte Carlo simulation study illustrates, for small samples, how the probability of selecting the true model depends on the parameter values.

9.
Consider N entities to be classified, with given weights, and a matrix of dissimilarities between pairs of them. The split of a cluster is the smallest dissimilarity between an entity in that cluster and an entity outside it. The single-linkage algorithm provides partitions into M clusters for which the smallest split is maximum. We consider the problems of finding maximum split partitions with exactly M clusters and with at most M clusters subject to the additional constraint that the sum of the weights of the entities in each cluster never exceeds a given bound. These two problems are shown to be NP-hard and reducible to a sequence of bin-packing problems. An O(N²) algorithm for the particular case M = N of the second problem is also presented. Computational experience is reported. Acknowledgments: Work of the first author was supported in part by AFOSR grants 0271 and 0066 to Rutgers University and was done in part during a visit to GERAD, École Polytechnique de Montréal, whose support is gratefully acknowledged. Work of the second and third authors was supported by NSERC grant GP0036426 and by FCAR grant 89EQ4144. We are grateful to Silvano Martello and Paolo Toth for making available to us their program MTP for the bin-packing problem and to three anonymous referees for comments which helped to improve the presentation of the paper.
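The notion of split that this abstract (and item 20 below) builds on is easy to compute directly. A minimal sketch on hypothetical dissimilarities, evaluating the split of each cluster of a given partition (this is only the evaluation step, not the authors' optimization algorithms):

```python
def split(cluster, others, d):
    """Smallest dissimilarity between an entity inside the cluster and one outside."""
    return min(d[frozenset((i, j))] for i in cluster for j in others)

def partition_splits(partition, d):
    """Split of every cluster in a partition of the entities."""
    entities = [e for c in partition for e in c]
    return [split(c, [e for e in entities if e not in c], d) for c in partition]

# Hypothetical dissimilarities over 4 entities.
d = {frozenset(p): v for p, v in {
    (1, 2): 1.0, (1, 3): 4.0, (1, 4): 5.0,
    (2, 3): 3.0, (2, 4): 6.0, (3, 4): 2.0}.items()}
print(partition_splits([[1, 2], [3, 4]], d))  # [3.0, 3.0]
```

Maximizing the smallest of these values over all partitions into M clusters is exactly what single-linkage achieves; the weight-constrained variants studied here are the NP-hard ones.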

10.
In agglomerative hierarchical clustering, pair-group methods suffer from a problem of non-uniqueness when two or more distances between different clusters coincide during the amalgamation process. The traditional approach to this drawback has been to adopt an arbitrary criterion for breaking ties between distances, which results in different hierarchical classifications depending on the criterion followed. In this article we propose a variable-group algorithm that consists in grouping more than two clusters at the same time when ties occur. We give a tree representation for the results of the algorithm, which we call a multidendrogram, as well as a generalization of the Lance and Williams formula which enables the implementation of the algorithm in a recursive way. The authors thank A. Arenas for discussion and helpful comments. This work was partially supported by DGES of the Spanish Government, Project No. FIS2006-13321-C02-02, and by a grant of Universitat Rovira i Virgili.
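The variable-group idea can be sketched in miniature: instead of breaking a tie arbitrarily, merge every cluster involved in a tied minimum distance in one step. This simplified sketch (not the published algorithm) uses single-linkage between-cluster distances and, for brevity, merges all tied clusters into a single group; the paper's method handles connected tie-groups separately and generalizes the Lance and Williams recurrence.

```python
def variable_group_step(clusters, d):
    """One agglomeration step that merges all clusters tied at the minimum
    between-cluster distance, rather than picking one pair arbitrarily."""
    def cdist(a, b):  # single-linkage distance between two clusters
        return min(d[frozenset((i, j))] for i in a for j in b)
    pairs = [(cdist(a, b), ia, ib)
             for ia, a in enumerate(clusters)
             for ib, b in enumerate(clusters) if ia < ib]
    dmin = min(p[0] for p in pairs)
    tied = {i for dist, ia, ib in pairs if dist == dmin for i in (ia, ib)}
    merged = sorted(set().union(*(set(clusters[i]) for i in tied)))
    rest = [c for i, c in enumerate(clusters) if i not in tied]
    return rest + [merged], dmin

# d(1,2) = d(2,3): a pair-group method must choose {1,2} or {2,3};
# the variable-group step merges {1,2,3} at height 1.0 in one go.
d = {frozenset((1, 2)): 1.0, frozenset((2, 3)): 1.0, frozenset((1, 3)): 2.0}
clusters, h = variable_group_step([[1], [2], [3]], d)
print(clusters, h)  # [[1, 2, 3]] 1.0
```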

11.
It is common practice to perform a principal component analysis (PCA) on a correlation matrix to represent graphically the relations among numerous variables. In such a situation, the variables may be considered as points on the unit hypersphere of a Euclidean space, and PCA provides a kind of best fit of these points within a subspace. Taking their particular position into account, this paper suggests representing the variables on an optimal three-dimensional unit sphere.
Summary: It is classical to use principal component analysis to represent a correlation matrix graphically. In such a situation, the variables can be considered as points on the unit hypersphere of a Euclidean space, and principal component analysis yields a good approximation of these points by means of a Euclidean subspace. Taking this geometric situation into account, the present article suggests representing the variables on an optimal three-dimensional sphere.
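The starting point of the abstract, that standardized variables are points on a unit hypersphere, can be made concrete. This sketch (synthetic data, not the paper's optimal fit) takes the loadings of each variable on the first three principal components of a correlation matrix and renormalizes each loading vector to unit length, putting the variables on a three-dimensional unit sphere:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.standard_normal(200)  # make two variables correlate

R = np.corrcoef(X, rowvar=False)              # correlation matrix of the 5 variables
vals, vecs = np.linalg.eigh(R)                # eigenvalues in ascending order
order = np.argsort(vals)[::-1][:3]            # keep the three largest components
load = vecs[:, order] * np.sqrt(vals[order])  # loadings: correlations with the PCs

# Push each variable's 3-D loading point back onto the unit sphere.
sphere = load / np.linalg.norm(load, axis=1, keepdims=True)
print(np.linalg.norm(sphere, axis=1))  # all 1.0 by construction
```

Highly correlated variables (columns 0 and 3 here) end up close together on the sphere, which is the geometric reading of a correlation circle in three dimensions.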

12.
In this paper, we consider an entropy criterion to estimate the number of clusters arising from a mixture model. This criterion is derived from a relation linking the likelihood and the classification likelihood of a mixture. Its performance is investigated through Monte Carlo experiments, and it shows favorable results compared to other classical criteria.
Summary: We propose an entropy criterion for assessing the number of clusters in a partition, based on a probability mixture model. The criterion is derived from a relation linking the likelihood and the classification likelihood of a mixture. Monte Carlo simulations illustrate its qualities relative to more classical criteria.

13.
Non-symmetrical correspondence analysis (NSCA) is a very practical statistical technique for identifying the structure of association between asymmetrically related categorical variables forming a contingency table. This paper considers some tools for exploring this association numerically and graphically in detail, including the use of confidence regions, the establishment of the link between NSCA and the analysis of variance of categorical variables, and the effect of imposing linear constraints on a variable. The authors would like to thank the anonymous referees for their comments and suggestions during the preparation of this paper.

14.
We describe a simple time series transformation to detect differences in series that can be accurately modelled as stationary autoregressive (AR) processes. The transformation involves forming the histogram of above- and below-the-mean run lengths. The run length (RL) transformation has the benefits of being very fast, compact, and updatable for new data in constant time. Furthermore, it can be generated directly from data that has already been highly compressed. We first establish the theoretical asymptotic relationship between run length distributions and AR models through consideration of the zero crossing probability and the distribution of runs. We benchmark our transformation against two alternatives: the truncated autocorrelation function (ACF) transform and the AR transform, which involves the standard method of fitting the partial autocorrelation coefficients with the Durbin-Levinson recursions and using the Akaike Information Criterion as a stopping procedure. Whilst optimal in the idealized scenario, representing the data in these ways is time consuming and the representation cannot be updated online for new data. We show that for classification problems the accuracy obtained by using the run length distribution tends towards that obtained from the full fitted models. We then propose three alternative distance measures for run length distributions based on Gower's general similarity coefficient, the likelihood ratio, and dynamic time warping (DTW). Through simulated classification experiments we show that a nearest neighbour distance based on DTW converges to the optimum faster than classifiers based on Euclidean distance, Gower's coefficient, and the likelihood ratio.
We experiment with a variety of classifiers and demonstrate that although the RL transform requires more data than the best performing classifier to achieve the same accuracy as AR or ACF, this factor is at worst non-increasing with the series length, m, whereas the relative time taken to fit AR and ACF increases with m. We conclude that if the data is stationary and can be suitably modelled by an AR series, and if time is an important factor in reaching a discriminatory decision, then the run length distribution transform is a simple and effective transformation to use.
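The RL transform itself, a histogram of the lengths of consecutive runs above or below the series mean, is a few lines of code. A minimal sketch on a hypothetical series (the paper's distance measures and classifiers are not reproduced here):

```python
from collections import Counter

def run_length_histogram(series):
    """Histogram of run lengths of above/below-the-mean excursions:
    the run length (RL) transform in miniature."""
    mean = sum(series) / len(series)
    signs = [x >= mean for x in series]   # above (True) or below (False) the mean
    runs, current = [], 1
    for prev, cur in zip(signs, signs[1:]):
        if cur == prev:
            current += 1                  # run continues
        else:
            runs.append(current)          # run ends at a mean crossing
            current = 1
    runs.append(current)
    return Counter(runs)

# Series mean is ~4.14: signs are F F T T F F T, giving runs 2, 2, 2, 1.
print(run_length_histogram([1, 2, 9, 8, 1, 1, 7]))
```

The constant-time updatability claimed in the abstract follows from the fact that a new observation only extends or closes the current run, so the histogram can be maintained incrementally.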

15.
A low-dimensional representation of multivariate data is often sought when the individuals belong to a set of a priori groups and the objective is to highlight between-group variation relative to that within groups. If all the data are continuous, then this objective can be achieved by means of canonical variate analysis, but no corresponding technique exists when the data are categorical or mixed continuous and categorical. On the other hand, if there is no a priori grouping of the individuals, then ordination of any form of data can be achieved by use of metric scaling (principal coordinate analysis). In this paper we consider a simple extension of the latter approach to incorporate grouped data, and discuss to what extent this method can be viewed as a generalization of canonical variate analysis. Some illustrative examples are also provided.

16.
Percept variance is shown to change the additive property of city-block distances and make city-block distances more subadditive than Euclidean distances. Failure to account for percept variance will result in the misclassification of city-block data as Euclidean. A maximum likelihood estimation procedure is proposed for the multidimensional scaling of similarity data characterized by percept variance. Monte Carlo and empirical experiments are used to evaluate the proposed approach.
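The additivity property at stake in this abstract is easy to demonstrate on hypothetical points: city-block distances add exactly along a coordinate-monotone path through an intermediate point, whereas Euclidean distances are strictly subadditive unless the points are collinear. (This sketch only illustrates the noiseless property; the paper's point is that percept variance erodes it.)

```python
def cityblock(a, b):
    """City-block (L1) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean (L2) distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

p, q, r = (0, 0), (1, 2), (2, 4)   # q lies on a monotone path from p to r
print(cityblock(p, q) + cityblock(q, r) == cityblock(p, r))  # True: additive

s = (1, 0)                          # s is off the straight line from p to r
print(euclidean(p, s) + euclidean(s, r) > euclidean(p, r))   # True: subadditive
```

It is this sharper additivity of the city-block metric that noise destroys, which is why noisy city-block data can masquerade as Euclidean.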

17.
Tree enumeration modulo a consensus   (cited by: 1; self-citations: 1; other citations: 0)
The number of trees with n labeled terminal vertices grows too rapidly with n to permit exhaustive searches for Steiner trees or other kinds of optima in cladistics and related areas. Often, however, structured constraints are known and may be imposed on the set of trees to be scanned. These constraints may be formulated in terms of a consensus among the trees to be searched. We calculate the reduction in the number of trees to be enumerated as a function of properties of the imposed consensus. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada through operating grant A8867 to D. Sankoff and infrastructure grant A3092 to D. Sankoff, R. J. Cedergren and G. Lapalme. We are grateful to William H. E. Day for much encouragement and many helpful suggestions.

18.
The method of nearest-neighbor interchange effects local improvements in a binary tree by replacing a 4-subtree by one of its two alternatives if this improves the objective function. We extend this to k-subtrees to reduce the number of local optima. Possible sequences of k-subtrees to be examined are produced by moving a window over the tree, incorporating one edge at a time while deactivating another. The direction of this movement is chosen according to a hill-climbing strategy. The algorithm includes a backtracking component. Series of simulations of molecular evolution data/parsimony analysis are carried out, for k = 4, ..., 8, contrasting the hill-climbing strategy with one based on a random choice of the next window, and comparing two stopping rules. Increasing the window size k is found to be the most effective way of improving the local optimum, followed by the choice of hill-climbing over the random strategy. A suggestion for achieving higher values of k is based on a recursive use of the hill-climbing strategy. Acknowledgments: This work was supported in part by grants to the first author from the Natural Sciences and Engineering Research Council (Canada) and the Fonds pour la formation de chercheurs et l'aide à la recherche (Québec), and to the third author from the Danish Research Council. The first author is a fellow of the Canadian Institute for Advanced Research. Much of the research was carried out in the spring of 1991 while the first author was visiting the University of Geneva; warmest thanks are due Professor Claude Weber for this opportunity.

19.
An error variance approach to two-mode hierarchical clustering   (cited by: 2; self-citations: 2; other citations: 0)
A new agglomerative method is proposed for the simultaneous hierarchical clustering of row and column elements of a two-mode data matrix. The procedure yields a nested sequence of partitions of the union of two sets of entities (modes). A two-mode cluster is defined as the union of subsets of the respective modes. At each step of the agglomerative process, the algorithm merges those clusters whose fusion results in the smallest possible increase in an internal heterogeneity measure. This measure takes into account both the variance within the respective cluster and its centroid effect defined as the squared deviation of its mean from the maximum entry in the input matrix. The procedure optionally yields an overlapping cluster solution by assigning further row and/or column elements to clusters existing at a preselected hierarchical level. Applications to real data sets drawn from consumer research concerning brand-switching behavior and from personality research concerning the interaction of behaviors and situations demonstrate the efficacy of the method at revealing the underlying two-mode similarity structure.

20.
Maximum sum-of-splits clustering   (cited by: 1; self-citations: 1; other citations: 0)
Consider N entities to be classified, and a matrix of dissimilarities between pairs of them. The split of a cluster is the smallest dissimilarity between an entity of this cluster and an entity outside it. The single-linkage algorithm provides partitions into M clusters for which the smallest split is maximum. We study here the average split of the clusters or, equivalently, the sum of splits. An O(N²) algorithm is provided to determine maximum sum-of-splits partitions into M clusters for all M between N − 1 and 2, using the dual graph of the single-linkage dendrogram.
Summary: Let N objects be classified, with a matrix of dissimilarities between pairs of these objects. The split of a class is the smallest dissimilarity between an object of that class and an object outside it. The single-linkage algorithm provides partitions into M classes whose smallest split is maximum. We study the average split of the classes or, equivalently, the sum of the splits. An O(N²) algorithm is proposed to determine partitions into M classes whose sum of splits is maximum, for M ranging from N − 1 to 2, based on the dual graph of the single-linkage dendrogram.
