期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering

Fionn Murtagh 《Journal of Classification》2009,26(3):249-277

An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We discuss also application to very high frequency time series segmentation and modeling. 相似文献

2.

Directed Binary Hierarchies and Directed Ultrametrics

I-C. Lerman P. Kuntz 《Journal of Classification》2011,28(3):272-296

Directed binary hierarchies have been introduced in order to give a graphical reduced representation of a family of association rules. This type of structure extends the classical binary hierarchical classification in a very specific way. In this paper an accurate formalization of this new structure is studied. A directed hierarchy is defined as a set of ordered pairs of subsets of the initial individual set satisfying specific conditions. A new notion of directed ultrametricity is studied. The main result consists in establishing a bijective correspondence between a directed ultrametric space and a directed binary hierarchy. Finally, an algorithm is proposed in order to transform a directed ultrametric structure into a graphical representation associated with a directed binary hierarchy. 相似文献

3.

L 1 Optimization under Linear Inequality Constraints

T.J. Smith 《Journal of Classification》2000,17(2):225-242

1 optimization under linear inequality constraints based upon iteratively reweighted iterative projection (or IRIP). IRIP is compared to a linear programming (LP) strategy for L₁ minimization (Sp?th 1987, Chapter 5.3) using the ultrametric condition as an exemlar class of constraints to be fitted. Coded for general constraints, the LP approach proves to be faster. Both methods, however, suffer from a serious limitation in being unable to process reasonably-sized data sets because of storage requirements for the constraints. When the simplicity of vector projections is used to allow IRIP to be coded for specific (in this case, ultrametric) constraints, we obtain a fast and efficient algorithm capable of handling large data sets. It is also possible to extend IRIP to operate as a heuristic search strategy that simultaneously identifies both a reasonable set of constraints to impose and the optimally-estimated parameters satisfying these constraints. A few noteworthy characteristics of L₁ optimal ultrametrics are discussed, including other strategies for reformulating the ultrametric optimization problem. 相似文献

4.

Fast, Linear Time Hierarchical Clustering using the Baire Metric

Pedro Contreras Fionn Murtagh 《Journal of Classification》2012,29(2):118-143

The Baire metric induces an ultrametric on a dataset and is of linear computational complexity, contrasted with the standard quadratic time agglomerative hierarchical clustering algorithm. In this work we evaluate empirically this new approach to hierarchical clustering. We compare hierarchical clustering based on the Baire metric with (i) agglomerative hierarchical clustering, in terms of algorithm properties; (ii) generalized ultrametrics, in terms of definition; and (iii) fast clustering through k-means partitioning, in terms of quality of results. For the latter, we carry out an in depth astronomical study. We apply the Baire distance to spectrometric and photometric redshifts from the Sloan Digital Sky Survey using, in this work, about half a million astronomical objects. We want to know how well the (more costly to determine) spectrometric redshifts can predict the (more easily obtained) photometric redshifts, i.e. we seek to regress the spectrometric on the photometric redshifts, and we use clusterwise regression for this. 相似文献

5.

Degenerating Families of Dendrograms

Patrick Erik Bradley 《Journal of Classification》2008,25(1):27-42

Dendrograms used in data analysis are ultrametric spaces, hence objects of nonarchimedean geometry. It is known that there exist p-adic representations of dendrograms. Completed by a point at infinity, they can be viewed as subtrees of the Bruhat-Tits tree associated to the p-adic projective line. The implications are that certain moduli spaces known in algebraic geometry are in fact p-adic parameter spaces of dendrograms, and stochastic classification can also be handled within this framework. At the end, we calculate the topology of the hidden part of a dendrogram. 相似文献

6.

Metric family portraits

George W. Furnas 《Journal of Classification》1989,6(1):7-52

By associating a whole distance matrix with a single point in a parameter space, a family of matrices (e.g., all those obeying the triangle inequality) can be shown as a cloud of points. Pictures of the cloud form a family portrait, and its characteristic shape and interrelationship with the portraits of other families can be explored. Critchley (unpublished) used this approach to illustrate, for distances between three points, algebraic results on the nesting relations between various metrics. In this paper, these diagrams are further investigated and then generalized. In the first generalization, projective geometry is used to allow the geometric representation of Additive Mixture, Additive Constant, and Missing Data problems. Then the six-dimensional portraits of four-point distance matrices are studied, revealing differences between the Euclidean, Additive Tree, and General Metric families. The paper concludes with caveats and insights concerning families of generaln-point metric matrices. 相似文献

7.

Canonical Analysis: Ranks,Ratios and Fits

Casper J. Albers John C. Gower 《Journal of Classification》2014,31(1):2-27

Measurements of p variables for n samples are collected into a n×p matrix X, where the samples belong to one of k groups. The group means are separated by Mahalanobis distances. CVA optimally represents the group means of X in an r-dimensional space. This can be done by maximizing a ratio criterion (basically one- dimensional) or, more flexibly, by minimizing a rank-constrained least-squares fitting criterion (which is not confined to being one-dimensional but depends on defining an appropriate Mahalanobis metric). In modern n < p problems, where W is not of full rank, the ratio criterion is shown not to be coherent but the fit criterion, with an attention to associated metrics, readily generalizes. In this context we give a unified generalization of CVA, introducing two metrics, one in the range space of W and the other in the null space of W, that have links with Mahalanobis distance. This generalization is computationally efficient, since it requires only the spectral decomposition of a n×n matrix. 相似文献

8.

Piecewise hierarchical clustering 总被引：1，自引：0，他引：1

Gildas Brossier 《Journal of Classification》1990,7(2):197-216

We consider two or more ultrametric distance matrices defined over different, possibly overlapping, subsets. These matrices are merged into one ultrametric matrix defined over the whole set. Necessary and sufficient conditions for uniqueness of the merging are established. when these conditions are not satisfied, consistent algorithms are given.The author thanks B. Monjardet and two anonymous referees for their helpful comments. 相似文献

9.

A validation study of a variable weighting algorithm for cluster analysis 总被引：1，自引：0，他引：1

Glenn W. Milligan 《Journal of Classification》1989,6(1):53-71

De Soete (1986, 1988) proposed a variable weighting procedure when Euclidean distance is used as the dissimilarity measure with an ultrametric hierarchical clustering method. The algorithm produces weighted distances which approximate ultrametric distances as closely as possible in a least squares sense. The present simulation study examined the effectiveness of the De Soete procedure for an applications problem for which it was not originally intended. That is, to determine whether or not the algorithm can be used to reduce the influence of variables which are irrelevant to the clustering present in the data. The simulation study examined the ability of the procedure to recover a variety of known underlying cluster structures. The results indicate that the algorithm is effective in identifying extraneous variables which do not contribute information about the true cluster structure. Weights near 0.0 were typically assigned to such extraneous variables. Furthermore, the variable weighting procedure was not adversely effected by the presence of other forms of error in the data. In general, it is recommended that the variable weighting procedure be used for applied analyses when Euclidean distance is employed with ultrametric hierarchical clustering methods. 相似文献

10.

Between-Group Metrics 总被引：1，自引：1，他引：0

John C. Gower Casper J. Albers 《Journal of Classification》2011,28(3):315-326

In canonical analysis with more variables than samples, it is shown that, as well as the usual canonical means in the range-space of the within-groups dispersion matrix, canonical means may be defined in its null space. In the range space we have the usual Mahalanobis metric; in the null space explicit expressions are given and interpreted for a new metric. 相似文献

11.

Constructing Ultrametric and Additive Trees Based on the L 1 Norm

Thomas J. Smith 《Journal of Classification》2001,18(2):185-207

L ₁) criterion. Examples of ultrametric and additive trees fitted to two extant data sets are given, plus a Monte Carlo analysis to assess the impact of both typical data error and extreme values on fitted trees. Solutions are compared to the least-squares (L ₂) approach of Hubert and Arabie (1995a), with results indicating that (with these data) the L ₁ and L ₂ optimization strategies perform very similarly. A number of observations are made concerning possible uses of an L ₁ approach, the nature and number of identified locally optimal solutions, and metric recovery differences between ultrametrics and additive trees. 相似文献

12.

Radial basis functions for exploratory data analysis: An iterative majorisation approach for Minkowski distances based on multidimensional scaling

Andrew R. Webb 《Journal of Classification》1997,14(2):249-267

This paper considers the use of radial basis functions for exploratory data analysis. These are used to model a transformation from a high-dimensional observation space to a low-dimensional one. The parameters of the model are determined by optimising a loss function defined to be the stress function in multidimensional scaling. The metric for the low-dimensional space is taken to be the Minkowski metric with order parameter 1<-p<-2. A scheme based on iterative majorisation is proposed. 相似文献

13.

Three-way distances

S. Joly G. Le Calvé 《Journal of Classification》1995,12(2):191-205

In this paper, dissimilarity relations are defined on triples rather than on dyads. We give a definition of a three-way distance analogous to that of the ordinary two-way distance. It is shown, as a straightforward generalization, that it is possible to define three-way ultrametric, three-way star, and three-way Euclidean distances. Special attention is paid to a model called the semi-perimeter model. We construct new methods analogous to the existing ones for ordinary distances, for example: principal coordinates analysis, the generalized Prim (1957) algorithm, hierarchical cluster analysis. 相似文献

14.

Optimal Variable Weighting for Ultrametric and Additive Trees and K-means Partitioning: Methods and Software 总被引：1，自引：0，他引：1

Vladimir Makarenkov Pierre Legendre 《Journal of Classification》2001,18(2):245-271

K -means partitioning. We also describe some new features and improvements to the algorithm proposed by De Soete. Monte Carlo simulations have been conducted using different error conditions. In all cases (i.e., ultrametric or additive trees, or K-means partitioning), the simulation results indicate that the optimal weighting procedure should be used for analyzing data containing noisy variables that do not contribute relevant information to the classification structure. However, if the data involve error-perturbed variables that are relevant to the classification or outliers, it seems better to cluster or partition the entities by using variables with equal weights. A new computer program, OVW, which is available to researchers as freeware, implements improved algorithms for optimal variable weighting for ultrametric and additive tree clustering, and includes a new algorithm for optimal variable weighting for K-means partitioning. 相似文献

15.

Least squares algorithms for constructing constrained ultrametric and additive tree representations of symmetric proximity data 总被引：1，自引：1，他引：0

Geert De Soete J. Douglas Carroll Wayne S. DeSarbo 《Journal of Classification》1987,4(2):155-173

A mathematical programming algorithm is developed for fitting ultrametric or additive trees to proximity data where external constraints are imposed on the topology of the tree. The two procedures minimize a least squares loss function. The method is illustrated on both synthetic and real data. A constrained ultrametric tree analysis was performed on similarities between 32 subjects based on preferences for ten odors, while a constrained additive tree analysis was carried out on some proximity data between kinship terms. Finally, some extensions of the methodology to other tree fitting procedures are mentioned.The first author is supported as Bevoegdverklaard Navorser of the Belgian Nationaal Fonds voor Wetenschappelijk Onderzoek. 相似文献

16.

Assessing Congruence Among Ultrametric Distance Matrices

Véronique Campbell Pierre Legendre François-Joseph Lapointe 《Journal of Classification》2009,26(1):103-117

相似文献

17.

Metric inference for social networks

David Banks Kathleen Carley 《Journal of Classification》1994,11(1):121-149

Using a natural metric on the space of networks, we define a probability measure for network-valued random variables. This measure is indexed by two parameters, which are interpretable as a location parameter and a dispersion parameter. From this structure, one can develop maximum likelihood estimates, hypothesis tests and confidence regions, all in the context of independent and identically distributed networks. The value of this perspective is illustrated through application to portions of the friedship cognitive social structure data gathered by Krackhardt (1987).We thank Ove Frank, David Krackhardt, the editor and the referees for their constructive comments and suggestions. 相似文献

18.

A computationally efficient approximation to the nearest neighbor interchange metric 总被引：3，自引：3，他引：0

Edward K. Brown William H. E. Day 《Journal of Classification》1984,1(1):93-124

The nearest neighbor interchange (nni) metric is a distance measure providing a quantitative measure of dissimilarity between two unrooted binary trees with labeled leaves. The metric has a transparent definition in terms of a simple transformation of binary trees, but its use in nontrivial problems is usually prevented by the absence of a computationally efficient algorithm. Since recent attempts to discover such an algorithm continue to be unsuccessful, we address the complementary problem of designing an approximation to the nni metric. Such an approximation should be well-defined, efficient to compute, comprehensible to users, relevant to applications, and a close fit to the nni metric; the challenge, of course, is to compromise these objectives in such a way that the final design is acceptable to users with practical and theoretical orientations. We describe an approximation algorithm that appears to satisfy adequately these objectives. The algorithm requires O(n) space to compute dissimilarity between binary trees withn labeled leaves; it requires O(n logn) time for rooted trees and O(n ² logn) time for unrooted trees. To help the user interpret the dissimilarity measures based on this algorithm, we describe empirical distributions of dissimilarities between pairs of randomly selected trees for both rooted and unrooted cases.The Natural Sciences and Engineering Research Council of Canada partially supported this work with Grant A-4142. 相似文献

19.

A graph-theoretic method for organizing overlapping clusters into trees,multiple trees,or extended trees

Carroll J. Douglas Corter James E. 《Journal of Classification》1995,12(2):283-313

A clustering that consists of a nested set of clusters may be represented graphically by a tree. In contrast, a clustering that includes non-nested overlapping clusters (sometimes termed a “nonhierarchical” clustering) cannot be represented by a tree. Graphical representations of such non-nested overlapping clusterings are usually complex and difficult to interpret. Carroll and Pruzansky (1975, 1980) suggested representing non-nested clusterings with multiple ultrametric or additive trees. Corter and Tversky (1986) introduced the extended tree (EXTREE) model, which represents a non-nested structure as a tree plus overlapping clusters that are represented by marked segments in the tree. We show here that the problem of finding a nested (i.e., tree-structured) set of clusters in an overlapping clustering can be reformulated as the problem of finding a clique in a graph. Thus, clique-finding algorithms can be used to identify sets of clusters in the solution that can be represented by trees. This formulation provides a means of automatically constructing a multiple tree or extended tree representation of any non-nested clustering. The method, called “clustrees”, is applied to several non-nested overlapping clusterings derived using the MAPCLUS program (Arabie and Carroll 1980). 相似文献

20.

Ultrametric tree representations of incomplete dissimilarity data 总被引：2，自引：2，他引：0

Geert De Soete 《Journal of Classification》1984,1(1):235-242

The least squares algorithm for fitting ultrametric trees to proximity data originally proposed by Carroll and Pruzansky and further elaborated by De Soete is extended to handle missing data. A Monte Carlo evaluation reveals that the algorithm is capable of recovering an ultrametric tree underlying an incomplete set of error-perturbed dissimilarities quite well.Geert De Soete is Aangesteld Navorser of the Belgian National Fonds voor Wetenschappelijk Onderzoek. 相似文献