Found 20 similar documents; search took 31 ms
1.
Tudor B. Ionescu Géraldine Polaillon Frédéric Boulanger 《Journal of Classification》2010,27(2):136-157
We present a new distance based quartet method for phylogenetic tree reconstruction, called Minimum Tree Cost Quartet Puzzling.
Starting from a distance matrix computed from natural data, the algorithm incrementally constructs a tree by adding one taxon
at a time to the intermediary tree using a cost function based on the relaxed 4-point condition for weighting quartets. Different
input orders of taxa lead to trees having distinct topologies which can be evaluated using a maximum likelihood or weighted
least squares optimality criterion. Using reduced sets of quartets and a simple heuristic tree search strategy we obtain an
overall complexity of O(n^5 log^2 n) for the algorithm. We evaluate the performance of the method through comparative tests and show that our method outperforms
NJ when a weighted least squares optimality criterion is employed. We also discuss the theoretical boundaries of the algorithm.
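For illustration, the classical (unrelaxed) four-point condition that underlies quartet weighting can be sketched as follows. This is a minimal sketch under stated assumptions, not the cost function of the paper: the function names and the toy distance matrix are hypothetical, and the paper's relaxed condition and weighting scheme are not reproduced here.

```python
def quartet_sums(d, a, b, c, e):
    """Return the three pairwise-sum scores of a quartet, keyed by split."""
    return {
        ((a, b), (c, e)): d[a][b] + d[c][e],
        ((a, c), (b, e)): d[a][c] + d[b][e],
        ((a, e), (b, c)): d[a][e] + d[b][c],
    }


def best_split(d, a, b, c, e):
    """Pick the split with the smallest sum (the one a tree metric supports)."""
    sums = quartet_sums(d, a, b, c, e)
    return min(sums, key=sums.get)


def four_point_gap(d, a, b, c, e):
    """Difference between the two largest sums; 0 for an exact tree metric."""
    s = sorted(quartet_sums(d, a, b, c, e).values())
    return s[2] - s[1]


# Hypothetical toy tree ((0,1),(2,3)): pendant edges of length 1,
# internal edge of length 2.
D = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
```

On real data the gap is rarely exactly zero, which is presumably why a relaxed version of the condition is used to weight quartets.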
2.
3.
A two-level data set consists of entities of a higher level (say populations), each one being composed of several units of
the lower level (say individuals). Observations are made at the individual level, whereas population characteristics are aggregated
from individual data. Cluster analysis with subsampling of populations is a cluster analysis based on individual data that
aims at clustering populations rather than individuals. In this article, we extend existing optimality criteria for cluster
analysis with subsampling of populations to deal with situations where population characteristics are not the mean of individual
data. A new criterion that depends on the Mahalanobis distance is also defined. The criteria are compared using simulated
examples and an ecological data set of tree species in a tropical rain forest.
4.
Michael W. Trosset 《Journal of Classification》1998,15(1):15-35
A natural extension of classical metric multidimensional scaling is proposed. The result is a new formulation of nonmetric
multidimensional scaling in which the strain criterion is minimized subject to order constraints on the disparity variables.
Innovative features of the new formulation include: the parametrization of the p-dimensional distance matrices by the positive semidefinite matrices of rank ≤p; optimization of the (squared) disparity variables, rather than the configuration coordinate variables; and a new nondegeneracy
constraint, which restricts the set of (squared) disparities rather than the set of distances. Solutions are obtained using
an easily implemented gradient projection method for numerical optimization. The method is applied to two published data
sets.
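As background, classical metric multidimensional scaling starts from the double-centered matrix of squared distances, and the strain criterion measures how well a low-rank positive semidefinite matrix fits it. The sketch below shows only that double-centering step in plain Python; the function name and the three-point example are hypothetical, and the order-constrained formulation of the paper is not implemented.

```python
def double_center(sq_dist):
    """Return B = -1/2 * J * D2 * J for a symmetric squared-distance matrix D2."""
    n = len(sq_dist)
    row_means = [sum(row) / n for row in sq_dist]
    grand_mean = sum(row_means) / n
    # Symmetry of sq_dist means column means equal row means.
    return [
        [
            -0.5 * (sq_dist[i][j] - row_means[i] - row_means[j] + grand_mean)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical one-dimensional configuration.
points = [0.0, 1.0, 3.0]
sq_dist = [[(p - q) ** 2 for q in points] for p in points]
B = double_center(sq_dist)
```

For exact Euclidean distances, B equals the inner-product matrix of the centered coordinates, so its rank gives the dimension needed to reproduce the distances.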
5.
6.
Spectral analysis of phylogenetic data (total citations: 12, self-citations: 0, citations by others: 12)
The spectral analysis of sequence and distance data is a new approach to phylogenetic analysis. For two-state character sequences,
the character values at a given site split the set of taxa into two subsets, a bipartition of the taxa set. The vector which
counts the relative numbers of each of these bipartitions over all sites is called a sequence spectrum. Applying a transformation
called a Hadamard conjugation, the sequence spectrum is transformed to the conjugate spectrum. This conjugation corrects for
unobserved changes in the data, independently from the choice of phylogenetic tree. For any given phylogenetic tree with edge
weights (probabilities of state change), we define a corresponding tree spectrum. The selection of a weighted phylogenetic
tree from the given sequence data is made by matching the conjugate spectrum with a tree spectrum. We develop an optimality
selection procedure using a least squares best fit, to find the phylogenetic tree whose tree spectrum most closely matches
the conjugate spectrum. An inferred sequence spectrum can be derived from the selected tree spectrum using the inverse Hadamard
conjugation to allow a comparison with the original sequence spectrum.
A possible adaptation for the analysis of four-state character sequences with unequal frequencies is considered. A corresponding
spectral analysis for distance data is also introduced. These analyses are illustrated with biological examples for both distance
and sequence data. Spectral analysis using the Fast Hadamard transform allows optimal trees to be found for at least 20 taxa
and perhaps for up to 30 taxa.
The development presented here is self contained, although some mathematical proofs available elsewhere have been omitted.
The analysis of sequence data is based on methods reported earlier, but the terminology and the application to distance data
are new.
7.
C. A. Glasbey 《Journal of Classification》1987,4(1):103-109
Two commonly used clustering criteria are single linkage, which maximizes the minimum distance between clusters, and complete linkage, which minimizes the maximum distance within a cluster. By synthesizing these criteria, partitions of objects are sought which maximize a combined measure of the minimum distance between clusters and the maximum distance within a cluster. Each combined measure is shown to select a partition in the single linkage hierarchy. Therefore, in effect, complete linkage is used to provide a stopping rule for single linkage. An algorithm is outlined which uses the distance between each pair of objects only twice. To illustrate the method, an example is given using 23 Glamorganshire soil profiles.
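The two quantities being combined can be computed directly for any partition. The following is a minimal sketch, assuming a symmetric distance matrix and a list of cluster labels; the function name and the toy data are hypothetical, and the combined measure of the paper is not reproduced because the abstract does not specify it.

```python
def split_and_diameter(d, labels):
    """Return (min between-cluster distance, max within-cluster distance)."""
    n = len(labels)
    split = float("inf")  # single linkage seeks to maximize this
    diameter = 0          # complete linkage seeks to minimize this
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                diameter = max(diameter, d[i][j])
            else:
                split = min(split, d[i][j])
    return split, diameter


# Hypothetical points on a line and their pairwise distances.
points = [0, 1, 5, 6]
d = [[abs(p - q) for q in points] for p in points]

good = split_and_diameter(d, [0, 0, 1, 1])
poor = split_and_diameter(d, [0, 1, 1, 1])
```

A partition with a large split and a small diameter, such as the first one here, scores well on both criteria at once.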
8.
Optimization Strategies for Two-Mode Partitioning (total citations: 2, self-citations: 2, citations by others: 0)
Joost van Rosmalen Patrick J. F. Groenen Javier Trejos William Castillo 《Journal of Classification》2009,26(2):155-181
Two-mode partitioning is a relatively new form of clustering that clusters both rows and columns of a data matrix. In this
paper, we consider deterministic two-mode partitioning methods in which a criterion similar to k-means is optimized. A variety of optimization methods have been proposed for this type of problem. However, it is still unclear
which method should be used, as various methods may lead to non-global optima. This paper reviews and compares several optimization
methods for two-mode partitioning. Several known methods are discussed, and a new fuzzy steps method is introduced. The fuzzy
steps method is based on the fuzzy c-means algorithm of Bezdek (1981) and the fuzzy steps approach of Heiser and Groenen (1997) and Groenen and Jajuga (2001). The performances of all methods are compared in a large simulation study. In our simulations, a two-mode k-means optimization method most often gives the best results. Finally, an empirical data set is used to give a practical example
of two-mode partitioning.
We would like to thank two anonymous referees whose comments have improved the quality of this paper. We are also grateful
to Peter Verhoef for providing the data set used in this paper.
9.
Multiclass Functional Discriminant Analysis and Its Application to Gesture Recognition (total citations: 1, self-citations: 1, citations by others: 0)
We consider applying a functional logistic discriminant procedure to the analysis of handwritten character data. Time-course
trajectories corresponding to the X and Y coordinate values of handwritten characters written in the air with one finger are
converted into a functional data set via regularized basis expansion. We then apply functional logistic modeling to classify
the functions into several classes. In order to select the values of adjusted parameters involved in the functional logistic
model, we derive a model selection criterion for evaluating models estimated by the method of regularization. Results indicate
the effectiveness of our modeling strategy in terms of prediction accuracy.
10.
Seong Keon Lee 《Journal of Classification》2006,23(1):123-141
In many application fields, multivariate approaches that simultaneously consider the correlation between responses are needed.
The tree method can be extended to multivariate responses, such as repeated measure and longitudinal data, by modifying the
split function so as to accommodate multiple responses. Recently, researchers have constructed some decision trees for multiple
continuous longitudinal response and multiple binary responses using Mahalanobis distance and a generalized entropy index.
However, these methods have limitations according to the type of response, that is, those that are only continuous or binary.
In this paper, we will modify the tree for univariate response procedure and suggest a new tree-based method that can analyze
any type of multiple responses by using GEE (generalized estimating equations) techniques. To compare the performance of trees,
simulation studies on selection probability of true split variable will be shown. Finally, applications using epileptic seizure
data and WWW data are introduced.
11.
We present an approach, independent of the common gradient-based necessary conditions for obtaining a (locally) optimal solution,
to multidimensional scaling using the city-block distance function, and implementable in either a metric or nonmetric context.
The difficulties encountered in relying on a gradient-based strategy are first reviewed: the general weakness in indicating
a good solution that is implied by the satisfaction of the necessary condition of a zero gradient, and the possibility of
actual nonconvergence of the associated optimization strategy. To avoid the dependence on gradients for guiding the optimization
technique, an alternative iterative procedure is proposed that incorporates (a) combinatorial optimization to construct good
object orders along the chosen number of dimensions and (b) nonnegative least-squares to re-estimate the coordinates for the
objects based on the object orders. The re-estimated coordinates are used to improve upon the given object orders, which may
in turn lead to better coordinates, and so on until convergence of the entire process occurs to a (locally) optimal solution.
The approach is illustrated through several data sets on the perception of similarity of rectangles and compared to the results
obtained with a gradient-based method.
12.
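Contemporary Western philosophy of science is moving toward cognitivism, yet its accounts of the nature of science remain divided between internalism and externalism and between naturalization and socialization. This paper argues that a correct understanding of the nature of science requires grounding epistemology in "cultural constructivism". On that basis, the formation and development of knowledge, the criteria of knowledge, and the nature of science can be understood dialectically.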
13.
We construct a weighted Euclidean distance that approximates any distance or dissimilarity measure between individuals that is based on a rectangular cases-by-variables data matrix. In contrast to regular multidimensional scaling methods for dissimilarity data, our approach leads to biplots of individuals and variables while preserving all the good properties of dimension-reduction methods that are based on the singular-value decomposition. The main benefits are the decomposition of variance into components along principal axes, which provide the numerical diagnostics known as contributions, and the estimation of nonnegative weights for each variable. The idea is inspired by the distance functions used in correspondence analysis and in principal component analysis of standardized data, where the normalizations inherent in the distances can be considered as differential weighting of the variables. In weighted Euclidean biplots, we allow these weights to be unknown parameters, which are estimated from the data to maximize the fit to the chosen distances or dissimilarities. These weights are estimated using a majorization algorithm. Once this extra weight-estimation step is accomplished, the procedure follows the classical path in decomposing the matrix and displaying its rows and columns in biplots.
14.
This paper studies the problem of estimating the number of clusters in the context of logistic regression clustering. The
classification likelihood approach is employed to tackle this problem. A model-selection based criterion for selecting the
number of logistic curves is proposed and its asymptotic property is also considered. The small sample performance of the
proposed criterion is studied by Monte Carlo simulation. In addition, a real data example is presented.
The authors would like to thank the editor, Prof. Willem J. Heiser, and the anonymous referees for the valuable comments and
suggestions, which have led to the improvement of this paper.
15.
L. Andries van der Ark Peter G. M. van der Heijden Dirk Sikkel 《Journal of Classification》1999,16(1):117-137
end-member model.
A major drawback of the latent budget model is that, in general, the
model is not identifiable, which complicates the interpretation of the
model considerably. This paper studies the geometry and identifiability
of the latent budget model. Knowledge of the geometric structure of the
model is used to specify an appropriate criterion to identify the model.
The results are illustrated by an empirical data set.
16.
Kohei Adachi 《Journal of Classification》2002,19(2):215-248
NJ by K that represents N individuals' choices among K categories over J time points. The row and column scores of this univariate data matrix cannot be chosen uniquely by any standard optimal scaling
technique. To approach this difficulty, we present a regularized method, in which the scores of individuals over time points
(i.e. row scores) are represented using natural cubic splines. The loss of their smoothness is combined with the loss of homogeneity
underlying the standard technique to form a penalized loss function which is minimized under a normalization constraint. A
graphical representation of the resulting scores allows us easily to grasp the longitudinal changes in individuals. Simulation
analysis is performed to evaluate how well the method recovers true scores, and real data are analyzed for illustration.
17.
Variable Selection for Clustering and Classification (total citations: 2, self-citations: 2, citations by others: 0)
As data sets continue to grow in size and complexity, effective and efficient techniques are needed to target important features in the variable space. Many of the variable selection techniques that are commonly used alongside clustering algorithms are based upon determining the best variable subspace according to model fitting in a stepwise manner. These techniques are often computationally intensive and can require extended periods of time to run; in fact, some are prohibitively computationally expensive for high-dimensional data. In this paper, a novel variable selection technique is introduced for use in clustering and classification analyses that is both intuitive and computationally efficient. We focus largely on applications in mixture model-based learning, but the technique could be adapted for use with various other clustering/classification methods. Our approach is illustrated on both simulated and real data, highlighted by contrasting its performance with that of other comparable variable selection techniques on the real data sets.
18.
19.
The Self-Organizing Feature Maps (SOFM; Kohonen 1984) algorithm is a well-known example of unsupervised learning in connectionism and is a clustering method closely related to k-means. Generally the data set is available before running the algorithm and the clustering problem can be approached by an inertia criterion optimization. In this paper we consider the probabilistic approach to this problem. We propose a new algorithm based on the Expectation Maximization principle (EM; Dempster, Laird, and Rubin 1977). The new method can be viewed as a Kohonen type of EM and gives a better insight into the SOFM according to constrained clustering. We perform numerical experiments and compare our results with the standard Kohonen approach.
20.
Measurements of p variables for n samples are collected into an n×p matrix X, where the samples belong to one of k groups. The group means are separated by Mahalanobis distances. Canonical variate analysis (CVA) optimally represents the group means of X in an r-dimensional space. This can be done by maximizing a ratio criterion (basically one-dimensional) or, more flexibly, by minimizing a rank-constrained least-squares fitting criterion (which is not confined to being one-dimensional but depends on defining an appropriate Mahalanobis metric). In modern n < p problems, where W is not of full rank, the ratio criterion is shown not to be coherent but the fit criterion, with attention to associated metrics, readily generalizes. In this context we give a unified generalization of CVA, introducing two metrics, one in the range space of W and the other in the null space of W, that have links with Mahalanobis distance. This generalization is computationally efficient, since it requires only the spectral decomposition of an n×n matrix.