Similar Documents
1.
We present a new distance-based quartet method for phylogenetic tree reconstruction, called Minimum Tree Cost Quartet Puzzling. Starting from a distance matrix computed from natural data, the algorithm incrementally constructs a tree by adding one taxon at a time to the intermediary tree, using a cost function based on the relaxed 4-point condition for weighting quartets. Different input orders of taxa lead to trees with distinct topologies, which can be evaluated using a maximum likelihood or weighted least squares optimality criterion. Using reduced sets of quartets and a simple heuristic tree search strategy, we obtain an overall complexity of O(n^5 log^2 n) for the algorithm. We evaluate the performance of the method through comparative tests and show that it outperforms NJ when a weighted least squares optimality criterion is employed. We also discuss the theoretical boundaries of the algorithm.
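The relaxed 4-point condition used above builds on the classical four-point condition for additive (tree) metrics: for any four taxa, the two largest of the three pairwise distance sums must be equal. A minimal sketch of that classical check (function names are illustrative, not from the paper):

```python
def quartet_sums(d, i, j, k, l):
    """The three pairwise distance sums for taxa i, j, k, l,
    sorted in ascending order; d is a symmetric distance matrix."""
    return sorted([d[i][j] + d[k][l],
                   d[i][k] + d[j][l],
                   d[i][l] + d[j][k]])

def four_point_ok(d, i, j, k, l, tol=1e-9):
    """Classical four-point condition: the two largest sums coincide.
    Holds exactly when the quartet distances fit an (unrooted) tree."""
    s = quartet_sums(d, i, j, k, l)
    return abs(s[2] - s[1]) <= tol
```

The smallest of the three sums identifies the quartet topology; the paper's cost function scores quartets by how far they deviate from this condition.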

2.
3.
A two-level data set consists of entities of a higher level (say populations), each one being composed of several units of the lower level (say individuals). Observations are made at the individual level, whereas population characteristics are aggregated from individual data. Cluster analysis with subsampling of populations is a cluster analysis based on individual data that aims at clustering populations rather than individuals. In this article, we extend existing optimality criteria for cluster analysis with subsampling of populations to deal with situations where population characteristics are not the mean of individual data. A new criterion that depends on the Mahalanobis distance is also defined. The criteria are compared using simulated examples and an ecological data set of tree species in a tropical rain forest.
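As a point of reference for the Mahalanobis-based criterion mentioned above, here is a minimal sketch of the Mahalanobis distance between two population mean vectors given a covariance matrix (not the paper's full criterion):

```python
import numpy as np

def mahalanobis(mu1, mu2, cov):
    """Mahalanobis distance sqrt((mu1-mu2)' cov^{-1} (mu1-mu2)).
    Reduces to Euclidean distance when cov is the identity."""
    diff = np.asarray(mu1, float) - np.asarray(mu2, float)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))
```

Unlike Euclidean distance, this down-weights directions in which the data vary a lot, which is why it is natural when population characteristics have unequal scales or correlations.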

4.
A natural extension of classical metric multidimensional scaling is proposed. The result is a new formulation of nonmetric multidimensional scaling in which the strain criterion is minimized subject to order constraints on the disparity variables. Innovative features of the new formulation include: the parametrization of the p-dimensional distance matrices by the positive semidefinite matrices of rank ≤ p; optimization of the (squared) disparity variables, rather than the configuration coordinate variables; and a new nondegeneracy constraint, which restricts the set of (squared) disparities rather than the set of distances. Solutions are obtained using an easily implemented gradient projection method for numerical optimization. The method is applied to two published data sets.

5.
Minimum edit distance is a method for measuring the similarity between different symbol strings in a language. It counts the number of deletion, insertion, and substitution operations required to transform one string into another, and it is computed by a dynamic programming algorithm. In terminology research, minimum edit distance can be used for the quantitative analysis of term features. In computational linguistics, it can be used to detect potential spelling errors and perform spelling correction. In speech recognition, it can be used to compute word error rates. In machine translation, it can be used for word alignment in bilingual corpora.
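The dynamic programming computation described above is the classical Wagner-Fischer algorithm; a minimal sketch with unit costs for all three operations:

```python
def min_edit_distance(s, t):
    """Minimum edit distance between strings s and t via dynamic
    programming; insertion, deletion, and substitution all cost 1."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                       # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                       # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution/match
    return dp[m][n]
```

For example, `min_edit_distance("kitten", "sitting")` is 3 (two substitutions and one insertion). Variants used for word error rate or word alignment adjust the operation costs but keep the same recurrence.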

6.
Spectral analysis of phylogenetic data
The spectral analysis of sequence and distance data is a new approach to phylogenetic analysis. For two-state character sequences, the character values at a given site split the set of taxa into two subsets, a bipartition of the taxa set. The vector which counts the relative numbers of each of these bipartitions over all sites is called a sequence spectrum. Applying a transformation called a Hadamard conjugation, the sequence spectrum is transformed to the conjugate spectrum. This conjugation corrects for unobserved changes in the data, independently of the choice of phylogenetic tree. For any given phylogenetic tree with edge weights (probabilities of state change), we define a corresponding tree spectrum. The selection of a weighted phylogenetic tree from the given sequence data is made by matching the conjugate spectrum with a tree spectrum. We develop an optimality selection procedure using a least squares best fit, to find the phylogenetic tree whose tree spectrum most closely matches the conjugate spectrum. An inferred sequence spectrum can be derived from the selected tree spectrum using the inverse Hadamard conjugation to allow a comparison with the original sequence spectrum. A possible adaptation for the analysis of four-state character sequences with unequal frequencies is considered. A corresponding spectral analysis for distance data is also introduced. These analyses are illustrated with biological examples for both distance and sequence data. Spectral analysis using the Fast Hadamard transform allows optimal trees to be found for at least 20 taxa and perhaps for up to 30 taxa. The development presented here is self-contained, although some mathematical proofs available elsewhere have been omitted. The analysis of sequence data is based on methods reported earlier, but the terminology and the application to distance data are new.
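The Fast Hadamard transform mentioned above is what makes spectra for 20-30 taxa tractable, since it multiplies a length-2^n vector by a Hadamard matrix in O(n 2^n) rather than O(4^n) operations. A generic in-place sketch (the full Hadamard conjugation also involves elementwise log and exp steps not shown here):

```python
def fwht(a):
    """Fast Walsh-Hadamard transform of a sequence whose length is a
    power of two; returns H @ a for the (unnormalized) Hadamard matrix H.
    Applying it twice returns len(a) times the original vector."""
    a = list(a)
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):       # blocks of size 2h
            for j in range(i, i + h):           # butterfly updates
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a
```

Each pass performs the butterfly step of one Kronecker factor of H, which is why the total cost is O(n 2^n) for 2^n entries.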

7.
Complete linkage as a multiple stopping rule for single linkage clustering
Two commonly used clustering criteria are single linkage, which maximizes the minimum distance between clusters, and complete linkage, which minimizes the maximum distance within a cluster. By synthesizing these criteria, partitions of objects are sought which maximize a combined measure of the minimum distance between clusters and the maximum distance within a cluster. Each combined measure is shown to select a partition in the single linkage hierarchy. Therefore, in effect, complete linkage is used to provide a stopping rule for single linkage. An algorithm is outlined which uses the distance between each pair of objects twice only. To illustrate the method, an example is given using 23 Glamorganshire soil profiles.
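The combined measure synthesizes two quantities for a candidate partition: the single-linkage split (minimum between-cluster distance) and the complete-linkage diameter (maximum within-cluster distance). A minimal sketch that computes both for a given partition (function and argument names are illustrative, not the paper's algorithm):

```python
from itertools import combinations

def split_and_diameter(points, partition, dist):
    """For a partition given as lists of point indices, return:
    split = minimum distance between points in different clusters,
    diam  = maximum distance between points in the same cluster."""
    split = min(dist(points[i], points[j])
                for a, b in combinations(partition, 2)
                for i in a for j in b)
    diam = max((dist(points[i], points[j])
                for c in partition for i, j in combinations(c, 2)),
               default=0.0)
    return split, diam
```

A good partition has a large split and a small diameter; the paper shows that maximizing any combined measure of the two selects a level of the single-linkage hierarchy.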

8.
Optimization Strategies for Two-Mode Partitioning
Two-mode partitioning is a relatively new form of clustering that clusters both rows and columns of a data matrix. In this paper, we consider deterministic two-mode partitioning methods in which a criterion similar to k-means is optimized. A variety of optimization methods have been proposed for this type of problem. However, it is still unclear which method should be used, as various methods may lead to non-global optima. This paper reviews and compares several optimization methods for two-mode partitioning. Several known methods are discussed, and a new fuzzy steps method is introduced. The fuzzy steps method is based on the fuzzy c-means algorithm of Bezdek (1981) and the fuzzy steps approach of Heiser and Groenen (1997) and Groenen and Jajuga (2001). The performances of all methods are compared in a large simulation study. In our simulations, a two-mode k-means optimization method most often gives the best results. Finally, an empirical data set is used to give a practical example of two-mode partitioning.
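A two-mode k-means-style criterion fits each row-cluster-by-column-cluster block with its mean and alternates reassignments. A minimal sketch under simplifying assumptions (deterministic initialization for brevity; as the paper's simulations suggest, real use needs multiple starts to escape non-global optima):

```python
import numpy as np

def two_mode_kmeans(X, p, q, iters=20):
    """Alternating two-mode partitioning sketch: rows into p clusters,
    columns into q clusters, minimizing squared error around block means."""
    n, m = X.shape
    r = np.arange(n) % p              # row cluster labels (deterministic init)
    c = np.arange(m) % q              # column cluster labels
    for _ in range(iters):
        V = np.zeros((p, q))          # block means
        for k in range(p):
            for l in range(q):
                block = X[np.ix_(r == k, c == l)]
                if block.size:
                    V[k, l] = block.mean()
        # reassign each row, then each column, to its best-fitting cluster
        r = np.array([min(range(p), key=lambda k: ((X[i] - V[k, c]) ** 2).sum())
                      for i in range(n)])
        c = np.array([min(range(q), key=lambda l: ((X[:, j] - V[r, l]) ** 2).sum())
                      for j in range(m)])
    return r, c, V
```

Each sweep cannot increase the squared-error criterion, so the procedure converges, though possibly to a local optimum.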

9.
We consider applying a functional logistic discriminant procedure to the analysis of handwritten character data. Time-course trajectories corresponding to the X and Y coordinate values of handwritten characters written in the air with one finger are converted into a functional data set via regularized basis expansion. We then apply functional logistic modeling to classify the functions into several classes. In order to select the values of adjusted parameters involved in the functional logistic model, we derive a model selection criterion for evaluating models estimated by the method of regularization. Results indicate the effectiveness of our modeling strategy in terms of prediction accuracy.

10.
In many application fields, multivariate approaches that simultaneously consider the correlation between responses are needed. The tree method can be extended to multivariate responses, such as repeated measure and longitudinal data, by modifying the split function so as to accommodate multiple responses. Recently, researchers have constructed decision trees for multiple continuous longitudinal responses and multiple binary responses using the Mahalanobis distance and a generalized entropy index. However, these methods are limited to a single response type, that is, to responses that are all continuous or all binary. In this paper, we modify the univariate-response tree procedure and suggest a new tree-based method that can analyze any type of multiple responses using GEE (generalized estimating equation) techniques. To compare the performance of the trees, simulation studies on the selection probability of the true split variable are presented. Finally, applications using epileptic seizure data and WWW data are introduced.

11.
Multidimensional scaling in the city-block metric: A combinatorial approach
We present an approach, independent of the common gradient-based necessary conditions for obtaining a (locally) optimal solution, to multidimensional scaling using the city-block distance function, and implementable in either a metric or nonmetric context. The difficulties encountered in relying on a gradient-based strategy are first reviewed: the general weakness in indicating a good solution that is implied by the satisfaction of the necessary condition of a zero gradient, and the possibility of actual nonconvergence of the associated optimization strategy. To avoid the dependence on gradients for guiding the optimization technique, an alternative iterative procedure is proposed that incorporates (a) combinatorial optimization to construct good object orders along the chosen number of dimensions and (b) nonnegative least-squares to re-estimate the coordinates for the objects based on the object orders. The re-estimated coordinates are used to improve upon the given object orders, which may in turn lead to better coordinates, and so on until convergence of the entire process occurs to a (locally) optimal solution. The approach is illustrated through several data sets on the perception of similarity of rectangles and compared to the results obtained with a gradient-based method.

12.
Contemporary Western philosophy of science is developing toward cognitivism, yet debates between internalism and externalism, and between naturalization and socialization, persist in its understanding of the nature of science. This paper argues that a correct understanding of the nature of science requires grounding epistemology in "cultural constructivism." On that basis, the formation and development of cognition, the standards of cognition, and the nature of science can be understood dialectically.

13.
We construct a weighted Euclidean distance that approximates any distance or dissimilarity measure between individuals that is based on a rectangular cases-by-variables data matrix. In contrast to regular multidimensional scaling methods for dissimilarity data, our approach leads to biplots of individuals and variables while preserving all the good properties of dimension-reduction methods that are based on the singular-value decomposition. The main benefits are the decomposition of variance into components along principal axes, which provide the numerical diagnostics known as contributions, and the estimation of nonnegative weights for each variable. The idea is inspired by the distance functions used in correspondence analysis and in principal component analysis of standardized data, where the normalizations inherent in the distances can be considered as differential weighting of the variables. In weighted Euclidean biplots, we allow these weights to be unknown parameters, which are estimated from the data to maximize the fit to the chosen distances or dissimilarities. These weights are estimated using a majorization algorithm. Once this extra weight-estimation step is accomplished, the procedure follows the classical path in decomposing the matrix and displaying its rows and columns in biplots.
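The weighted Euclidean distance at the heart of this approach is simply the Euclidean distance with a nonnegative weight per variable; a minimal sketch (the paper's contribution is estimating the weights by majorization, which is not shown):

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance sqrt(sum_k w_k (x_k - y_k)^2),
    with nonnegative per-variable weights w_k."""
    x, y, w = (np.asarray(v, float) for v in (x, y, w))
    return float(np.sqrt((w * (x - y) ** 2).sum()))
```

Setting each weight to the reciprocal variance of its variable recovers the distance implicit in PCA of standardized data, which is the special case the abstract's "differential weighting" remark refers to.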

14.
This paper studies the problem of estimating the number of clusters in the context of logistic regression clustering. The classification likelihood approach is employed to tackle this problem. A model-selection-based criterion for selecting the number of logistic curves is proposed and its asymptotic property is also considered. The small-sample performance of the proposed criterion is studied by Monte Carlo simulation. In addition, a real data example is presented.

15.
A major drawback of the latent budget model, also known as the end-member model, is that, in general, the model is not identifiable, which complicates its interpretation considerably. This paper studies the geometry and identifiability of the latent budget model. Knowledge of the geometric structure of the model is used to specify an appropriate criterion to identify the model. The results are illustrated by an empirical data set.

16.
We consider an NJ-by-K indicator matrix that represents N individuals' choices among K categories over J time points. The row and column scores of this univariate data matrix cannot be chosen uniquely by any standard optimal scaling technique. To approach this difficulty, we present a regularized method in which the scores of individuals over time points (i.e., the row scores) are represented using natural cubic splines. The loss of their smoothness is combined with the loss of homogeneity underlying the standard technique to form a penalized loss function, which is minimized under a normalization constraint. A graphical representation of the resulting scores allows us to easily grasp the longitudinal changes in individuals. Simulation analysis is performed to evaluate how well the method recovers true scores, and real data are analyzed for illustration.

17.
Variable Selection for Clustering and Classification
As data sets continue to grow in size and complexity, effective and efficient techniques are needed to target important features in the variable space. Many of the variable selection techniques that are commonly used alongside clustering algorithms are based upon determining the best variable subspace according to model fitting in a stepwise manner. These techniques are often computationally intensive and can require extended periods of time to run; in fact, some are prohibitively computationally expensive for high-dimensional data. In this paper, a novel variable selection technique is introduced for use in clustering and classification analyses that is both intuitive and computationally efficient. We focus largely on applications in mixture model-based learning, but the technique could be adapted for use with various other clustering/classification methods. Our approach is illustrated on both simulated and real data, highlighted by contrasting its performance with that of other comparable variable selection techniques on the real data sets.

18.
19.
The Self-Organizing Feature Maps (SOFM; Kohonen 1984) algorithm is a well-known example of unsupervised learning in connectionism and is a clustering method closely related to k-means. Generally the data set is available before running the algorithm and the clustering problem can be approached by an inertia criterion optimization. In this paper we consider the probabilistic approach to this problem. We propose a new algorithm based on the Expectation Maximization principle (EM; Dempster, Laird, and Rubin 1977). The new method can be viewed as a Kohonen type of EM and gives a better insight into the SOFM according to constrained clustering. We perform numerical experiments and compare our results with the standard Kohonen approach.
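The EM building block underlying such a probabilistic view can be sketched for an equal-weight isotropic Gaussian mixture; the paper's contribution is to add Kohonen-style neighborhood constraints on top of this, which the sketch below does not include:

```python
import numpy as np

def em_step(X, mu, sigma=1.0):
    """One EM iteration for an isotropic Gaussian mixture with equal
    mixing weights: the E-step computes responsibilities, the M-step
    re-estimates the component means. X is (n, d), mu is (p, d)."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, p) squared dists
    R = np.exp(-d2 / (2 * sigma ** 2))                     # unnormalized resp.
    R /= R.sum(axis=1, keepdims=True)                      # E-step
    return (R[:, :, None] * X[:, None, :]).sum(0) / R.sum(0)[:, None]  # M-step
```

Replacing the soft responsibilities with hard 0/1 assignments recovers the k-means update, which is the connection between SOFM, k-means, and EM that the abstract alludes to.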

20.
Measurements of p variables for n samples are collected into an n×p matrix X, where the samples belong to one of k groups. The group means are separated by Mahalanobis distances. Canonical variate analysis (CVA) optimally represents the group means of X in an r-dimensional space. This can be done by maximizing a ratio criterion (basically one-dimensional) or, more flexibly, by minimizing a rank-constrained least-squares fitting criterion (which is not confined to being one-dimensional but depends on defining an appropriate Mahalanobis metric). In modern n < p problems, where the within-group scatter matrix W is not of full rank, the ratio criterion is shown not to be coherent but the fit criterion, with attention to associated metrics, readily generalizes. In this context we give a unified generalization of CVA, introducing two metrics, one in the range space of W and the other in the null space of W, that have links with Mahalanobis distance. This generalization is computationally efficient, since it requires only the spectral decomposition of an n×n matrix.
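For orientation, classical CVA in the full-rank case reduces to an eigenproblem on the within- and between-group scatter matrices. A minimal sketch of that classical case only; the singular-W, n < p setting is exactly what this sketch cannot handle and what the paper generalizes:

```python
import numpy as np

def canonical_variates(X, labels, r=1):
    """Classical CVA when W has full rank: the leading r eigenvectors of
    W^{-1} B, where B is the between-group scatter and W the within-group
    scatter. Breaks down when W is singular (the n < p case)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    grand = X.mean(0)
    B = np.zeros((X.shape[1],) * 2)
    W = np.zeros_like(B)
    for g in np.unique(labels):
        G = X[labels == g]
        d = G.mean(0) - grand
        B += len(G) * np.outer(d, d)                 # between-group scatter
        W += (G - G.mean(0)).T @ (G - G.mean(0))     # within-group scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(vals.real)[::-1]              # sort by eigenvalue
    return vecs.real[:, order[:r]]
```

Projecting X onto these vectors gives the r-dimensional display of group means; the Mahalanobis metric enters through the W^{-1} factor.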


Copyright©北京勤云科技发展有限公司  京ICP备09084417号