Similar literature
Found 20 similar documents (search time: 31 ms)
1.
The distribution of lengths of phylogenetic trees under the taxonomic principle of parsimony is compared with the distribution obtained by randomizing the characters of the sequence data. This comparison allows us to define a measure of the extent to which sequence data contain significant hierarchical information. We show how to calculate this measure exactly for up to 10 taxa, and provide a good approximation for larger sets of taxa. The measure is applied to test sequences on 10 and 15 taxa.

2.
We present a new distance-based quartet method for phylogenetic tree reconstruction, called Minimum Tree Cost Quartet Puzzling. Starting from a distance matrix computed from natural data, the algorithm incrementally constructs a tree by adding one taxon at a time to the intermediary tree using a cost function based on the relaxed 4-point condition for weighting quartets. Different input orders of taxa lead to trees having distinct topologies which can be evaluated using a maximum likelihood or weighted least squares optimality criterion. Using reduced sets of quartets and a simple heuristic tree search strategy we obtain an overall complexity of O(n^5 log2 n) for the algorithm. We evaluate the performance of the method through comparative tests and show that our method outperforms NJ when a weighted least squares optimality criterion is employed. We also discuss the theoretical boundaries of the algorithm.
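For readers unfamiliar with the 4-point condition mentioned in this abstract, the following minimal Python sketch may help. It is not the paper's cost function: the helper name `four_point_slack` and the toy distance matrix are hypothetical, and the sketch only measures how far a quartet is from the exact condition, under which the two largest of the three pairwise sums are equal.

```python
def four_point_slack(d, i, j, k, l):
    """Gap between the two largest of the three pairwise sums of a quartet.

    For distances that fit a tree exactly, the two largest sums coincide,
    so the returned slack is 0. Larger values mean a worse fit.
    """
    sums = sorted([
        d[i][j] + d[k][l],
        d[i][k] + d[j][l],
        d[i][l] + d[j][k],
    ])
    return sums[2] - sums[1]


# Toy distances read off the tree ((0,1),(2,3)) with pendant edges of
# length 1 and an internal edge of length 2 (hypothetical example data).
D = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]

print(four_point_slack(D, 0, 1, 2, 3))
```

A "relaxed" version of the condition, as the abstract calls it, would presumably tolerate a nonzero slack; how the paper turns that into quartet weights is not stated here.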

3.
A new method, TreeOfTrees, is proposed to compare X-tree structures obtained from several sets of aligned gene sequences of the same taxa. Its aim is to detect genes or sets of genes having different evolutionary histories. The comparison between sets of trees is based on several tree metrics, leading to a unique tree labelled by the gene trees. The robustness values of its edges are estimated by bootstrapping and consensus procedures that allow the detection of subsets of genes that have evolved differently. Simulations are performed under various evolutionary conditions to test the efficiency of the method and an application on real data is described. Tests of arboricity and various consensus algorithms are also discussed. A corresponding software package is available.

4.
A two-level data set consists of entities of a higher level (say populations), each one being composed of several units of the lower level (say individuals). Observations are made at the individual level, whereas population characteristics are aggregated from individual data. Cluster analysis with subsampling of populations is a cluster analysis based on individual data that aims at clustering populations rather than individuals. In this article, we extend existing optimality criteria for cluster analysis with subsampling of populations to deal with situations where population characteristics are not the mean of individual data. A new criterion that depends on the Mahalanobis distance is also defined. The criteria are compared using simulated examples and an ecological data set of tree species in a tropical rain forest.

5.
The Neighbor-Joining (NJ) method of Saitou and Nei is the most widely used distance-based method in phylogenetic analysis. Central to the method is the selection criterion, the formula used to choose which pair of objects to amalgamate next. Here we analyze the NJ selection criterion using an axiomatic approach. We show that any selection criterion that is linear, permutation equivariant, statistically consistent and based solely on distance data will give the same trees as those created by NJ.
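The selection criterion discussed in this abstract can be sketched in a few lines of Python. The sketch below uses the commonly cited form Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) and picks the pair with the smallest value; the function name `nj_select` and the five-taxon matrix are illustrative assumptions, not taken from the paper.

```python
def nj_select(d):
    """Return the pair (i, j) minimizing the NJ criterion, and its Q value.

    d is a symmetric distance matrix (list of lists) with zero diagonal.
    """
    n = len(d)
    row_sums = [sum(row) for row in d]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


# Hypothetical distances between five taxa.
D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

print(nj_select(D))
```

Note that the pair chosen here is (0, 1) even though taxa 3 and 4 are closer in raw distance, which is the point of correcting each distance by the row sums.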

6.
Statistical analyses of a published phylogenetic classification of languages show some properties attributable to taxonomic methods and others that reflect the nature of linguistic evolution. The inferred phylogenetic tree is less well resolved and more asymmetric at the highest taxonomic ranks, where the tree is constructed mainly by phenetic methods. At lower ranks, where cladistic methods are more prevalent, the asymmetry of well resolved parts of the tree is consistent with a stochastic birth and death process in which languages originate and become extinct at constant rates, although poorly resolved parts of the tree are still more asymmetric than predicted. Other tests applied to a sample of historically recorded languages reveal substantial fluctuations in the rates of origination and extinction, with both rates temporarily reduced when languages enter the historical record. For languages in general, the average origination rate is estimated to be only slightly higher than the average extinction rate, which in turn corresponds to an average lifetime of about 500 years or less. This research was supported by a grant from the UCLA Academic Senate and by computer time from the UCLA Office of the Academic Computing. I thank Merritt Ruhlen, Joseph B. Slowinski, and Thomas D. Wickens for helpful information and suggestions.

7.
In this paper, dissimilarity relations are defined on triples rather than on dyads. We give a definition of a three-way distance analogous to that of the ordinary two-way distance. It is shown, as a straightforward generalization, that it is possible to define three-way ultrametric, three-way star, and three-way Euclidean distances. Special attention is paid to a model called the semi-perimeter model. We construct new methods analogous to the existing ones for ordinary distances, for example: principal coordinates analysis, the generalized Prim (1957) algorithm, hierarchical cluster analysis.

8.
A trend in educational testing is to go beyond unidimensional scoring and provide a more complete profile of skills that have been mastered and those that have not. To achieve this, cognitive diagnosis models have been developed that can be viewed as restricted latent class models. Diagnosis of class membership is the statistical objective of these models. As an alternative to latent class modeling, a nonparametric procedure is introduced that only requires specification of an item-by-attribute association matrix, and classifies by minimizing a distance measure between the observed responses and the ideal response that the item-by-attribute association matrix implies for a given attribute profile. This procedure requires no statistical parameter estimation, and can be used on a sample size as small as 1. Heuristic arguments are given for why the nonparametric procedure should be effective under various possible cognitive diagnosis models for data generation. Simulation studies compare classification rates with parametric models, and consider a variety of distance measures, data generation models, and the effects of model misspecification. A real data example is provided with an analysis of agreement between the nonparametric method and parametric approaches.
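The classification idea in this abstract can be illustrated with a small Python sketch. Two choices below are assumptions made for the example and are not confirmed by the abstract: the ideal response follows a conjunctive rule (an item is answered correctly only if every attribute it requires is mastered), and the distance is the Hamming distance. The function names and the matrix `Q` are hypothetical.

```python
from itertools import product


def ideal_response(profile, q_matrix):
    """Ideal 0/1 responses under an assumed conjunctive rule."""
    return [
        int(all(a >= q for a, q in zip(profile, row)))
        for row in q_matrix
    ]


def classify(responses, q_matrix):
    """Return the attribute profile whose ideal response is closest
    (in Hamming distance) to the observed responses."""
    n_attr = len(q_matrix[0])

    def hamming(profile):
        ideal = ideal_response(profile, q_matrix)
        return sum(x != e for x, e in zip(responses, ideal))

    # Ties are broken by enumeration order, an arbitrary choice.
    return min(product([0, 1], repeat=n_attr), key=hamming)


# Hypothetical item-by-attribute matrix: three items, two attributes.
Q = [
    [1, 0],  # item 1 requires attribute 1
    [0, 1],  # item 2 requires attribute 2
    [1, 1],  # item 3 requires both
]

print(classify([1, 0, 0], Q))
```

Because nothing is estimated, the function works on a single examinee's response vector, which matches the abstract's remark about a sample size as small as 1.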

9.
Invariants of phylogenies in a simple case with discrete states (cited by: 1; self-citations: 1; citations by others: 0)
Under a simple model of transition between two states, we can work out the probabilities of different data outcomes in four species with any given phylogeny. For a given tree topology, if all characters are evolving under the same probabilistic model, there are two quadratic forms in the frequencies of outcomes that must be zero. It may be possible to test the null hypothesis that the tree is of a particular topology by testing whether these quadratic forms are zero. One of the tests is a test for independence in a simple 2×2 contingency table. If there are differences of evolutionary rate among characters, these quadratic forms will no longer necessarily be zero.

10.
In many application fields, multivariate approaches that simultaneously consider the correlation between responses are needed. The tree method can be extended to multivariate responses, such as repeated measures and longitudinal data, by modifying the split function so as to accommodate multiple responses. Recently, researchers have constructed some decision trees for multiple continuous longitudinal responses and multiple binary responses using Mahalanobis distance and a generalized entropy index. However, these methods are limited by the type of response: they handle only continuous or only binary responses. In this paper, we modify the tree procedure for a univariate response and suggest a new tree-based method that can analyze any type of multiple responses by using GEE (generalized estimating equations) techniques. To compare the performance of trees, simulation studies on the selection probability of the true split variable are shown. Finally, applications using epileptic seizure data and WWW data are introduced.

11.
Models for the representation of proximity data (similarities/dissimilarities) can be categorized into one of three groups of models: continuous spatial models, discrete nonspatial models, and hybrid models (which combine aspects of both spatial and discrete models). Multidimensional scaling models and associated methods, used for the spatial representation of such proximity data, have been devised to accommodate two, three, and higher-way arrays. At least one model/method for overlapping (but generally non-hierarchical) clustering called INDCLUS (Carroll and Arabie 1983) has been devised for the case of three-way arrays of proximity data. Tree-fitting methods, used for the discrete network representation of such proximity data, have only thus far been devised to handle two-way arrays. This paper develops a new methodology called INDTREES (for INdividual Differences in TREE Structures) for fitting various (discrete) tree structures to three-way proximity data. This individual differences generalization is one in which different individuals, for example, are assumed to base their judgments on the same family of trees, but are allowed to have different node heights and/or branch lengths. We initially present an introductory overview focussing on existing two-way models. The INDTREES model and algorithm are then described in detail. Monte Carlo results for the INDTREES fitting of four different three-way data sets are presented. In the application, a single ultrametric tree is fitted to three-way proximity data derived from intention-to-buy data for various brands of over-the-counter pain relievers for relieving three common types of maladies. Finally, we briefly describe how the INDTREES procedure can be extended to accommodate hybrid modelling, as well as to handle other types of applications.

12.
In this study, we consider the type of interval data summarizing the original samples (individuals) with classical point data. This type of interval data is termed interval symbolic data in a new research domain called symbolic data analysis. Most of the existing research, such as the (centre, radius) and [lower boundary, upper boundary] representations, represents an interval using only the boundaries of the interval. However, these representations hold true only under the assumption that the individuals contained in the interval follow a uniform distribution. In practice, such representations may result not only in inconsistency with the facts, since the individuals are usually not uniformly distributed in many applications, but also in information loss, because the point data within the intervals are not considered during the calculation. In this study, we propose a new representation of the interval symbolic data considering the point data contained in the intervals. Then we apply the city-block distance metric to the new representation and propose a dynamic clustering approach for interval symbolic data. A simulation experiment is conducted to evaluate the performance of our method. The results show that, when the individuals contained in the interval do not follow a uniform distribution, the proposed method significantly outperforms the Hausdorff and city-block distance based on traditional representation in the context of dynamic clustering. Finally, we give an application example on the automobile data set.
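To make the limitation of boundary-only representations concrete, here is a minimal Python sketch of the two traditional distances the abstract names, applied to intervals written as (lower, upper) pairs. The function names and sample data are hypothetical, and the paper's new representation is not reproduced because the abstract does not specify it.

```python
def hausdorff(u, v):
    """Hausdorff distance between two closed intervals (lower, upper)."""
    return max(abs(u[0] - v[0]), abs(u[1] - v[1]))


def city_block(u, v):
    """City-block distance on the interval boundaries."""
    return abs(u[0] - v[0]) + abs(u[1] - v[1])


def to_interval(points):
    """Summarize point data by its [lower boundary, upper boundary]."""
    return (min(points), max(points))


# Two hypothetical groups with the same range but very different interiors.
group_a = [0, 1, 1, 1, 10]
group_b = [0, 9, 9, 9, 10]

ia, ib = to_interval(group_a), to_interval(group_b)
print(hausdorff(ia, ib), city_block(ia, ib))
```

Both distances are zero for these two groups, even though their points are concentrated at opposite ends of the interval, which is the kind of information loss the abstract describes.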

13.
p similarity function, the L_p-transform and the Minkowski-p distance. We will prove that triadic distance models defined by the L_p-transform do not model three-way association. Moreover, triadic distance models defined by the L_p-transform are restricted multiple dyadic distances, where each dyadic distance is defined for a two-way margin of the three-way table. Distance models for three-way two-mode data, called three-way distance models, do succeed in modeling three-way association.

14.
A mathematical programming approach to fitting general graphs (cited by: 1; self-citations: 1; citations by others: 0)
We present an algorithm for fitting general graphs to proximity data. The algorithm utilizes a mathematical programming procedure based on a penalty function approach to impose additivity constraints upon parameters. For a user-specified number of links, the algorithm seeks to provide the connected network that gives the least-squares approximation to the proximity data with the specified number of links, allowing for linear transformations of the data. The network distance is the minimum-path-length metric for connected graphs. As a limiting case, the algorithm provides a tree where each node corresponds to an object, if the number of links is set equal to the number of objects minus one. A Monte Carlo investigation indicates that the resulting networks tend to fall within one percentage point of the least-squares solution in terms of the variance accounted for, but do not always attain this global optimum. The network model is discussed in relation to ordinal network representations (Klauer 1989) and NETSCAL (Hutchinson 1989), and applied to several well-known data sets.

15.
Single linkage clusters on a set of points are the maximal connected sets in a graph constructed by connecting all points closer than a given threshold distance. The complete set of single linkage clusters is obtained from all the graphs constructed using different threshold distances. The set of clusters forms a hierarchical tree, in which each non-singleton cluster divides into two or more subclusters; the runt size for each single linkage cluster is the number of points in its smallest subcluster. The maximum runt size over all single linkage clusters is our proposed test statistic for assessing multimodality. We give significance levels of the test for two null hypotheses, and consider its power against some bimodal alternatives. Research partially supported by NSF Grant No. DMS-8617919.
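The test statistic defined in this abstract can be sketched in Python as follows. The sketch assumes one-dimensional points and distinct interpoint distances, so that every cluster splits into exactly two subclusters and the runt size at each merge is the size of the smaller one; the function name `max_runt_size` and the sample data are hypothetical. It computes only the statistic, not the significance levels.

```python
from itertools import combinations


def max_runt_size(points):
    """Maximum runt size over all single linkage clusters of 1-D points.

    Clusters are built by merging along edges in increasing order of
    length (Kruskal-style), tracked with a union-find structure.
    """
    n = len(points)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Find the cluster representative, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(
        (abs(points[i] - points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    best = 0
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # already in the same single linkage cluster
        # The new cluster divides into these two subclusters.
        best = max(best, min(size[ri], size[rj]))
        parent[ri] = rj
        size[rj] += size[ri]
    return best


# Two well-separated groups of three points each (hypothetical data).
print(max_runt_size([0, 1, 3, 20, 24, 29]))
```

A large value indicates that some cluster splits into two sizeable subclusters, which is what one would expect from a bimodal sample; a chain of points that absorbs one point at a time yields a value of 1.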

16.
In this paper we show how biplot methodology can be combined with various forms of discriminant analyses leading to highly informative visual displays of the respective class separations. It is demonstrated that the concept of distance as applied to discriminant analysis provides a unified approach to a wide variety of discriminant analysis procedures that can be accommodated by just changing to an appropriate distance metric. These changes in the distance metric are crucial for the construction of appropriate biplots. Several new types of biplots, viz. quadratic discriminant analysis biplots for use with heteroscedastic stratified data, discriminant subspace biplots and flexible discriminant analysis biplots, are derived and their use illustrated. Advantages of the proposed procedures are pointed out. Although biplot methodology is particularly well suited for complementing discrimination problems with J > 2 classes, its use in 2-class problems is also illustrated.

17.
We consider two fundamental properties in the analysis of two-way tables of positive data: the principle of distributional equivalence, one of the cornerstones of correspondence analysis of contingency tables, and the principle of subcompositional coherence, which forms the basis of compositional data analysis. For an analysis to be subcompositionally coherent, it suffices to analyze the ratios of the data values. A common approach to dimension reduction in compositional data analysis is to perform principal component analysis on the logarithms of ratios, but this method does not obey the principle of distributional equivalence. We show that by introducing weights for the rows and columns, the method achieves this desirable property and can be applied to a wider class of methods. This weighted log-ratio analysis is theoretically equivalent to “spectral mapping”, a multivariate method developed almost 30 years ago for displaying ratio-scale data from biological activity spectra. The close relationship between spectral mapping and correspondence analysis is also explained, as well as their connection with association modeling. The weighted log-ratio methodology is used here to visualize frequency data in linguistics and chemical compositional data in archeology. The first author acknowledges research support from the Fundación BBVA in Madrid as well as partial support by the Spanish Ministry of Education and Science, grant MEC-SEJ2006-14098. The constructive comments of the referees, who also brought additional relevant literature to our attention, significantly improved our article.

18.
We propose a development stemming from Roux (1988). The principle is progressively to modify the dissimilarities so that every quadruple satisfies not only the additive inequality, as in Roux's method, but also all triangle inequalities. Our method thus ensures that the results are tree distances even when the observed dissimilarities are nonmetric. The method relies on the analytic solution of the least-squares projection onto a tree distance of the dissimilarities attached to a single quadruple. This goal is achieved by using geometric reasoning which also enables an easy proof of the algorithm's convergence. This proof is simpler and more complete than that of Roux (1988) and applies to other similar reduction methods based on local least-squares projection. The method is illustrated using Case's (1978) data. Finally, we provide a comparative study with simulated data and show that our method compares favorably with that of Studier and Keppler (1988) which follows in the ADDTREE tradition (Sattath and Tversky 1977). Moreover, this study seems to indicate that our method's results are generally close to the global optimum according to variance accounted for. We offer sincere thanks to Gilles Caraux, Bernard Fichet, Alain Guénoche, and Maurice Roux for helpful discussions, advice, and for reading the preliminary versions of this paper. We are grateful to three anonymous referees and to the editor for many insightful comments. This research was supported in part by the GREG and the IA2 network.

19.
Given a set of pairwise distances on a set of n points, constructing an edge-weighted tree whose leaves are these n points such that the tree distances mimic the original distances under some criterion is a fundamental problem. One such criterion is to preserve the ordinal relation between the pairwise distances. The ordinal relation can take the form of a total order on the distances, or it can be some partial order specified on the pairwise distances. We show that the problem of finding a weighted tree, if it exists, which would preserve the total order on pairwise distances is NP-hard. We also show the NP-hardness of the problem of finding a weighted tree which would preserve a particular kind of partial order called a triangle order, one of the most fundamental partial orders considered in computational biology.

20.
A mathematical programming algorithm is developed for fitting ultrametric or additive trees to proximity data where external constraints are imposed on the topology of the tree. The two procedures minimize a least squares loss function. The method is illustrated on both synthetic and real data. A constrained ultrametric tree analysis was performed on similarities between 32 subjects based on preferences for ten odors, while a constrained additive tree analysis was carried out on some proximity data between kinship terms. Finally, some extensions of the methodology to other tree fitting procedures are mentioned. The first author is supported as Bevoegdverklaard Navorser of the Belgian Nationaal Fonds voor Wetenschappelijk Onderzoek.
