共查询到20条相似文献,搜索用时 141 毫秒
1.
Pruning a decision tree is considered by some researchers to be the most important part of tree building in noisy domains. While there are many approaches to pruning, the alternative of averaging over decision trees has not received as much attention. The basic idea of tree averaging is to produce a weighted sum of decisions. We consider the set of trees used for the averaging process, and how weights should be assigned to each tree in this set. We define the concept of afanned set for a tree, and examine how the Minimum Message Length paradigm of learning may be used to average over decision trees. We perform an empirical evaluation of two averaging approaches, and a Minimum Message Length approach.This work has been carried out with the support of the Defence Research Agency, Malvern. 相似文献
2.
Consensus supertrees: The synthesis of rooted trees containing overlapping sets of labeled leaves 总被引:2,自引:2,他引:0
AD Gordon 《Journal of Classification》1986,3(2):335-348
Given two dendrograms (rooted tree diagrams) which have some but not all of their base points in common, a supertree is a dendrogram from which each of the original trees can be regarded as samples The distinction is made between inconsistent and consistent sample trees, defined by whether or not the samples provide contradictory information about the supertree An algorithm for obtaining the strict consensus supertree of two consistent sample trees is presented, as are procedures for merging two inconsistent sample trees Some suggestions for future work are made 相似文献
3.
Edward N. Adams III 《Journal of Classification》1986,3(2):299-317
Interpreting a taxonomic tree as a set of objects leads to natural measures of complexity and similarity, and sets natural lower bounds on a consensus tree Interpretations differing as to the kind of objects constituting a tree lead to different measures and consensus Subset nesting is preferred over the clusters (strict consensus) and even the triads interpretations because of its superior expression of shared structure Algorithms for computing the complexity and similarity of trees, as well as a consensus index onto [0,1], are presented for this interpretation The full consensus is defined as the only tree which includes all the nestings shared in a profile of rival trees and whose clusters reflect only nestings shared in the profile The full consensus is proved to exist uniquely for each profile, and to equal the Adams consensusThe author is grateful for the many helpful comments on presentation from Frances McA Adams, William H E Day, and Christopher A Meacham 相似文献
4.
Michael Steel 《Journal of Classification》1992,9(1):91-116
In taxonomy and other branches of classification it is useful to know when tree-like classifications on overlapping sets of
labels can be consistently combined into a parent tree. This paper considers the computation complexity of this problem. Recognizing
when a consistent parent tree exists is shown to be intractable (NP-complete) for sets of unrooted trees, even when each tree
in the set classifies just four labels. Consequently determining the compatibility of qualitative characters and partial binary
characters is, in general, also NP-complete. However for sets of rooted trees an algorithm is described which constructs the
“strict consensus tree” of all consistent parent trees (when they exist) in polynomial time. The related question of recognizing
when a set of subtrees uniquely defines a parent tree is also considered, and a simple necessary and sufficient condition
is described for rooted trees.
This work was supproted by the Alexander von Humoldt-Stiftung. I wish to thank Andreas Dress, Hans-Jürgen Bandelt and the
referees for their helpful comments. 相似文献
5.
George W. Furnas 《Journal of Classification》1984,1(1):187-233
Several techniques are given for the uniform generation of trees for use in Monte Carlo studies of clustering and tree representations. First, general strategies are reviewed for random selection from a set of combinatorial objects with special emphasis on two that use random mapping operations. Theorems are given on how the number of such objects in the set (e.g., whether the number is prime) affects which strategies can be used. Based on these results, methods are presented for the random generation of six types of binary unordered trees. Three types of labeling and both rooted and unrooted forms are considered. Presentation of each method includes the theory of the method, the generation algorithm, an analysis of its computational complexity and comments on the distribution of trees over which it samples. Formal proofs and detailed algorithms are in appendices. 相似文献
6.
A clustering that consists of a nested set of clusters may be represented graphically by a tree. In contrast, a clustering
that includes non-nested overlapping clusters (sometimes termed a “nonhierarchical” clustering) cannot be represented by a
tree. Graphical representations of such non-nested overlapping clusterings are usually complex and difficult to interpret.
Carroll and Pruzansky (1975, 1980) suggested representing non-nested clusterings with multiple ultrametric or additive trees.
Corter and Tversky (1986) introduced the extended tree (EXTREE) model, which represents a non-nested structure as a tree plus
overlapping clusters that are represented by marked segments in the tree. We show here that the problem of finding a nested
(i.e., tree-structured) set of clusters in an overlapping clustering can be reformulated as the problem of finding a clique
in a graph. Thus, clique-finding algorithms can be used to identify sets of clusters in the solution that can be represented
by trees. This formulation provides a means of automatically constructing a multiple tree or extended tree representation
of any non-nested clustering. The method, called “clustrees”, is applied to several non-nested overlapping clusterings derived
using the MAPCLUS program (Arabie and Carroll 1980). 相似文献
7.
Spectral analysis of phylogenetic data 总被引:12,自引:0,他引:12
The spectral analysis of sequence and distance data is a new approach to phylogenetic analysis. For two-state character sequences,
the character values at a given site split the set of taxa into two subsets, a bipartition of the taxa set. The vector which
counts the relative numbers of each of these bipartitions over all sites is called a sequence spectrum. Applying a transformation
called a Hadamard conjugation, the sequence spectrum is transformed to the conjugate spectrum. This conjugation corrects for
unobserved changes in the data, independently from the choice of phylogenetic tree. For any given phylogenetic tree with edge
weights (probabilities of state change), we define a corresponding tree spectrum. The selection of a weighted phylogenetic
tree from the given sequence data is made by matching the conjugate spectrum with a tree spectrum. We develop an optimality
selection procedure using a least squares best fit, to find the phylogenetic tree whose tree spectrum most closely matches
the conjugate spectrum. An inferred sequence spectrum can be derived from the selected tree spectrum using the inverse Hadamard
conjugation to allow a comparison with the original sequence spectrum.
A possible adaptation for the analysis of four-state character sequences with unequal frequencies is considered. A corresponding
spectral analysis for distance data is also introduced. These analyses are illustrated with biological examples for both distance
and sequence data. Spectral analysis using the Fast Hadamard transform allows optimal trees to be found for at least 20 taxa
and perhaps for up to 30 taxa.
The development presented here is self contained, although some mathematical proofs available elsewhere have been omitted.
The analysis of sequence data is based on methods reported earlier, but the terminology and the application to distance data
are new. 相似文献
8.
Fionn Murtagh 《Journal of Classification》2007,24(1):3-32
We describe a new wavelet transform, for use on hierarchies or binary rooted trees. The theoretical framework of this approach
to data analysis is described. Case studies are used to further exemplify this approach. A first set of application studies
deals with data array smoothing, or filtering. A second set of application studies relates to hierarchical tree condensation.
Finally, a third study explores the wavelet decomposition, and the reproducibility of data sets such as text, including a
new perspective on the generation or computability of such data objects. 相似文献
9.
We introduce a test for detecting multimodality in distributions based on minimal constrained spanning trees. We define a Minimal Ascending Path Spanning Tree (MAPST) on a set of points as a spanning tree that has the minimal possible sum of lengths of links with the constraint that starting from any link, the lengths of the links are non-increasing towards a root node. We define similarly MAPSTs with more than one root. We present some algorithms for finding such trees. Based on these trees, we devise a test for multimodality, called the MAP Test (for Minimal Ascending Path). Using simulations, we estimate percentage points of the MAP statistic and assess the power of the test. Finally, we illustrate the use of MAPSTs for determining the number of modes in a distribution of positions of galaxies on photographic plates from a rich galaxy cluster. 相似文献
10.
Least squares algorithms for constructing constrained ultrametric and additive tree representations of symmetric proximity data 总被引:1,自引:1,他引:0
A mathematical programming algorithm is developed for fitting ultrametric or additive trees to proximity data where external constraints are imposed on the topology of the tree. The two procedures minimize a least squares loss function. The method is illustrated on both synthetic and real data. A constrained ultrametric tree analysis was performed on similarities between 32 subjects based on preferences for ten odors, while a constrained additive tree analysis was carried out on some proximity data between kinship terms. Finally, some extensions of the methodology to other tree fitting procedures are mentioned.The first author is supported as Bevoegdverklaard Navorser of the Belgian Nationaal Fonds voor Wetenschappelijk Onderzoek. 相似文献
11.
Optimal variable weighting for hierarchical clustering: An alternating least-squares algorithm 总被引:4,自引:4,他引:0
This paper presents the development of a new methodology which simultaneously estimates in a least-squares fashion both an ultrametric tree and respective variable weightings for profile data that have been converted into (weighted) Euclidean distances. We first review the relevant classification literature on this topic. The new methodology is presented including the alternating least-squares algorithm used to estimate the parameters. The method is applied to a synthetic data set with known structure as a test of its operation. An application of this new methodology to ethnic group rating data is also discussed. Finally, extensions of the procedure to model additive, multiple, and three-way trees are mentioned.The first author is supported as Bevoegdverklaard Navorser of the Belgian Nationaal Fonds voor Wetenschappelijk Onderzoek. 相似文献
12.
Single linkage clusters on a set of points are the maximal connected sets in a graph constructed by connecting all points
closer than a given threshold distance. The complete set of single linkage clusters is obtained from all the graphs constructed
using different threshold distances. The set of clusters forms a hierarchical tree, in which each non-singleton cluster divides
into two or more subclusters; the runt size for each single linkage cluster is the number of points in its smallest subcluster.
The maximum runt size over all single linkage clusters is our proposed test statistic for assessing multimodality. We give
significance levels of the test for two null hypotheses, and consider its power against some bimodal alternatives.
Research partially supported by NSF Grant No. DMS-8617919. 相似文献
13.
Optimal algorithms for comparing trees with labeled leaves 总被引:2,自引:1,他引:1
William H. E. Day 《Journal of Classification》1985,2(1):7-28
LetR
n
denote the set of rooted trees withn leaves in which: the leaves are labeled by the integers in {1, ...,n}; and among interior vertices only the root may have degree two. Associated with each interior vertexv in such a tree is the subset, orcluster, of leaf labels in the subtree rooted atv. Cluster {1, ...,n} is calledtrivial. Clusters are used in quantitative measures of similarity, dissimilarity and consensus among trees. For anyk trees inR
n
, thestrict consensus tree C(T
1, ...,T
k
) is that tree inR
n
containing exactly those clusters common to every one of thek trees. Similarity between treesT
1 andT
2 inR
n
is measured by the numberS(T
1,T
2) of nontrivial clusters in bothT
1 andT
2; dissimilarity, by the numberD(T
1,T
2) of clusters inT
1 orT
2 but not in both. Algorithms are known to computeC(T
1, ...,T
k
) inO(kn
2) time, andS(T
1,T
2) andD(T
1,T
2) inO(n
2) time. I propose a special representation of the clusters of any treeT R
n
, one that permits testing in constant time whether a given cluster exists inT. I describe algorithms that exploit this representation to computeC(T
1, ...,T
k
) inO(kn) time, andS(T
1,T
2) andD(T
1,T
2) inO(n) time. These algorithms are optimal in a technical sense. They enable well-known indices of consensus between two trees to be computed inO(n) time. All these results apply as well to comparable problems involving unrooted trees with labeled leaves.The Natural Sciences and Engineering Research Council of Canada partially supported this work with grant A-4142. 相似文献
14.
A new method, TreeOfTrees, is proposed to compare X-tree structures obtained from several sets of aligned gene sequences of
the same taxa. Its aim is to detect genes or sets of genes having different evolutionary histories. The comparison between
sets of trees is based on several tree metrics, leading to a unique tree labelled by the gene trees. The robustness values
of its edges are estimated by bootstrapping and consensus procedures that allow detecting subsets of genes having differently
evolved. Simulations are performed under various evolutionary conditions to test the efficiency of the method and an application
on real data is described. Tests of arboricity and various consensus algorithms are also discussed. A corresponding software
package is available. 相似文献
15.
The character and OTU stability of classifications based on UPGMA clustering and maximum parsimony (MP) trees were compared
for 5 datasets (families of angiosperms, families of orthopteroid insects, species of the fish genusIctalurus, genera of the salamander family Salamandridae, and genera of the frog family Myobatrachidae). Stability was investigated
by taking different sized random subsamples of OTUs or characters, computing UPGMA clusters and an MP tree, and then comparing
the resulting trees with those based on the entire dataset. Agreement was measured by two consensus indices, that of Colless,
computed from strict consensus trees, and Stinebrickner's 0.5-consensus index.
Tests of character stability generally showed a monotone decrease in agreement with the standard as smaller sets of characters
are considered. The relative success of the two methods depended upon the dataset. Tests of OTU stability showed a monotone
decrease in agreement for UPGMA as smaller sets of OTUs are considered. But for MP, agreement decreased and then increased
again on the same scale. The apparent superiority of UPGMA relative to MP with respect to OTU stability depended upon the
dataset. Considerations other than stability, such as computer efficiency or accuracy, will also determine the method of choice
for classifications. 相似文献
16.
Seong Keon Lee 《Journal of Classification》2006,23(1):123-141
In many application fields, multivariate approaches that simultaneously consider the correlation between responses are needed.
The tree method can be extended to multivariate responses, such as repeated measure and longitudinal data, by modifying the
split function so as to accommodate multiple responses. Recently, researchers have constructed some decision trees for multiple
continuous longitudinal response and multiple binary responses using Mahalanobis distance and a generalized entropy index.
However, these methods have limitations according to the type of response, that is, those that are only continuous or binary.
In this paper, we will modify the tree for univariate response procedure and suggest a new tree-based method that can analyze
any type of multiple responses by using GEE (generalized estimating equations) techniques. To compare the performance of trees,
simulation studies on selection probability of true split variable will be shown. Finally, applications using epileptic seizure
data and WWW data are introduced. 相似文献
17.
An asymmetric multidimensional scaling model and an associated nonmetric algorithm to analyze two-mode three-way proximities (object × object × source) are introduced. The model consists of a common object configuration and two kinds of weights, i.e., for both symmetry and asymmetry. In the common object configuration, each object is represented by a point and a circle (sphere, hypersphere) in a Euclidean space. The common object configuration represents pairwise proximity relationships between pairs of objects for the ‘group’ of all sources. Each source has its own symmetry weight and a set of asymmetry weights. Symmetry weights represent individual differences among sources of data in symmetric proximity relationships, and asymmetry weights represent individual differences among sources in asymmetric proximity relationships. The associated nonmetric algorithm, based on Kruskal’s (1964b) nonmetric multidimensional scaling algorithm, is an extension of the algorithm for the asymmetric multidimensional scaling of one mode two-way proximities developed earlier (Okada and Imaizumi 1987). As an illustrative example, we analyze intergenerational occupational mobility from 1955 to 1985 in Japan among eight occupational categories. 相似文献
18.
Probabilistic feature models (PFMs) can be used to explain binary rater judgements about the associations between two types of elements (e.g., objects and attributes) on the basis of binary latent features. In particular, to explain observed object-attribute associations PFMs assume that respondents classify both objects and attributes with respect to a, usually small, number of binary latent features, and that the observed object-attribute association is derived as a specific mapping of these classifications. Standard PFMs assume that the object-attribute association probability is the same according to all respondents, and that all observations are statistically independent. As both assumptions may be unrealistic, a multilevel latent class extension of PFMs is proposed which allows objects and/or attribute parameters to be different across latent rater classes, and which allows to model dependencies between associations with a common object (attribute) by assuming that the link between features and objects (attributes) is fixed across judgements. Formal relationships with existing multilevel latent class models for binary three-way data are described. As an illustration, the models are used to study rater differences in product perception and to investigate individual differences in the situational determinants of anger-related behavior. 相似文献
19.
A method is presented for the graphic display of proximity matrices as a complement to the common data analysis techniques of hierarchical clustering. The procedure involves the use of computer generated shaded matrices based on unclassed choropleth mapping in conjunction with a strategy for matrix reorganization. The latter incorporates a combination of techniques for seriation and the ordering of binary trees.Partial support for this research was provided by NIJ Grant #82-IJ-CX-0019 and NSF Grant #SES82-06067. The authors wish to acknowledge the assistance of Professors L.J. Hubert, R.G. Golledge, and W.R. Tobler. 相似文献
20.
Givenk rooted binary treesA
1, A2, ..., Ak, with labeled leaves, we generateC, a unique system of lineage constraints on common ancestors. We then present an algorithm for constructing the set of rooted binary treesB, compatible with all ofA
1, A2, ..., Ak. The running time to obtain one such supertree isO(k
2 n2), wheren is the number of distinct leaves in all of the treesA
1, A2, ..., Ak. 相似文献