首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 312 毫秒
1.
In many application fields, multivariate approaches that simultaneously consider the correlation between responses are needed. The tree method can be extended to multivariate responses, such as repeated measure and longitudinal data, by modifying the split function so as to accommodate multiple responses. Recently, researchers have constructed some decision trees for multiple continuous longitudinal response and multiple binary responses using Mahalanobis distance and a generalized entropy index. However, these methods have limitations according to the type of response, that is, those that are only continuous or binary. In this paper, we will modify the tree for univariate response procedure and suggest a new tree-based method that can analyze any type of multiple responses by using GEE (generalized estimating equations) techniques. To compare the performance of trees, simulation studies on selection probability of true split variable will be shown. Finally, applications using epileptic seizure data and WWW data are introduced.  相似文献   

2.
K -means partitioning. We also describe some new features and improvements to the algorithm proposed by De Soete. Monte Carlo simulations have been conducted using different error conditions. In all cases (i.e., ultrametric or additive trees, or K-means partitioning), the simulation results indicate that the optimal weighting procedure should be used for analyzing data containing noisy variables that do not contribute relevant information to the classification structure. However, if the data involve error-perturbed variables that are relevant to the classification or outliers, it seems better to cluster or partition the entities by using variables with equal weights. A new computer program, OVW, which is available to researchers as freeware, implements improved algorithms for optimal variable weighting for ultrametric and additive tree clustering, and includes a new algorithm for optimal variable weighting for K-means partitioning.  相似文献   

3.
Trees, and particularly binary trees, appear frequently in the classification literature. When studying the properties of the procedures that fit trees to sets of data, direct analysis can be too difficult, and Monte Carlo simulations may be necessary, requiring the implementation of algorithms for the generation of certain families of trees at random. In the present paper we use the properties of Prufer's enumeration of the set of completely labeled trees to obtain algorithms for the generation of completely labeled, as well as terminally labeled t-ary (and in particular binary) trees at random, i.e., with uniform distribution. Actually, these algorithms are general in that they can be used to generate random trees from any family that can be characterized in terms of the node degrees. The algorithms presented here are as fast as (in the case of terminally labeled trees) or faster than (in the case of completely labeled trees) any other existing procedure, and the memory requirements are minimal. Another advantage over existing algorithms is that there is no need to store pre-calculated tables.  相似文献   

4.
Two algorithms for pyramidal classification — a generalization of hierarchical classification — are presented that can work with incomplete dissimilarity data. These approaches — a modification of the pyramidal ascending classification algorithm and a least squares based penalty method — are described and compared using two different types of complete dissimilarity data in which randomly chosen dissimilarities are assumed missing and the non-missing ones are subjected to random error. We also consider relationships between hierarchical classification and pyramidal classification solutions when both are based on incomplete dissimilarity data.  相似文献   

5.
We describe a novel extension to the Class-Cover-Catch-Digraph (CCCD) classifier, specifically tuned to detection problems. These are two-class classification problems where the natural priors on the classes are skewed by several orders of magnitude. The emphasis of the proposed techniques is in computationally efficient classification for real-time applications. Our principal contribution consists of two boosted classi- fiers built upon the CCCD structure, one in the form of a sequential decision process and the other in the form of a tree. Both of these classifiers achieve performances comparable to that of the original CCCD classifiers, but at drastically reduced computational expense. An analysis of classification performance and computational cost is performed using data from a face detection application. Comparisons are provided with Support Vector Machines (SVM) and reduced SVMs. These comparisons show that while some SVMs may achieve higher classification performance, their computational burden can be so high as to make them unusable in real-time applications. On the other hand, the proposed classifiers combine high detection performance with extremely fast classification.  相似文献   

6.
Many methods and algorithms to generate random trees of many kinds have been proposed in the literature. No procedure exists however for the generation of dendrograms with randomized fusion levels. Randomized dendrograms can be obtained by randomizing the associated cophenetic matrix. Two algorithms are described. The first one generates completely random dendrograms, i.e., trees with a random topology, random fusion level values, and random assignment of the labels. The second algorithm uses a double-permutation procedure to randomize a given dendrogram; it proceeds by randomization of the fixed fusion levels, instead of using random fusion level values. A proof is presented that the double-permutation procedure is a Uniform Random Generation Algorithmsensu Furnas (1984), and a complete example is given. This work was supported by NSERC Grant No. A7738 to P. Legendre and by a NSERC scholarship to F.-J. Lapointe.  相似文献   

7.
The Neighbor-Joining (NJ) method of Saitou and Nei is the most widely used distance based method in phylogenetic analysis. Central to the method is the selection criterion, the formula used to choose which pair of objects to amalgamate next. Here we analyze the NJ selection criterion using an axiomatic approach. We show that any selection criterion that is linear, permutation equivariant, statistically consistent and based solely on distance data will give the same trees as those created by NJ.  相似文献   

8.
This paper introduces a novel mixture model-based approach to the simultaneous clustering and optimal segmentation of functional data, which are curves presenting regime changes. The proposed model consists of a finite mixture of piecewise polynomial regression models. Each piecewise polynomial regression model is associated with a cluster, and within each cluster, each piecewise polynomial component is associated with a regime (i.e., a segment). We derive two approaches to learning the model parameters: the first is an estimation approach which maximizes the observed-data likelihood via a dedicated expectation-maximization (EM) algorithm, then yielding a fuzzy partition of the curves into K clusters obtained at convergence by maximizing the posterior cluster probabilities. The second is a classification approach and optimizes a specific classification likelihood criterion through a dedicated classification expectation-maximization (CEM) algorithm. The optimal curve segmentation is performed by using dynamic programming. In the classification approach, both the curve clustering and the optimal segmentation are performed simultaneously as the CEM learning proceeds. We show that the classification approach is a probabilistic version generalizing the deterministic K-means-like algorithm proposed in Hébrail, Hugueney, Lechevallier, and Rossi (2010). The proposed approach is evaluated using simulated curves and real-world curves. Comparisons with alternatives including regression mixture models and the K-means-like algorithm for piecewise regression demonstrate the effectiveness of the proposed approach.  相似文献   

9.
A comparison between two distance-based discriminant principles   总被引:1,自引:1,他引:0  
A distance-based classification procedure suggested by Matusita (1956) has long been available as an alternative to the usual Bayes decision rule. Unsatisfactory features of both approaches when applied to multinomial data led Goldstein and Dillon (1978) to propose a new distance-based principle for classification. We subject the Goldstein/Dillon principle to some theoretical scrutiny by deriving the population classification rules appropriate not only to multinomial data but also to multivariate normal and mixed multinomial/multinormal data. These rules demonstrate equivalence of the Goldstein/Dillon and Matusita approaches for the first two data types, and similar equivalence is conjectured (but not explicitly obtained) for the mixed data case. Implications for sample-based rules are noted.  相似文献   

10.
We describe a simple time series transformation to detect differences in series that can be accurately modelled as stationary autoregressive (AR) processes. The transformation involves forming the histogram of above and below the mean run lengths. The run length (RL) transformation has the benefits of being very fast, compact and updatable for new data in constant time. Furthermore, it can be generated directly from data that has already been highly compressed. We first establish the theoretical asymptotic relationship between run length distributions and AR models through consideration of the zero crossing probability and the distribution of runs. We benchmark our transformation against two alternatives: the truncated Autocorrelation function (ACF) transform and the AR transformation, which involves the standard method of fitting the partial autocorrelation coefficients with the Durbin-Levinson recursions and using the Akaike Information Criterion stopping procedure. Whilst optimal in the idealized scenario, representing the data in these ways is time consuming and the representation cannot be updated online for new data. We show that for classification problems the accuracy obtained through using the run length distribution tends towards that obtained from using the full fitted models. We then propose three alternative distance measures for run length distributions based on Gower’s general similarity coefficient, the likelihood ratio and dynamic time warping (DTW). Through simulated classification experiments we show that a nearest neighbour distance based on DTW converges to the optimal faster than classifiers based on Euclidean distance, Gower’s coefficient and the likelihood ratio. We experiment with a variety of classifiers and demonstrate that although the RL transform requires more data than the best performing classifier to achieve the same accuracy as AR or ACF, this factor is at worst non-increasing with the series length, m, whereas the relative time taken to fit AR and ACF increases with m. We conclude that if the data is stationary and can be suitably modelled by an AR series, and if time is an important factor in reaching a discriminatory decision, then the run length distribution transform is a simple and effective transformation to use.  相似文献   

11.
This paper presents the development of a new methodology which simultaneously estimates in a least-squares fashion both an ultrametric tree and respective variable weightings for profile data that have been converted into (weighted) Euclidean distances. We first review the relevant classification literature on this topic. The new methodology is presented including the alternating least-squares algorithm used to estimate the parameters. The method is applied to a synthetic data set with known structure as a test of its operation. An application of this new methodology to ethnic group rating data is also discussed. Finally, extensions of the procedure to model additive, multiple, and three-way trees are mentioned.The first author is supported as Bevoegdverklaard Navorser of the Belgian Nationaal Fonds voor Wetenschappelijk Onderzoek.  相似文献   

12.
There have been many comparative studies of classification methods in which real datasets are used as a gauge to assess the relative performance of the methods. Since these comparisons often yield inconclusive or limited results on how methods perform, it is often believed that a broader approach combining these studies would shed some light on this difficult question. This paper describes such an attempt: we have sampled the available literature and created a dataset of 5807 classification results. We show that one of the possible ways to analyze the resulting data is an overall assessment of the classification methods, and we present methods for that particular aim. The merits and demerits of such an approach are discussed, and conclusions are drawn which may assist future research: we argue that the current state of the literature hardly allows large-scale investigations. This work was sponsored by the MOD Corporate Research Programme, CISP, as part of a larger project on technology assessment. We would like to express our appreciation to Andrew Webb for his support throughout the entire project, and to Wojtek Krzanowski for valuable comments on a draft of this paper. We would also like to thank the anonymous referees for some very interesting comments, some of which we hope to pursue in future work.  相似文献   

13.
Tree enumeration modulo a consensus   总被引:1,自引:1,他引:0  
The number of trees withn labeled terminal vertices grows too rapidly withn to permit exhaustive searches for Steiner trees or other kinds of optima in cladistics and related areas Often, however, structured constraints are known and may be imposed on the set of trees to be scanned These constraints may be formulated in terms of a consensus among the trees to be searched We calculate the reduction in the number of trees to be enumerated as a function of properties of the imposed consensusThis work was supported in part by the Natural Sciences and Engineering Research Council of Canada through operating grant A8867 to D Sankoff and infrastructure grant A3092 to D Sankoff, R J Cedergren and G Lapalme We are grateful to William H E Day for much encouragement and many helpful suggestions  相似文献   

14.
The main aim of this work is the study of clustering dependent data by means of copula functions. Copulas are popular multivariate tools whose importance within clustering methods has not been investigated yet in detail. We propose a new algorithm (CoClust in brief) that allows to cluster dependent data according to the multivariate structure of the generating process without any assumption on the margins. Moreover, the approach does not require either to choose a starting classification or to set a priori the number of clusters; in fact, the CoClust selects them by using a criterion based on the log–likelihood of a copula fit. We test our proposal on simulated data for different dependence scenarios and compare it with a model–based clustering technique. Finally, we show applications of the CoClust to real microarray data of breast-cancer patients.  相似文献   

15.
Ordered set theory provides efficient tools for the problems of comparison and consensus of classifications Here, an overview of results obtained by the ordinal approach is presented Latticial or semilatticial structures of the main sets of classification models are described Many results on partitions are adaptable to dendrograms; many results on n-trees hold in any median semilattice and thus have counterparts on ordered trees and Buneman (phylogenetic) trees For the comparison of classifications, the semimodularity of the ordinal structures involved yields computable least-move metrics based on weighted or unweighted elementary transformations In the unweighted case, these metrics have simple characteristic properties For the consensus of classifications, the constructive, axiomatic, and optimization approaches are considered Natural consensus rules (majoritary, oligarchic, ) have adequate ordinal formalizations A unified presentation of Arrow-like characterization results is given In the cases of n-trees, ordered trees and Buneman trees, the majority rule is a significant example where the three approaches convergeThe authors would like to thank the anonymous referees for helpful suggestions on the first draft of this paper, and W H E Day for his comments and his significant improvements of style  相似文献   

16.
We propose a new nonparametric family of oscillation heuristics for improving linear classifiers in the two-group discriminant problem. The heuristics are motivated by the intuition that the classification accuracy of a separating hyperplane can be improved through small perturbations to its slope and position, accomplished by substituting training observations near the hyperplane for those used to generate it. In an extensive simulation study, using data generated from multivariate normal distributions under a variety of conditions, the oscillation heuristics consistently improve upon the classical linear and logistic discriminant functions, as well as two published linear programming-based heuristics and a linear Support Vector Machine. Added to any of the methods above, they approach, and frequently attain, the best possible accuracy on the training samples, as determined by a mixed-integer programming (MIP) model, at a much smaller computational cost. They also improve expected accuracy on the overall populations when the populations overlap significantly and the heuristics are trained with large samples, at least in situations where the data conditions do not explicitly favor a particular classifier.  相似文献   

17.
Recognizing the successes of treed Gaussian process (TGP) models as an interpretable and thrifty model for nonparametric regression, we seek to extend the model to classification. Both treed models and Gaussian processes (GPs) have, separately, enjoyed great success in application to classification problems. An example of the former is Bayesian CART. In the latter, real-valued GP output may be utilized for classification via latent variables, which provide classification rules by means of a softmax function. We formulate a Bayesian model averaging scheme to combine these two models and describe a Monte Carlo method for sampling from the full posterior distribution with joint proposals for the tree topology and the GP parameters corresponding to latent variables at the leaves. We concentrate on efficient sampling of the latent variables, which is important to obtain good mixing in the expanded parameter space. The tree structure is particularly helpful for this task and also for developing an efficient scheme for handling categorical predictors, which commonly arise in classification problems. Our proposed classification TGP (CTGP) methodology is illustrated on a collection of synthetic and real data sets. We assess performance relative to existing methods and thereby show how CTGP is highly flexible, offers tractable inference, produces rules that are easy to interpret, and performs well out of sample.  相似文献   

18.
We describe a new wavelet transform, for use on hierarchies or binary rooted trees. The theoretical framework of this approach to data analysis is described. Case studies are used to further exemplify this approach. A first set of application studies deals with data array smoothing, or filtering. A second set of application studies relates to hierarchical tree condensation. Finally, a third study explores the wavelet decomposition, and the reproducibility of data sets such as text, including a new perspective on the generation or computability of such data objects.  相似文献   

19.
Variable Selection for Clustering and Classification   总被引:2,自引:2,他引:0  
As data sets continue to grow in size and complexity, effective and efficient techniques are needed to target important features in the variable space. Many of the variable selection techniques that are commonly used alongside clustering algorithms are based upon determining the best variable subspace according to model fitting in a stepwise manner. These techniques are often computationally intensive and can require extended periods of time to run; in fact, some are prohibitively computationally expensive for high-dimensional data. In this paper, a novel variable selection technique is introduced for use in clustering and classification analyses that is both intuitive and computationally efficient. We focus largely on applications in mixture model-based learning, but the technique could be adapted for use with various other clustering/classification methods. Our approach is illustrated on both simulated and real data, highlighted by contrasting its performance with that of other comparable variable selection techniques on the real data sets.  相似文献   

20.
One of the most important problems in classification is that of quantitative comparison of hierarchical trees. In this note we answer an open problem of Culík and Wood (1982) concerning the nearest neighbor interchange metric by proving that its underlying decision problem is NP-complete.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号