Similar Articles
20 similar articles found
1.
The mixture method of clustering applied to three-way data
Clustering or classifying individuals into groups such that there is relative homogeneity within the groups and heterogeneity between the groups is a problem which has been considered for many years. Most available clustering techniques are applicable only to a two-way data set, where one of the modes is to be partitioned into groups on the basis of the other mode. Suppose, however, that the data set is three-way. Then what is needed is a multivariate technique which will cluster one of the modes on the basis of both of the other modes simultaneously. It is shown that by appropriate specification of the underlying model, the mixture maximum likelihood approach to clustering can be applied in the context of a three-way table. It is illustrated using a soybean data set which consists of multiattribute measurements on a number of genotypes each grown in several environments. Although the problem is set in the framework of clustering genotypes, the technique is applicable to other types of three-way data sets.
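
As a rough, hedged illustration of the general idea (not the paper's mixture model for three-way data), one can flatten a genotype × environment × attribute array into a genotype × (environment·attribute) matrix and fit an ordinary Gaussian mixture; all sizes and names below are hypothetical.

```python
# Rough stand-in for mixture-model clustering of a three-way array:
# flatten genotypes x environments x attributes into genotypes x (environments*attributes)
# and fit a Gaussian mixture. Shapes and names are illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_genotypes, n_envs, n_attrs = 58, 4, 3                # hypothetical sizes
X3 = rng.normal(size=(n_genotypes, n_envs, n_attrs))   # placeholder three-way data

X2 = X3.reshape(n_genotypes, n_envs * n_attrs)         # one row per genotype
gm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(X2)
groups = gm.predict(X2)                                # cluster label for each genotype
print(np.bincount(groups))
```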

2.
We describe a novel extension to the Class-Cover-Catch-Digraph (CCCD) classifier, specifically tuned to detection problems. These are two-class classification problems where the natural priors on the classes are skewed by several orders of magnitude. The emphasis of the proposed techniques is on computationally efficient classification for real-time applications. Our principal contribution consists of two boosted classifiers built upon the CCCD structure, one in the form of a sequential decision process and the other in the form of a tree. Both of these classifiers achieve performances comparable to that of the original CCCD classifiers, but at drastically reduced computational expense. An analysis of classification performance and computational cost is performed using data from a face detection application. Comparisons are provided with Support Vector Machines (SVM) and reduced SVMs. These comparisons show that while some SVMs may achieve higher classification performance, their computational burden can be so high as to make them unusable in real-time applications. On the other hand, the proposed classifiers combine high detection performance with extremely fast classification.
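
A minimal sketch of the underlying CCCD idea, assuming Euclidean distances and keeping every target ball rather than a reduced cover; the boosted sequential and tree classifiers of the paper are not reproduced here, and the toy data are made up.

```python
# Minimal class-cover-catch-digraph style detector (not the boosted variants in the paper):
# each target-class training point gets a ball whose radius is the distance to its
# nearest non-target point; a query is flagged as "target" if any ball covers it.
import numpy as np

def fit_cccd(X_target, X_other):
    # radius of each target ball = distance to the closest non-target point
    d = np.linalg.norm(X_target[:, None, :] - X_other[None, :, :], axis=2)
    return X_target, d.min(axis=1)

def predict_cccd(centers, radii, X_query):
    d = np.linalg.norm(X_query[:, None, :] - centers[None, :, :], axis=2)
    return (d <= radii[None, :]).any(axis=1)      # True -> classified as target class

# toy usage with synthetic "face" (target) and "clutter" (non-target) points
rng = np.random.default_rng(1)
faces, clutter = rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (200, 2))
centers, radii = fit_cccd(faces, clutter)
print(predict_cccd(centers, radii, np.array([[0.0, 0.0], [4.0, 4.0]])))
```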

3.
In this paper, we present empirical and theoretical results on classification trees for randomized response data. We considered a dichotomous sensitive response variable with the true status intentionally misclassified by the respondents using rules prescribed by a randomized response method. We assumed that classification trees are grown using the Pearson chi-square test as a splitting criterion, and that the randomized response data are analyzed using classification trees as if they were not perturbed. We proved that classification trees analyzing observed randomized response data and estimated true data have a one-to-one correspondence in terms of ranking the splitting variables. This is illustrated using two real data sets.
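
A hedged illustration of the setting (not the paper's tree-growing code): the Pearson chi-square statistic for one candidate split, computed on observed randomized responses and on moment-corrected counts under a Warner-type design. The design probability p and the counts are made up; the paper's result concerns how such statistics rank competing splitting variables.

```python
# Pearson chi-square splitting statistic on observed (perturbed) responses versus on
# moment-corrected counts for a Warner-type randomized response design with known p.
import numpy as np
from scipy.stats import chi2_contingency

def corrected_counts(obs_yes, n, p):
    # Warner model: P(observed "yes") = pi*p + (1 - pi)*(1 - p)  =>  invert for pi
    lam = obs_yes / n
    pi_hat = np.clip((lam - (1 - p)) / (2 * p - 1), 0, 1)
    return np.column_stack([pi_hat * n, (1 - pi_hat) * n])

p = 0.75                                         # assumed design probability
# observed "yes"/"no" counts in the two child nodes of a candidate split (toy numbers)
observed = np.array([[60.0, 40.0], [45.0, 55.0]])
chi2_obs = chi2_contingency(observed, correction=False)[0]
chi2_est = chi2_contingency(corrected_counts(observed[:, 0], observed.sum(axis=1), p),
                            correction=False)[0]
print(chi2_obs, chi2_est)                        # statistics used to rank candidate splits
```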

4.
Functional data sets appear in many areas of science. Although each data point may be seen as a large finite-dimensional vector it is preferable to think of them as functions, and many classical multivariate techniques have been generalized for this kind of data. A widely used technique for dealing with functional data is to choose a finite-dimensional basis and find the best projection of each curve onto this basis. Therefore, given a functional basis, an approach for doing curve clustering relies on applying the k-means methodology to the fitted basis coefficients corresponding to all the curves in the data set. Unfortunately, a serious drawback follows from the lack of robustness of k-means. Trimmed k-means clustering (Cuesta-Albertos, Gordaliza, and Matran 1997) provides a robust alternative to the use of k-means and, consequently, it may be successfully used in this functional framework. The proposed approach will be exemplified by considering cubic B-splines bases, but other bases can be applied analogously depending on the application at hand.
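
A sketch of the basis-projection workflow under stated assumptions: cubic B-spline coefficients are fitted per curve and k-means is run on them, followed by a crude trim-and-refit step. This is not the trimmed k-means algorithm of Cuesta-Albertos, Gordaliza, and Matran; the curves and trimming fraction are synthetic choices.

```python
# Project each curve onto a cubic B-spline basis, cluster the coefficients with k-means,
# then discard the most outlying curves and refit (a crude stand-in for trimmed k-means).
import numpy as np
from sklearn.preprocessing import SplineTransformer
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)                                            # common grid
curves = np.sin(2 * np.pi * t) + rng.normal(0, 0.3, (60, t.size))     # toy functional data

B = SplineTransformer(degree=3, n_knots=8, include_bias=True).fit_transform(t[:, None])
coefs, *_ = np.linalg.lstsq(B, curves.T, rcond=None)   # basis coefficients per curve
coefs = coefs.T

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coefs)
dist = np.linalg.norm(coefs - km.cluster_centers_[km.labels_], axis=1)
keep = dist <= np.quantile(dist, 0.9)                  # trim the 10% most outlying curves
km_trimmed = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coefs[keep])
print(np.bincount(km_trimmed.labels_))
```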

5.
The main aim of this work is the study of clustering dependent data by means of copula functions. Copulas are popular multivariate tools whose importance within clustering methods has not yet been investigated in detail. We propose a new algorithm (CoClust in brief) that clusters dependent data according to the multivariate structure of the generating process without any assumption on the margins. Moreover, the approach requires neither a starting classification nor an a priori choice of the number of clusters; the CoClust selects them by using a criterion based on the log-likelihood of a copula fit. We test our proposal on simulated data for different dependence scenarios and compare it with a model-based clustering technique. Finally, we show applications of the CoClust to real microarray data of breast-cancer patients.
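
Purely to illustrate the kind of fit criterion mentioned above, the sketch below computes the log-likelihood of a Gaussian copula with empirical margins for a block of variables; the CoClust algorithm itself (cluster allocation, copula family selection) is not implemented, and the data are simulated.

```python
# Log-likelihood of a Gaussian copula fitted with empirical margins: rank-transform each
# column to pseudo-observations, map to normal scores, estimate the copula correlation,
# and evaluate the copula density at the data.
import numpy as np
from scipy.stats import norm, rankdata, multivariate_normal

def gaussian_copula_loglik(X):
    n, d = X.shape
    U = np.apply_along_axis(rankdata, 0, X) / (n + 1)    # pseudo-observations in (0, 1)
    Z = norm.ppf(U)                                      # normal scores
    R = np.corrcoef(Z, rowvar=False)                     # copula correlation estimate
    joint = multivariate_normal(mean=np.zeros(d), cov=R).logpdf(Z).sum()
    return joint - norm.logpdf(Z).sum()                  # copula density log-likelihood

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0], [[1, .8, .8], [.8, 1, .8], [.8, .8, 1]], size=200)
print(gaussian_copula_loglik(X))        # larger values indicate stronger joint dependence
```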

6.
We present runt pruning, a new clustering method that attempts to find modes of a density by analyzing the minimal spanning tree of a sample. The method exploits the connection between the minimal spanning tree and nearest neighbor density estimation. It makes no assumptions about the form of the density (e.g., a normal mixture) or about the geometric shapes of the clusters, and is computationally feasible for large data sets.  相似文献   
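
A simplified minimal-spanning-tree clustering sketch in the spirit of the abstract, not runt pruning itself: the MST is built with SciPy and its longest edges are removed, which amounts to single-linkage partitioning; the toy data are synthetic.

```python
# Build the minimal spanning tree of the sample, delete its heaviest edges, and take the
# resulting connected components as clusters.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clusters(X, n_clusters):
    D = squareform(pdist(X))
    mst = minimum_spanning_tree(D).toarray()
    if n_clusters > 1:
        edge_weights = mst[mst > 0]
        threshold = np.sort(edge_weights)[-(n_clusters - 1)]
        mst[mst >= threshold] = 0                # cut the (n_clusters - 1) longest edges
    _, labels = connected_components(mst, directed=False)
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
print(np.bincount(mst_clusters(X, 2)))
```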

7.
Incremental Classification with Generalized Eigenvalues
Supervised learning techniques are widely accepted methods for analyzing data in scientific and real-world problems. Most of these problems require fast and continuous acquisition of data, which are then used to train the learning system. Keeping such systems up to date can therefore become cumbersome. Various techniques have been devised in the field of machine learning to address this problem. In this study, we propose an algorithm that reduces the training data to a substantially smaller subset of the original training data for training a generalized eigenvalue classifier. The proposed method provides a constructive way to understand the influence of new training data on an existing classification function. We show through numerical experiments that this technique prevents the overfitting problem of earlier generalized eigenvalue classifiers, while promising classification performance comparable to state-of-the-art classification methods.
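
A hedged GEPSVM/ReGEC-style sketch of classification with generalized eigenvalues; the incremental training-set reduction proposed in the paper is not shown, and the regularization constant and toy data are assumptions.

```python
# Each class gets a hyperplane w'x - g = 0 that is close to its own points and far from the
# other class, found as the eigenvector of the smallest generalized eigenvalue; queries are
# assigned to the class whose hyperplane is nearer.
import numpy as np
from scipy.linalg import eigh

def fit_plane(A, B, delta=1e-3):
    # minimize ||[A -1][w;g]||^2 / ||[B -1][w;g]||^2 via a generalized eigenproblem
    Ae = np.hstack([A, -np.ones((A.shape[0], 1))])
    Be = np.hstack([B, -np.ones((B.shape[0], 1))])
    G = Ae.T @ Ae + delta * np.eye(Ae.shape[1])     # Tikhonov regularization
    H = Be.T @ Be + delta * np.eye(Be.shape[1])
    _, vecs = eigh(G, H)                            # symmetric generalized eigenproblem
    z = vecs[:, 0]                                  # eigenvector of the smallest eigenvalue
    return z[:-1], z[-1]                            # w, gamma

def predict(X, plane_a, plane_b):
    def dist(plane):
        w, g = plane
        return np.abs(X @ w - g) / np.linalg.norm(w)
    return np.where(dist(plane_a) <= dist(plane_b), 0, 1)

rng = np.random.default_rng(0)
A, B = rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))
plane_a, plane_b = fit_plane(A, B), fit_plane(B, A)
truth = np.r_[np.zeros(100), np.ones(100)]
print((predict(np.vstack([A, B]), plane_a, plane_b) == truth).mean())
```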

8.
Recognizing the successes of treed Gaussian process (TGP) models as an interpretable and thrifty model for nonparametric regression, we seek to extend the model to classification. Both treed models and Gaussian processes (GPs) have, separately, enjoyed great success in application to classification problems. An example of the former is Bayesian CART. In the latter, real-valued GP output may be utilized for classification via latent variables, which provide classification rules by means of a softmax function. We formulate a Bayesian model averaging scheme to combine these two models and describe a Monte Carlo method for sampling from the full posterior distribution with joint proposals for the tree topology and the GP parameters corresponding to latent variables at the leaves. We concentrate on efficient sampling of the latent variables, which is important to obtain good mixing in the expanded parameter space. The tree structure is particularly helpful for this task and also for developing an efficient scheme for handling categorical predictors, which commonly arise in classification problems. Our proposed classification TGP (CTGP) methodology is illustrated on a collection of synthetic and real data sets. We assess performance relative to existing methods and thereby show how CTGP is highly flexible, offers tractable inference, produces rules that are easy to interpret, and performs well out of sample.

9.
We propose using the integrated periodogram to classify time series. The method assigns a new time series to the group that minimizes the distance between the series' integrated periodogram and the group mean of integrated periodograms. Local computation of these periodograms allows the application of this approach to nonstationary time series. Since the integrated periodograms are curves, we apply functional data depth-based techniques to make the classification robust, which is a clear advantage over other competing procedures. The method provides small error rates for both simulated and real data. It improves on existing approaches and exhibits good computational behavior.
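
A minimal version of the stated classification rule, assuming stationary series of equal length, an L1 distance between curves, and no depth-based robustification; the series and group names are made up.

```python
# Summarize each series by its normalized integrated periodogram and assign a new series
# to the group whose mean integrated periodogram is closest.
import numpy as np

def integrated_periodogram(x):
    x = np.asarray(x, dtype=float) - np.mean(x)
    per = np.abs(np.fft.rfft(x)) ** 2 / len(x)     # periodogram ordinates
    cum = np.cumsum(per[1:])                       # drop the zero frequency
    return cum / cum[-1]                           # normalized integrated periodogram

def classify(series, groups):
    # groups: dict mapping label -> list of training series (equal length)
    means = {g: np.mean([integrated_periodogram(s) for s in members], axis=0)
             for g, members in groups.items()}
    ip = integrated_periodogram(series)
    return min(means, key=lambda g: np.sum(np.abs(ip - means[g])))

rng = np.random.default_rng(0)
white = [rng.normal(size=256) for _ in range(20)]
ar = []
for _ in range(20):                                # AR(1) series with coefficient 0.8
    e, x = rng.normal(size=256), np.zeros(256)
    for t in range(1, 256):
        x[t] = 0.8 * x[t - 1] + e[t]
    ar.append(x)
print(classify(ar[0], {"white": white[1:], "ar": ar[1:]}))
```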

10.
The rapid increase in the size of data sets makes clustering all the more important to capture and summarize the information, at the same time making clustering more difficult to accomplish. If model-based clustering is applied directly to a large data set, it can be too slow for practical application. A simple and common approach is to first cluster a random sample of moderate size, and then use the clustering model found in this way to classify the remainder of the objects. We show that, in its simplest form, this method may lead to unstable results. Our experiments suggest that a stable method with better performance can be obtained with two straightforward modifications to the simple sampling method: several tentative models are identified from the sample instead of just one, and several EM steps are used rather than just one E step to classify the full data set. We find that there are significant gains from increasing the size of the sample up to about 2,000, but not from further increases. These conclusions are based on the application of several alternative strategies to the segmentation of three different multispectral images, and to several simulated data sets.
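
A sketch of this workflow using scikit-learn's Gaussian mixture: fit on a random sample of about 2,000 points, then run a handful of warm-started EM iterations on the full data before classifying it. Choosing among several tentative sample models (e.g., by BIC) is omitted, and the data are synthetic.

```python
# Fit model-based clusters on a moderate random sample, then continue EM briefly on the
# full data set (warm-started from the sample solution) before classifying everything.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
full = np.vstack([rng.normal(0, 1, (50000, 3)), rng.normal(4, 1, (50000, 3))])

sample = full[rng.choice(len(full), size=2000, replace=False)]
gm = GaussianMixture(n_components=2, warm_start=True, max_iter=200, random_state=0)
gm.fit(sample)                 # EM on the sample only

gm.max_iter = 5                # a handful of EM steps on the full data set
gm.fit(full)                   # warm_start reuses the sample-based solution
print(np.bincount(gm.predict(full)))
```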

11.
A sequential fitting procedure for linear data analysis models
A particular factor analysis model with parameter constraints is generalized to include classification problems definable within a framework of fitting linear models. The sequential fitting (SEFIT) approach of principal component analysis is extended to include several nonstandard data analysis and classification tasks. SEFIT methods attempt to explain the variability in the initial data (commonly defined by a sum of squares) through an additive decomposition attributable to the various terms in the model. New methods are developed for both traditional and fuzzy clustering that have useful theoretic and computational properties (principal cluster analysis, additive clustering, and so on). Connections to several known classification strategies are also stated. The author is grateful to P. Arabie and L. J. Hubert for editorial assistance and reviewing going well beyond traditional levels.

12.
The framework of this paper is statistical data editing: how to edit or impute missing or contradictory data and how to merge two independent data sets that each present some lack of information. Assuming a missing at random mechanism, this paper provides an accurate tree-based methodology for both missing data imputation and data fusion that is justified within the Statistical Learning Theory of Vapnik. It considers both an incremental variable imputation method to improve computational efficiency and boosted trees to gain prediction accuracy with respect to other methods. As a result, the best approximation of the structural risk (also known as irreducible error) is reached, thus minimizing the generalization (or prediction) error of the imputation. Moreover, the method is distribution free: it holds independently of the underlying probability law generating the missing data values. Performance analysis is discussed considering simulation case studies and real-world applications.
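
This is not the authors' methodology, but a readily available analogue of boosted-tree imputation under a missing-at-random assumption: scikit-learn's IterativeImputer with a gradient-boosted regressor imputing each incomplete variable from the others; the toy data and missingness rate are made up.

```python
# Iterative imputation with a gradient-boosted tree regressor per incomplete variable.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the API)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:, 3] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 500)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.2] = np.nan           # 20% missing completely at random

imputer = IterativeImputer(estimator=HistGradientBoostingRegressor(random_state=0),
                           max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_miss)

mask = np.isnan(X_miss)
print(np.sqrt(np.mean((X_filled[mask] - X[mask]) ** 2)))   # RMSE on the imputed entries
```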

13.
When clustering asymmetric proximity data, often only the average amounts are considered, by assuming that the asymmetry is due to noise. But when the asymmetry is structural, as typically happens for exchange flows, migration data, or confusion data, this may strongly affect the search for the groups, because the directions of the exchanges are ignored and not integrated into the clustering process. The clustering model proposed here relies on the decomposition of the asymmetric dissimilarity matrix into symmetric and skew-symmetric effects, each further decomposed into within- and between-cluster effects. The classification structures used here are generally based on two different partitions of the objects, fitted to the symmetric and the skew-symmetric part of the data, respectively; the restricted case is also presented, where a single partition fits both of them jointly, allowing for clusters of objects that are similar with respect to both the average amounts and the directions of the data. Parsimonious models are presented which allow for effective and simple graphical representations of the results.
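
The decomposition at the core of the model can be written in two lines; the within/between-cluster modelling built on top of it is not shown, and the matrix below is a toy example.

```python
# Any asymmetric (dis)similarity matrix D splits uniquely into a symmetric part S
# (average amounts) and a skew-symmetric part K (directions/imbalances of the exchanges).
import numpy as np

D = np.array([[0., 5., 2.],
              [1., 0., 4.],
              [6., 3., 0.]])          # illustrative asymmetric dissimilarities

S = (D + D.T) / 2                     # symmetric effects
K = (D - D.T) / 2                     # skew-symmetric effects
assert np.allclose(S + K, D) and np.allclose(K, -K.T)
print(S, K, sep="\n")
```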

14.
Several techniques are given for the uniform generation of trees for use in Monte Carlo studies of clustering and tree representations. First, general strategies are reviewed for random selection from a set of combinatorial objects with special emphasis on two that use random mapping operations. Theorems are given on how the number of such objects in the set (e.g., whether the number is prime) affects which strategies can be used. Based on these results, methods are presented for the random generation of six types of binary unordered trees. Three types of labeling and both rooted and unrooted forms are considered. Presentation of each method includes the theory of the method, the generation algorithm, an analysis of its computational complexity and comments on the distribution of trees over which it samples. Formal proofs and detailed algorithms are in appendices.
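
As one concrete, classical instance of uniform tree generation (not one of the six binary-tree generators developed in the paper), decoding a uniformly random Prüfer sequence yields a labeled free tree drawn uniformly from all n^(n-2) trees on n vertices.

```python
# Uniform random labeled tree via Pruefer-sequence decoding.
import heapq
import random

def random_labeled_tree(n, rng=random):
    if n < 2:
        return []
    if n == 2:
        return [(0, 1)]
    seq = [rng.randrange(n) for _ in range(n - 2)]      # uniform Pruefer sequence
    degree = [1] * n
    for v in seq:
        degree[v] += 1
    leaves = [i for i in range(n) if degree[i] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in seq:
        leaf = heapq.heappop(leaves)                    # smallest current leaf
        edges.append((leaf, v))
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges

print(random_labeled_tree(8))
```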

15.
To reveal the structure underlying two-way two-mode object by variable data, Mirkin (1987) has proposed an additive overlapping clustering model. This model implies an overlapping clustering of the objects and a reconstruction of the data, with the reconstructed variable profile of an object being a summation of the variable profiles of the clusters it belongs to. Grasping the additive (overlapping) clustering structure of object by variable data may, however, be seriously hampered in case the data include a very large number of variables. To deal with this problem, we propose a new model that simultaneously clusters the objects in overlapping clusters and reduces the variable space; as such, the model implies that the cluster profiles and, hence, the reconstructed data profiles are constrained to lie in a low-dimensional space. An alternating least squares (ALS) algorithm to fit the new model to a given data set will be presented, along with a simulation study and an illustrative example that makes use of empirical data.

16.
Improvements to the dynamic programming (DP) strategy for partitioning (nonhierarchical classification) as discussed in Hubert, Arabie, and Meulman (2001) are proposed. First, it is shown how the number of evaluations in the DP process can be decreased without affecting generality. Both a completely nonredundant and a quasi-nonredundant method are proposed. Second, an efficient implementation of both approaches is discussed. This implementation is shown to have a dramatic increase in speed over the original program. The flexibility of the approach is illustrated by analyzing three data sets.
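
As a compact illustration of dynamic programming for partitioning, the sketch below solves only the special case of one-dimensional (ordered) data with a within-cluster sum-of-squares criterion; the nonredundant DP scheme of the paper is considerably more general.

```python
# DP over segment boundaries: cost[c, j] = best cost of splitting the first j sorted values
# into c contiguous clusters, using prefix sums for O(1) segment SSE evaluation.
import numpy as np

def dp_partition_1d(x, k):
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    s = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def sse(i, j):                       # within-cluster SSE of the segment x[i:j]
        seg = s[j] - s[i]
        return (s2[j] - s2[i]) - seg * seg / (j - i)

    cost = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = cost[c - 1, i] + sse(i, j)
                if cand < cost[c, j]:
                    cost[c, j], back[c, j] = cand, i
    bounds, j = [], n
    for c in range(k, 0, -1):
        i = back[c, j]
        bounds.append((i, j))
        j = i
    return cost[k, n], bounds[::-1]

print(dp_partition_1d([1, 2, 3, 10, 11, 12, 30, 31], 3))
```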

17.
We consider optimal variable weighting for ultrametric and additive tree clustering, as well as for K-means partitioning. We also describe some new features and improvements to the algorithm proposed by De Soete. Monte Carlo simulations have been conducted using different error conditions. In all cases (i.e., ultrametric or additive trees, or K-means partitioning), the simulation results indicate that the optimal weighting procedure should be used for analyzing data containing noisy variables that do not contribute relevant information to the classification structure. However, if the data involve error-perturbed variables that are relevant to the classification, or outliers, it seems better to cluster or partition the entities by using variables with equal weights. A new computer program, OVW, which is available to researchers as freeware, implements improved algorithms for optimal variable weighting for ultrametric and additive tree clustering, and includes a new algorithm for optimal variable weighting for K-means partitioning.

18.
A validation study of a variable weighting algorithm for cluster analysis
De Soete (1986, 1988) proposed a variable weighting procedure for use when Euclidean distance is the dissimilarity measure with an ultrametric hierarchical clustering method. The algorithm produces weighted distances which approximate ultrametric distances as closely as possible in a least squares sense. The present simulation study examined the effectiveness of the De Soete procedure for an application for which it was not originally intended: to determine whether or not the algorithm can be used to reduce the influence of variables which are irrelevant to the clustering present in the data. The simulation study examined the ability of the procedure to recover a variety of known underlying cluster structures. The results indicate that the algorithm is effective in identifying extraneous variables which do not contribute information about the true cluster structure. Weights near 0.0 were typically assigned to such extraneous variables. Furthermore, the variable weighting procedure was not adversely affected by the presence of other forms of error in the data. In general, it is recommended that the variable weighting procedure be used in applied analyses when Euclidean distance is employed with ultrametric hierarchical clustering methods.
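
A hedged sketch of the least-squares idea behind De Soete's procedure: for a given weight vector, the weighted Euclidean distances are compared with the cophenetic (ultrametric) distances of a hierarchical clustering fitted to them. De Soete's algorithm optimizes the weights; only the evaluation of such a criterion is shown here, with average linkage, the weight vectors, and the toy data as assumptions.

```python
# Evaluate a least-squares "departure from an ultrametric" criterion for a given weight vector.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

def weighting_criterion(X, w, method="average"):
    w = np.asarray(w, dtype=float)
    d = pdist(X * np.sqrt(w))                 # weighted Euclidean distances
    Z = linkage(d, method=method)             # hierarchical clustering -> ultrametric
    _, coph = cophenet(Z, d)
    return np.sum((d - coph) ** 2)            # least-squares departure from the ultrametric

rng = np.random.default_rng(0)
signal = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(3, 0.2, (20, 2))])
noise = rng.normal(0, 1, (40, 2))             # variables irrelevant to the clustering
X = np.hstack([signal, noise])
print(weighting_criterion(X, [1, 1, 1, 1]))   # equal weights
print(weighting_criterion(X, [1, 1, 0, 0]))   # extraneous variables weighted to 0
```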

19.
We devise a classification algorithm based on generalized linear mixed model (GLMM) technology. The algorithm incorporates spline smoothing, additive model-type structures and model selection. For reasons of speed we employ the Laplace approximation, rather than Monte Carlo methods. Tests on real and simulated data show the algorithm to have good classification performance. Moreover, the resulting classifiers are generally interpretable and parsimonious.
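
Not a GLMM with a Laplace approximation, but a simple additive, spline-based classifier in the same spirit, using scikit-learn's SplineTransformer with a ridge-penalized logistic regression; the data-generating model below is made up.

```python
# Each predictor enters through a B-spline expansion of a logistic regression, giving an
# interpretable additive fit; the L2 penalty plays the role of the smoother.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
p = 1 / (1 + np.exp(-(np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 - 1)))   # nonlinear additive truth
y = rng.binomial(1, p)

clf = make_pipeline(
    SplineTransformer(degree=3, n_knots=8),        # additive B-spline expansion per predictor
    LogisticRegression(C=1.0, max_iter=1000),      # ridge penalty acts as the smoother
)
clf.fit(X, y)
print(clf.score(X, y))
```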

20.
The Academic Journal Ranking Problem consists in formulating a formal assessment of scientific journals. An outcome variable must be constructed that allows valid journal comparison, either as a set of tiers (ordered classes) or as a numerical index. But part of the problem is also to devise a procedure to obtain this outcome, that is, how to gather and use relevant data coming from expert opinions or from citation databases. We propose a novel approach to the problem that applies fuzzy cluster analysis to peer reviews and opinion surveys. The procedure is composed of two steps: the first, inductive, step is to collect the most relevant qualitative assessments from international organizations (for example, those available in the Harzing database) and to apply fuzzy clustering to determine homogeneous journal classes; the second, deductive, step is to uncover the hidden logical rules that underlie the classification, using a classification tree to reproduce the patterns found in the first step.
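
A sketch of the two-step idea under stated assumptions: a plain textbook fuzzy c-means run on hypothetical numerical journal assessments, followed by a shallow classification tree refit on the hard class labels to expose interpretable rules; the feature names and scores are invented for illustration.

```python
# Step 1: fuzzy c-means on assessment scores; Step 2: a decision tree on the hard labels
# to make the induced classes' logic explicit.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))                 # initial memberships
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)               # standard FCM update
    return U, centers

rng = np.random.default_rng(1)
scores = np.vstack([rng.normal(8, 0.5, (30, 3)), rng.normal(4, 0.5, (30, 3))])  # toy data
U, _ = fuzzy_c_means(scores, c=2)
hard_labels = U.argmax(axis=1)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(scores, hard_labels)
print(export_text(tree, feature_names=["survey_A", "survey_B", "citations"]))
```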
