首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 467 毫秒
1.
We propose functional cluster analysis (FCA) for multidimensional functional data sets, utilizing orthonormalized Gaussian basis functions. An essential point in FCA is the use of orthonormal bases that yield the identity matrix for the integral of the product of any two bases. We construct orthonormalized Gaussian basis functions using Cholesky decomposition and derive a property of Cholesky decomposition with respect to Gram-Schmidt orthonormalization. The advantages of the functional clustering are that it can be applied to the data observed at different time points for each subject, and the functional structure behind the data can be captured by removing the measurement errors. Numerical experiments are conducted to investigate the effectiveness of the proposed method, as compared to conventional discrete cluster analysis. The proposed method is applied to three-dimensional (3D) protein structural data that determine the 3D arrangement of amino acids in individual protein.  相似文献   

2.
The Self-Organizing Feature Maps (SOFM; Kohonen 1984) algorithm is a well-known example of unsupervised learning in connectionism and is a clustering method closely related to the k-means. Generally the data set is available before running the algorithm and the clustering problem can be approached by an inertia criterion optimization. In this paper we consider the probabilistic approach to this problem. We propose a new algorithm based on the Expectation Maximization principle (EM; Dempster, Laird, and Rubin 1977). The new method can be viewed as a Kohonen type of EM and gives a better insight into the SOFM according to constrained clustering. We perform numerical experiments and compare our results with the standard Kohonen approach.  相似文献   

3.
A preliminary study of optimal variable weighting in k-means clustering   总被引:2,自引:0,他引:2  
Recently, algorithms for optimally weighting variables in non-hierarchical and hierarchical clustering methods have been proposed. Preliminary Monte Carlo research has shown that at least one of these algorithms cross-validates extremely well.The present study applies a k-means, optimal weighting procedure to two empirical data sets and contrasts its cross-validation performance with that of unit (i.e., equal) weighting of the variables. We find that the optimal weighting procedure cross-validates better in one of the two data sets. In the second data set its comparative performance strongly depends on the approach used to find seed values for the initial k-means partitioning.The authors would like to acknowledge the support of the Citibank Fellowship from the Sol C. Snider Entrepreneurial Center at the Wharton School. The authors would like to express their appreciation to J. Douglas Carroll and Abba M. Kreiger for comments on an earlier version of the paper.  相似文献   

4.
The mixture method of clustering applied to three-way data   总被引:3,自引:3,他引:0  
Clustering or classifying individuals into groups such that there is relative homogeneity within the groups and heterogeneity between the groups is a problem which has been considered for many years. Most available clustering techniques are applicable only to a two-way data set, where one of the modes is to be partitioned into groups on the basis of the other mode. Suppose, however, that the data set is three-way. Then what is needed is a multivariate technique which will cluster one of the modes on the basis of both of the other modes simultaneously. It is shown that by appropriate specification of the underlying model, the mixture maximum likelihood approach to clustering can be applied in the context of a three-way table. It is illustrated using a soybean data set which consists of multiattribute measurements on a number of genotypes each grown in several environments. Although the problem is set in the framework of clustering genotypes, the technique is applicable to other types of three-way data sets.  相似文献   

5.
In many statistical applications data are curves measured as functions of a continuous parameter as time. Despite of their functional nature and due to discrete-time observation, these type of data are usually analyzed with multivariate statistical methods that do not take into account the high correlation between observations of a single curve at nearby time points. Functional data analysis methodologies have been developed to solve these type of problems. In order to predict the class membership (multi-category response variable) associated to an observed curve (functional data), a functional generalized logit model is proposed. Base-line category logit formulations will be considered and their estimation based on basis expansions of the sample curves of the functional predictor and parameters. Functional principal component analysis will be used to get an accurate estimation of the functional parameters and to classify sample curves in the categories of the response variable. The good performance of the proposed methodology will be studied by developing an experimental study with simulated and real data.  相似文献   

6.
This paper introduces a novel mixture model-based approach to the simultaneous clustering and optimal segmentation of functional data, which are curves presenting regime changes. The proposed model consists of a finite mixture of piecewise polynomial regression models. Each piecewise polynomial regression model is associated with a cluster, and within each cluster, each piecewise polynomial component is associated with a regime (i.e., a segment). We derive two approaches to learning the model parameters: the first is an estimation approach which maximizes the observed-data likelihood via a dedicated expectation-maximization (EM) algorithm, then yielding a fuzzy partition of the curves into K clusters obtained at convergence by maximizing the posterior cluster probabilities. The second is a classification approach and optimizes a specific classification likelihood criterion through a dedicated classification expectation-maximization (CEM) algorithm. The optimal curve segmentation is performed by using dynamic programming. In the classification approach, both the curve clustering and the optimal segmentation are performed simultaneously as the CEM learning proceeds. We show that the classification approach is a probabilistic version generalizing the deterministic K-means-like algorithm proposed in Hébrail, Hugueney, Lechevallier, and Rossi (2010). The proposed approach is evaluated using simulated curves and real-world curves. Comparisons with alternatives including regression mixture models and the K-means-like algorithm for piecewise regression demonstrate the effectiveness of the proposed approach.  相似文献   

7.
The rapid increase in the size of data sets makes clustering all the more important to capture and summarize the information, at the same time making clustering more difficult to accomplish. If model-based clustering is applied directly to a large data set, it can be too slow for practical application. A simple and common approach is to first cluster a random sample of moderate size, and then use the clustering model found in this way to classify the remainder of the objects. We show that, in its simplest form, this method may lead to unstable results. Our experiments suggest that a stable method with better performance can be obtained with two straightforward modifications to the simple sampling method: several tentative models are identified from the sample instead of just one, and several EM steps are used rather than just one E step to classify the full data set. We find that there are significant gains from increasing the size of the sample up to about 2,000, but not from further increases. These conclusions are based on the application of several alternative strategies to the segmentation of three different multispectral images, and to several simulated data sets.  相似文献   

8.
In this paper two techniques for units clustering and factorial dimensionality reduction of variables and occasions of a three-mode data set are discussed. These techniques can be seen as the simultaneous version of two procedures based on the sequential application of k-means and Tucker2 algorithms and vice versa. The two techniques, T3Clus and 3Fk-means, have been compared theoretically and empirically by a simulation study. In the latter, it has been noted that neither T3Clus nor 3Fk-means outperforms the other in every case. From these results rises the idea to combine the two techniques in a unique general model, named CT3Clus, having T3Clus and 3Fk-means as special cases. A simulation study follows to show the effectiveness of the proposal.  相似文献   

9.
Traditional procedures for clustering time series are based mostly on crisp hierarchical or partitioning methods. Given that the dynamics of a time series may change over time, a time series might display patterns that may enable it to belong to one cluster over one period while over another period, its pattern may be more consistent with those in another cluster. The traditional clustering procedures are unable to identify the changing patterns over time. However, clustering based on fuzzy logic will be able to detect the switching patterns from one time period to another thus enabling some time series to simultaneously belong to more than one cluster. In particular, this paper proposes a fuzzy approach to the clustering of time series based on their variances through wavelet decomposition. We will show that this approach will distinguish between time series with different patterns in variability as well identifying time series with switching patterns in variability.  相似文献   

10.
The Kohonen self-organizing map method: An assessment   总被引:1,自引:0,他引:1  
The “self-organizing map” method, due to Kohonen, is a well-known neural network method. It is closely related to cluster analysis (partitioning) and other methods of data analysis. In this article, we explore some of these close relationships. A number of properties of the technique are discussed. Comparisons with various methods of data analysis (principal components analysis, k-means clustering, and others) are presented. This work has been partially supported for M. Hernández-Pajares by the DGCICIT of Spain under grant No. PB90-0478 and by a CESCA-1993 computer-time grant. Fionn Murtagh is affiliated to the Astrophysics Division, Space Science Department, European Space Agency.  相似文献   

11.
A random sample of sizeN is divided intok clusters that minimize the within clusters sum of squares locally. Some large sample properties of this k-means clustering method (ask approaches withN) are obtained. In one dimension, it is established that the sample k-means clusters are such that the within-cluster sums of squares are asymptotically equal, and that the sizes of the cluster intervals are inversely proportional to the one-third power of the underlying density at the midpoints of the intervals. The difficulty involved in generalizing the results to the multivariate case is mentioned.This research was supported in part by the National Science Foundation under Grant MCS75-08374. The author would like to thank John Hartigan and David Pollard for helpful discussions and comments.  相似文献   

12.
The more ways there are of understanding a clustering technique, the more effectively the results can be analyzed and used. I will give a general procedure, calledparameter modification, to obtain from a clustering criterion a variety of equivalent forms of the criterion. These alternative forms reveal aspects of the technique that are not necessarily apparent in the original formulation. This procedure is successful in improving the understanding of a significant number of clustering techniques.The insight obtained will be illustrated by applying parameter modification to partitioning, mixture and fuzzy clustering methods, resulting in a unified approach to the study of these methods and a general algorithm for optimizing them.The author wishes to thank Professor Doctor Hans-Hermann Bock for many stimulating discussions.  相似文献   

13.
Variable Selection for Clustering and Classification   总被引:2,自引:2,他引:0  
As data sets continue to grow in size and complexity, effective and efficient techniques are needed to target important features in the variable space. Many of the variable selection techniques that are commonly used alongside clustering algorithms are based upon determining the best variable subspace according to model fitting in a stepwise manner. These techniques are often computationally intensive and can require extended periods of time to run; in fact, some are prohibitively computationally expensive for high-dimensional data. In this paper, a novel variable selection technique is introduced for use in clustering and classification analyses that is both intuitive and computationally efficient. We focus largely on applications in mixture model-based learning, but the technique could be adapted for use with various other clustering/classification methods. Our approach is illustrated on both simulated and real data, highlighted by contrasting its performance with that of other comparable variable selection techniques on the real data sets.  相似文献   

14.
Optimization Strategies for Two-Mode Partitioning   总被引:2,自引:2,他引:0  
Two-mode partitioning is a relatively new form of clustering that clusters both rows and columns of a data matrix. In this paper, we consider deterministic two-mode partitioning methods in which a criterion similar to k-means is optimized. A variety of optimization methods have been proposed for this type of problem. However, it is still unclear which method should be used, as various methods may lead to non-global optima. This paper reviews and compares several optimization methods for two-mode partitioning. Several known methods are discussed, and a new fuzzy steps method is introduced. The fuzzy steps method is based on the fuzzy c-means algorithm of Bezdek (1981) and the fuzzy steps approach of Heiser and Groenen (1997) and Groenen and Jajuga (2001). The performances of all methods are compared in a large simulation study. In our simulations, a two-mode k-means optimization method most often gives the best results. Finally, an empirical data set is used to give a practical example of two-mode partitioning. We would like to thank two anonymous referees whose comments have improved the quality of this paper. We are also grateful to Peter Verhoef for providing the data set used in this paper.  相似文献   

15.
X is the automatic hierarchical classification of one mode (units or variables or occasions) of X on the basis of the other two. In this paper the case of OMC of units according to variables and occasions is discussed. OMC is the synthesis of a set of hierarchical classifications Delta obtained from X; e.g., the OMC of units is the consensus (synthesis) among the set of dendograms individually defined by clustering units on the basis of variables, separately for each given occasion of X. However, because Delta is often formed by a large number of classifications, it may be unrealistic that a single synthesis is representative of the entire set. In this case, subsets of similar (homegeneous) dendograms may be found in Delta so that a consensus representative of each subset may be identified. This paper proposes, PARtition and Least Squares Consensus cLassifications Analysis (PARLSCLA) of a set of r hierarchical classifications Delta. PARLSCLA identifies the best least-squares partition of Delta into m (1 <= m <= r) subsets of homogeneous dendograms and simultaneously detects the closest consensus classification (a median classification called Least Squares Consensus Dendogram (LSCD) for each subset. PARLSCLA is a generalization of the problem to find a least-squares consensus dendogram for Delta. PARLSCLA is formalized as a mixed-integer programming problem and solved with an iterative, two-step algorithm. The method proposed is applied to an empirical data set.  相似文献   

16.
This paper presents a Bayesian model based clustering approach for dichotomous item responses that deals with issues often encountered in model based clustering like missing data, large data sets and within cluster dependencies. The approach proposed will be illustrated using an example concerning Brand Strategy Research.  相似文献   

17.
The main aim of this work is the study of clustering dependent data by means of copula functions. Copulas are popular multivariate tools whose importance within clustering methods has not been investigated yet in detail. We propose a new algorithm (CoClust in brief) that allows to cluster dependent data according to the multivariate structure of the generating process without any assumption on the margins. Moreover, the approach does not require either to choose a starting classification or to set a priori the number of clusters; in fact, the CoClust selects them by using a criterion based on the log–likelihood of a copula fit. We test our proposal on simulated data for different dependence scenarios and compare it with a model–based clustering technique. Finally, we show applications of the CoClust to real microarray data of breast-cancer patients.  相似文献   

18.
The paper contains a proposal of interval data clustering related to given social and economic objects characterized by many interval variables. This multivariate approach is based on an original conception of interval quantiles constructed using a special definition derived from the notion of the Hausdorff distance. In order to improve the quality of classification, the obtained interval quantile classes can be next aggregated into larger merged classes. The efficiency of our method can be assessed using especially defined indices of entropy and volume coefficients. The second notion replaces the classical concept of area, which is not applicable in this case.  相似文献   

19.
Data in many different fields come to practitioners through a process naturally described as functional. We propose a classification procedure of oxidation curves. Our algorithm is based on two stages: fitting the functional data by linear splines with free knots and classifying the estimated knots which estimate useful oxidation parameters. A real data set on 57 oxidation curves is used to illustrate our approach.  相似文献   

20.
Multiple imputation is one of the most highly recommended procedures for dealing with missing data. However, to date little attention has been paid to methods for combining the results from principal component analyses applied to a multiply imputed data set. In this paper we propose Generalized Procrustes analysis for this purpose, of which its centroid solution can be used as a final estimate for the component loadings. Convex hulls based on the loadings of the imputed data sets can be used to represent the uncertainty due to the missing data. In two simulation studies, the performance of Generalized Procrustes approach is evaluated and compared with other methods. More specifically it is studied how these methods behave when order changes of components and sign reversals of component loadings occur, such as in case of near-equal eigenvalues, or data having almost as many counterindicative items as indicative items. The simulations show that other proposed methods either may run into serious problems or are not able to adequately assess the accuracy due to the presence of missing data. However, when the above situations do not occur, all methods will provide adequate estimates for the PCA loadings.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号