Similar articles
20 similar articles found (search time: 31 ms)
1.
The objective of this paper is to develop the maximum likelihood approach for analyzing a finite mixture of structural equation models with data that are missing at random. A Monte Carlo EM algorithm is proposed for obtaining the maximum likelihood estimates. A well-known statistic, the Bayesian Information Criterion (BIC), is used for model comparison. In the presence of missing data, the computation of the observed-data likelihood function value involved in the BIC is not straightforward. A procedure based on path sampling is developed to compute this function value. Simulation studies show that ignoring observations with missing entries gives less accurate ML estimates. An illustrative real example is also presented.

2.
MCLUST is a software package for model-based clustering, density estimation and discriminant analysis interfaced to the S-PLUS commercial software and the R language. It implements parameterized Gaussian hierarchical clustering algorithms and the EM algorithm for parameterized Gaussian mixture models with the possible addition of a Poisson noise term. Also included are functions that combine hierarchical clustering, EM and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering, density estimation, and discriminant analysis. MCLUST provides functionality for displaying and visualizing clustering and classification results. A web page with related links can be found at .

3.
The Self-Organizing Feature Maps (SOFM; Kohonen 1984) algorithm is a well-known example of unsupervised learning in connectionism and is a clustering method closely related to k-means. Generally the data set is available before running the algorithm and the clustering problem can be approached by an inertia criterion optimization. In this paper we consider the probabilistic approach to this problem. We propose a new algorithm based on the Expectation Maximization principle (EM; Dempster, Laird, and Rubin 1977). The new method can be viewed as a Kohonen type of EM and gives a better insight into the SOFM according to constrained clustering. We perform numerical experiments and compare our results with the standard Kohonen approach.

4.
A mixture likelihood approach for generalized linear models   Total citations: 6, self-citations: 0, citations by others: 6
A mixture model approach is developed that simultaneously estimates the posterior membership probabilities of observations to a number of unobservable groups or latent classes, and the parameters of a generalized linear model which relates the observations, distributed according to some member of the exponential family, to a set of specified covariates within each class. We demonstrate how this approach handles many of the existing latent class regression procedures as special cases, as well as a host of other parametric specifications in the exponential family heretofore not mentioned in the latent class literature. As such we generalize the McCullagh and Nelder approach to a latent class framework. The parameters are estimated using maximum likelihood, and an EM algorithm for estimation is provided. A Monte Carlo study of the performance of the algorithm for several distributions is provided, and the model is illustrated in two empirical applications.

5.
A probabilistic DEDICOM model was proposed for mobility tables. The model attempts to explain observed transition probabilities by a latent mobility table and a set of transition probabilities from latent classes to observed classes. The model captures asymmetry in observed mobility tables by asymmetric latent mobility tables. It may be viewed as a special case of both the latent class model and DEDICOM with special constraints. A maximum penalized likelihood (MPL) method was developed for parameter estimation. The EM algorithm was adapted for the MPL estimation. Two examples were given to illustrate the proposed method. The work reported in this paper has been supported by grant A6394 to the first author from the Natural Sciences and Engineering Research Council of Canada and by a fellowship of the Royal Netherlands Academy of Arts and Sciences to the second author. We would like to thank anonymous reviewers for their insightful comments.

6.
We present a new distance-based quartet method for phylogenetic tree reconstruction, called Minimum Tree Cost Quartet Puzzling. Starting from a distance matrix computed from natural data, the algorithm incrementally constructs a tree by adding one taxon at a time to the intermediary tree using a cost function based on the relaxed 4-point condition for weighting quartets. Different input orders of taxa lead to trees having distinct topologies which can be evaluated using a maximum likelihood or weighted least squares optimality criterion. Using reduced sets of quartets and a simple heuristic tree search strategy we obtain an overall complexity of O(n^5 log2 n) for the algorithm. We evaluate the performance of the method through comparative tests and show that our method outperforms NJ when a weighted least squares optimality criterion is employed. We also discuss the theoretical boundaries of the algorithm.

7.
This paper introduces a novel mixture model-based approach to the simultaneous clustering and optimal segmentation of functional data, which are curves presenting regime changes. The proposed model consists of a finite mixture of piecewise polynomial regression models. Each piecewise polynomial regression model is associated with a cluster, and within each cluster, each piecewise polynomial component is associated with a regime (i.e., a segment). We derive two approaches to learning the model parameters: the first is an estimation approach which maximizes the observed-data likelihood via a dedicated expectation-maximization (EM) algorithm, then yielding a fuzzy partition of the curves into K clusters obtained at convergence by maximizing the posterior cluster probabilities. The second is a classification approach and optimizes a specific classification likelihood criterion through a dedicated classification expectation-maximization (CEM) algorithm. The optimal curve segmentation is performed by using dynamic programming. In the classification approach, both the curve clustering and the optimal segmentation are performed simultaneously as the CEM learning proceeds. We show that the classification approach is a probabilistic version generalizing the deterministic K-means-like algorithm proposed in Hébrail, Hugueney, Lechevallier, and Rossi (2010). The proposed approach is evaluated using simulated curves and real-world curves. Comparisons with alternatives including regression mixture models and the K-means-like algorithm for piecewise regression demonstrate the effectiveness of the proposed approach.

8.
The main aim of this work is the study of clustering dependent data by means of copula functions. Copulas are popular multivariate tools whose importance within clustering methods has not yet been investigated in detail. We propose a new algorithm (CoClust in brief) that makes it possible to cluster dependent data according to the multivariate structure of the generating process without any assumption on the margins. Moreover, the approach requires neither choosing a starting classification nor setting the number of clusters a priori; in fact, CoClust selects them by using a criterion based on the log-likelihood of a copula fit. We test our proposal on simulated data for different dependence scenarios and compare it with a model-based clustering technique. Finally, we show applications of CoClust to real microarray data of breast-cancer patients.

9.
An approach to a posteriori blockmodeling for graphs is proposed. The model assumes that the vertices of the graph are partitioned into two unknown blocks and that the probability of an edge between two vertices depends only on the blocks to which they belong. Statistical procedures are derived for estimating the probabilities of edges and for predicting the block structure from observations of the edge pattern only. ML estimators can be computed using the EM algorithm, but this strategy is practical only for small graphs. A Bayesian estimator, based on Gibbs sampling, is proposed. This estimator is also practical for large graphs. When ML estimators are used, the block structure can be predicted based on predictive likelihood. When Gibbs sampling is used, the block structure can be predicted from posterior predictive probabilities. A side result is that when the number of vertices tends to infinity while the probabilities remain constant, the block structure can be recovered correctly with probability tending to 1.

10.
A maximum likelihood methodology for clusterwise linear regression   Total citations: 9, self-citations: 0, citations by others: 9
This paper presents a conditional mixture, maximum likelihood methodology for performing clusterwise linear regression. This new methodology simultaneously estimates separate regression functions and membership in K clusters or groups. Related procedures are reviewed and critiqued. The conditional mixture, maximum likelihood methodology is introduced together with the E-M algorithm utilized for parameter estimation. A Monte Carlo analysis is performed via a fractional factorial design to examine the performance of the procedure. Next, a marketing application is presented concerning the evaluations of trade show performance by senior marketing executives. Finally, other potential applications and directions for future research are identified.

11.
A permutation-based algorithm for block clustering   Total citations: 2, self-citations: 1, citations by others: 1
Hartigan (1972) discusses the direct clustering of a matrix of data into homogeneous blocks. He introduces a stepwise divisive method for block clustering within a certain class of block structures which induce clustering trees for both row and column margins. While this class of structures is appealing, the stopping criterion for his method, which is based on asymptotic theory and the assumption that the individual elements of the data matrix are normally distributed, is quite restrictive. In this paper we propose a permutation-based algorithm for block clustering within the same class of block structures. By using permutation arguments to decide where to split and when to stop, our algorithm becomes applicable in a wide variety of cases, including matrices of categorical data and matrices of small-to-moderate size. In addition, our algorithm offers considerable flexibility in how block homogeneity is defined. The algorithm is studied in a series of simulation experiments on matrices of known structure, and illustrated in examples drawn from the fields of taxonomy, political science, and data architecture.

12.
This paper develops a new procedure for simultaneously performing multidimensional scaling and cluster analysis on two-way compositional data of proportions. The objective of the proposed procedure is to delineate patterns of variability in compositions across subjects by simultaneously clustering subjects into latent classes or groups and estimating a joint space of stimulus coordinates and class-specific vectors in a multidimensional space. We use a conditional mixture, maximum likelihood framework with an E-M algorithm for parameter estimation. The proposed procedure is illustrated using a compositional data set reflecting proportions of viewing time across television networks for an area sample of households.

13.
We describe a simple time series transformation to detect differences in series that can be accurately modelled as stationary autoregressive (AR) processes. The transformation involves forming the histogram of above and below the mean run lengths. The run length (RL) transformation has the benefits of being very fast, compact and updatable for new data in constant time. Furthermore, it can be generated directly from data that has already been highly compressed. We first establish the theoretical asymptotic relationship between run length distributions and AR models through consideration of the zero crossing probability and the distribution of runs. We benchmark our transformation against two alternatives: the truncated Autocorrelation function (ACF) transform and the AR transformation, which involves the standard method of fitting the partial autocorrelation coefficients with the Durbin-Levinson recursions and using the Akaike Information Criterion stopping procedure. Whilst optimal in the idealized scenario, representing the data in these ways is time consuming and the representation cannot be updated online for new data. We show that for classification problems the accuracy obtained through using the run length distribution tends towards that obtained from using the full fitted models. We then propose three alternative distance measures for run length distributions based on Gower’s general similarity coefficient, the likelihood ratio and dynamic time warping (DTW). Through simulated classification experiments we show that a nearest neighbour distance based on DTW converges to the optimal faster than classifiers based on Euclidean distance, Gower’s coefficient and the likelihood ratio. 
We experiment with a variety of classifiers and demonstrate that although the RL transform requires more data than the best performing classifier to achieve the same accuracy as AR or ACF, this factor is at worst non-increasing with the series length, m, whereas the relative time taken to fit AR and ACF increases with m. We conclude that if the data is stationary and can be suitably modelled by an AR series, and if time is an important factor in reaching a discriminatory decision, then the run length distribution transform is a simple and effective transformation to use.
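
The run length (RL) transformation described in this abstract, a histogram of the lengths of runs above and below the mean, can be sketched in a few lines. This is only an illustrative reading of the abstract, not the authors' implementation: the function name `run_length_histogram`, the `max_run` parameter, and the choice to count values equal to the mean as "below" are assumptions.

```python
from collections import Counter
from itertools import groupby


def run_length_histogram(series, max_run=None):
    """Histogram of lengths of consecutive runs above or below the series mean.

    Entry k-1 of the result counts the runs of length k. Values equal to the
    mean are treated as "below" (an assumption made for this sketch).
    """
    mean = sum(series) / len(series)
    # Label each point by the side of the mean it falls on, then measure
    # the length of each maximal block of identical labels.
    runs = [len(list(group)) for _, group in groupby(x > mean for x in series)]
    counts = Counter(runs)
    # Runs longer than max_run (when given) are simply left out of the result.
    longest = max_run if max_run is not None else max(runs)
    return [counts.get(length, 0) for length in range(1, longest + 1)]
```

Because the histogram depends only on run lengths, appending new observations only changes the count of the final run, which is consistent with the abstract's remark that the representation is cheap to update; note that this sketch recomputes the mean from scratch and does not implement such an online update.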

14.
A common approach to deal with missing values in multivariate exploratory data analysis consists in minimizing the loss function over all non-missing elements, which can be achieved by EM-type algorithms where an iterative imputation of the missing values is performed during the estimation of the axes and components. This paper proposes such an algorithm, named iterative multiple correspondence analysis, to handle missing values in multiple correspondence analysis (MCA). The algorithm, based on an iterative PCA algorithm, is described and its properties are studied. We point out the overfitting problem and propose a regularized version of the algorithm to overcome this major issue. Finally, performances of the regularized iterative MCA algorithm (implemented in the R-package named missMDA) are assessed from both simulations and a real dataset. Results are promising with respect to other methods such as the missing-data passive modified margin method, an adaptation of the missing passive method used in Gifi’s Homogeneity analysis framework.

15.
Using a natural metric on the space of networks, we define a probability measure for network-valued random variables. This measure is indexed by two parameters, which are interpretable as a location parameter and a dispersion parameter. From this structure, one can develop maximum likelihood estimates, hypothesis tests and confidence regions, all in the context of independent and identically distributed networks. The value of this perspective is illustrated through application to portions of the friendship cognitive social structure data gathered by Krackhardt (1987). We thank Ove Frank, David Krackhardt, the editor and the referees for their constructive comments and suggestions.

16.
Consider N entities to be classified (e.g., geographical areas), a matrix of dissimilarities between pairs of entities, and a graph H with vertices associated with these entities such that the edges join the vertices corresponding to contiguous entities. The split of a cluster is the smallest dissimilarity between an entity of this cluster and an entity outside of it. The single-linkage algorithm (ignoring contiguity between entities) provides partitions into M clusters for which the smallest split of the clusters, called split of the partition, is maximum. We study here the partitioning of the set of entities into M connected clusters for all M between N - 1 and 2 (i.e., clusters such that the subgraphs of H induced by their corresponding sets of entities are connected) with maximum split subject to that condition. We first provide an exact algorithm with a (N^2) complexity for the particular case in which H is a tree. This algorithm suggests in turn a first heuristic algorithm for the general problem. Several variants of this heuristic are also explored. We then present an exact algorithm for the general case based on iterative determination of cocycles of subtrees and on the solution of auxiliary set covering problems. As solution of the latter problems is time-consuming for large instances, we provide another heuristic in which the auxiliary set covering problems are solved approximately. Computational results obtained with the exact and heuristic algorithms are presented on test problems from the literature.
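
The two definitions this abstract relies on, the split of a cluster and the split of a partition, translate directly into code. The sketch below only evaluates those quantities for a given partition; it does not implement the paper's exact or heuristic algorithms, and it ignores the connectivity graph H. The function names and the list-of-lists dissimilarity matrix are assumptions made for illustration.

```python
def cluster_split(dissim, cluster):
    """Smallest dissimilarity between an entity in `cluster` and one outside it."""
    inside = set(cluster)
    outside = [j for j in range(len(dissim)) if j not in inside]
    return min(dissim[i][j] for i in inside for j in outside)


def partition_split(dissim, partition):
    """Split of a partition: the smallest split among its clusters."""
    return min(cluster_split(dissim, cluster) for cluster in partition)
```

Comparing `partition_split` across candidate partitions with the same number of clusters shows which one is better under the maximum-split criterion.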

17.
Clustering criteria for discrete data and latent class models   Total citations: 1, self-citations: 0, citations by others: 1
We show that a well-known clustering criterion for discrete data, the information criterion, is closely related to the classification maximum likelihood criterion for the latent class model. This relation can be derived from the Bryant-Windham construction. Emphasis is placed on binary clustering criteria which are analyzed under the maximum likelihood approach for different multivariate Bernoulli mixtures. This alternative form of criterion reveals non-apparent aspects of clustering techniques. All the criteria discussed can be optimized with the alternating optimization algorithm. Some illustrative applications are included.

18.
We propose and discuss improved Bayes rules to discriminate between two populations using ordered predictors. To address the problem we propose an alternative formulation using a latent space that allows the information about the order to be introduced into the theoretical rules. The rules are first defined when the marginal densities are fully known and then under normality when the parameters are unknown and training samples are available. Several numerical examples and simulations in the paper illustrate the methodology and show that the new rules handle the information appropriately. We compare the new rules with the classical Bayes and Fisher rules in these examples and we show that the misclassification probability is smaller for the new rules. The method is also applied to data from a diabetes study where we again show that the new rules improve over the usual Fisher rule. Research partially supported by Spanish DGES and by PAPIJCL. The authors thank the editor and an anonymous reviewer for their detailed reading that resulted in this much improved version of the paper.

19.
Finite mixture modeling is a popular statistical technique capable of accounting for various shapes in data. One popular application of mixture models is model-based clustering. This paper considers the problem of clustering regression autoregressive moving average time series. Two novel estimation procedures for the considered framework are developed. The first one yields the conditional maximum likelihood estimates, which can be used when the length of the time series is substantial. Simple analytical expressions make fast parameter estimation possible. The second method incorporates the Kalman filter and yields the exact maximum likelihood estimates. The procedure for assessing variability in obtained estimates is discussed. We also show that the Bayesian information criterion can be successfully used to choose the optimal number of mixture components and correctly assess time series orders. The performance of the developed methodology is evaluated on simulation studies. An application to the analysis of tree ring data is thoroughly considered. The results are very promising as the proposed approach overcomes the limitations of other methods developed so far.
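
Several abstracts in this list, including this one, use the Bayesian information criterion to choose the number of mixture components. A minimal sketch of that selection step follows, assuming the "smaller is better" convention BIC = -2 log L + k log n (some software reports the negative of this quantity, in which case larger is better). The function names and the shape of the `candidates` dictionary are illustrative assumptions, not taken from any of the papers.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'smaller is better' convention: -2 log L + k log n."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def select_by_bic(candidates, n_obs):
    """Return the label of the candidate model with the smallest BIC.

    `candidates` maps a label (e.g. the number of mixture components) to a
    (maximized log-likelihood, number of free parameters) pair.
    """
    return min(candidates, key=lambda label: bic(*candidates[label], n_obs))
```

The log-likelihoods are expected to come from models already fitted to the same data; the sketch covers only the comparison, not the fitting.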

20.
Complete linkage as a multiple stopping rule for single linkage clustering   Total citations: 2, self-citations: 2, citations by others: 0
Two commonly used clustering criteria are single linkage, which maximizes the minimum distance between clusters, and complete linkage, which minimizes the maximum distance within a cluster. By synthesizing these criteria, partitions of objects are sought which maximize a combined measure of the minimum distance between clusters and the maximum distance within a cluster. Each combined measure is shown to select a partition in the single linkage hierarchy. Therefore, in effect, complete linkage is used to provide a stopping rule for single linkage. An algorithm is outlined which uses the distance between each pair of objects twice only. To illustrate the method, an example is given using 23 Glamorganshire soil profiles.
