20 similar documents were found; search time: 0 ms.
1.
Clustering Functional Data (total citations: 1; self-citations: 0; citations by others: 1)
2.
Simultaneous Component and Clustering Models for Three-way Data: Within and Between Approaches (total citations: 2; self-citations: 0; citations by others: 2)
In this paper, two techniques for clustering the units and reducing the factorial dimensionality of the variables and occasions of a three-mode data set are discussed. These techniques can be seen as simultaneous versions of two procedures based on the sequential application of the k-means and Tucker2 algorithms, in either order. The two techniques, T3Clus and 3Fk-means, have been compared theoretically and empirically in a simulation study, which showed that neither T3Clus nor 3Fk-means outperforms the other in every case. These results motivate combining the two techniques in a single general model, named CT3Clus, which has T3Clus and 3Fk-means as special cases. A further simulation study shows the effectiveness of the proposal.
3.
Normal mixture models are widely used for statistical modeling of data, including cluster analysis. However, maximum likelihood estimation (MLE) for normal mixtures using the EM algorithm may fail as the result of singularities or degeneracies. To avoid this, we propose replacing the MLE by a maximum a posteriori (MAP) estimator, also found by the EM algorithm. For choosing the number of components and the model parameterization, we propose a modified version of BIC in which the likelihood is evaluated at the MAP instead of the MLE. We use a highly dispersed proper conjugate prior containing a small fraction of one observation's worth of information. The resulting method avoids degeneracies and singularities, but when these are not present it gives similar results to the standard method using MLE, EM, and BIC.
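The abstract gives no implementation; as a rough illustration of the same idea (guarding EM against singular covariance estimates while selecting the number of components with BIC), the Python sketch below uses scikit-learn's GaussianMixture. Here reg_covar is a simple ridge-style regularizer standing in for the conjugate-prior MAP estimator described above, and the data and parameter values are hypothetical.

```python
# Sketch: choose the number of mixture components with BIC while regularizing
# covariances so EM does not collapse onto singular solutions.
# reg_covar is only a stand-in for the conjugate-prior MAP of the abstract.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(4, 0.5, size=(150, 2))])   # toy two-cluster data

best_bic, best_model = np.inf, None
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, covariance_type="full",
                         reg_covar=1e-3,              # keeps covariances non-singular
                         n_init=5, random_state=0).fit(X)
    bic = gm.bic(X)
    if bic < best_bic:
        best_bic, best_model = bic, gm

print("selected components:", best_model.n_components, "BIC:", round(best_bic, 1))
```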
4.
Ron Wehrens, Lutgarde M.C. Buydens, Chris Fraley, Adrian E. Raftery. Journal of Classification, 2004, 21(2): 231-253
The rapid increase in the size of data sets makes clustering all the more important to capture and summarize the information, while at the same time making clustering more difficult to accomplish. If model-based clustering is applied directly to a large data set, it can be too slow for practical application. A simple and common approach is to first cluster a random sample of moderate size, and then use the clustering model found in this way to classify the remainder of the objects. We show that, in its simplest form, this method may lead to unstable results. Our experiments suggest that a stable method with better performance can be obtained with two straightforward modifications to the simple sampling method: several tentative models are identified from the sample instead of just one, and several EM steps are used rather than just one E step to classify the full data set. We find that there are significant gains from increasing the size of the sample up to about 2,000, but not from further increases. These conclusions are based on the application of several alternative strategies to the segmentation of three different multispectral images and to several simulated data sets.
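A minimal Python sketch of the sample-then-extend strategy described above: fit a mixture on a random sample (roughly 2,000 points), then run a few additional EM iterations on the full data starting from the sampled solution. scikit-learn's GaussianMixture is used as a stand-in for the model-based clustering software in the paper; the sample size, component count, and iteration counts are illustrative assumptions.

```python
# Sketch: cluster a large data set by fitting on a sample, then refining on the
# full data with a few EM steps warm-started from the sample fit.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X_full = np.vstack([rng.normal(-3, 1, size=(50_000, 3)),
                    rng.normal(3, 1, size=(50_000, 3))])   # toy "large" data

# 1) Fit on a moderate random sample (~2,000 observations).
idx = rng.choice(len(X_full), size=2_000, replace=False)
gm_sample = GaussianMixture(n_components=2, n_init=5, random_state=0)
gm_sample.fit(X_full[idx])

# 2) A few EM steps on the full data, initialized at the sample solution.
gm_full = GaussianMixture(
    n_components=2, max_iter=5,
    weights_init=gm_sample.weights_,
    means_init=gm_sample.means_,
    precisions_init=np.linalg.inv(gm_sample.covariances_),
)
labels = gm_full.fit_predict(X_full)
print("cluster sizes:", np.bincount(labels))
```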
5.
6.
This paper studies the problem of estimating the number of clusters in the context of logistic regression clustering. The classification likelihood approach is employed to tackle this problem. A model-selection-based criterion for selecting the number of logistic curves is proposed, and its asymptotic properties are also considered. The small-sample performance of the proposed criterion is studied by Monte Carlo simulation. In addition, a real data example is presented. The authors would like to thank the editor, Prof. Willem J. Heiser, and the anonymous referees for the valuable comments and suggestions, which have led to the improvement of this paper.
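The abstract does not give the criterion's exact form, so the following sketch is only an illustration of the general idea: K logistic curves are fitted by a classification-EM style alternation of hard assignment and refitting, and values of K are compared with a BIC-like penalty on the classification log-likelihood. The penalty term, data, and iteration settings are assumptions, not the authors' criterion.

```python
# Sketch: classification-likelihood clustering of logistic curves, with a
# BIC-style penalty used to compare different numbers of clusters K.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=600).reshape(-1, 1)
true = rng.random(600) < 0.5
p = 1 / (1 + np.exp(-np.where(true, 3 * x[:, 0] - 1, -2 * x[:, 0] + 1)))
y = (rng.random(600) < p).astype(int)          # two latent logistic curves

def classification_criterion(x, y, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.integers(0, K, size=len(y))        # random initial hard assignment
    models = [None] * K
    for _ in range(n_iter):
        for k in range(K):                     # M-step: refit each curve
            if np.sum(z == k) < 5 or len(np.unique(y[z == k])) < 2:
                continue
            models[k] = LogisticRegression().fit(x[z == k], y[z == k])
        ll = np.full((len(y), K), -np.inf)
        for k in range(K):                     # C-step: reassign by likelihood
            if models[k] is not None:
                prob = models[k].predict_proba(x)[:, 1]
                ll[:, k] = np.log(np.where(y == 1, prob, 1 - prob) + 1e-12)
        z = ll.argmax(axis=1)
    cl_loglik = ll.max(axis=1).sum()
    n_params = K * 2                           # slope + intercept per curve
    return -2 * cl_loglik + n_params * np.log(len(y))   # BIC-like penalty

for K in (1, 2, 3):
    print(K, round(classification_criterion(x, y, K), 1))
```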
7.
Dealing with Distances and Transformations for Fuzzy C-Means Clustering of Compositional Data (total citations: 1; self-citations: 0; citations by others: 1)
Javier Palarea-Albaladejo, Josep Antoni Martín-Fernández, Jesús A. Soto. Journal of Classification, 2012, 29(2): 144-169
Clustering techniques are based upon a dissimilarity or distance measure between objects and clusters. This paper focuses on the simplex space, whose elements (compositions) are subject to non-negativity and constant-sum constraints. Any data analysis involving compositions should fulfill two main principles: scale invariance and subcompositional coherence. Among fuzzy clustering methods, the FCM algorithm is broadly applied in a variety of fields, but it is not well behaved when dealing with compositions. Here, the adequacy of different dissimilarities in the simplex, together with the behavior of the common log-ratio transformations, is discussed on the basis of compositional principles. As a result, a well-founded strategy for FCM clustering of compositions is suggested. Theoretical findings are accompanied by numerical evidence, and a detailed account of our proposal is provided. Finally, a case study is illustrated using a nutritional data set known in the clustering literature.
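The abstract argues for combining a log-ratio transformation with fuzzy c-means. The sketch below applies a centred log-ratio (clr) transform and then a plain Euclidean fuzzy c-means loop; this is only one of the strategies discussed above, written as a self-contained illustration, and the fuzzifier m, the data, and the convergence settings are arbitrary choices.

```python
# Sketch: fuzzy c-means on compositional data after a centred log-ratio (clr)
# transform, so the clustering respects the scale invariance of compositions.
import numpy as np

def clr(X):
    """Centred log-ratio transform of strictly positive compositions."""
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))            # membership matrix
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]      # weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2 / (m - 1)))                # standard FCM update
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.max(np.abs(U_new - U)) < tol:
            U = U_new
            break
        U = U_new
    return centers, U

rng = np.random.default_rng(3)
raw = np.vstack([rng.dirichlet([8, 1, 1], 100), rng.dirichlet([1, 1, 8], 100)])
centers, U = fuzzy_cmeans(clr(raw), c=2)
print("hard labels of first 5 compositions:", U.argmax(axis=1)[:5])
```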
8.
9.
The objective is to group objects into T clusters, based on J distinct, contributory partitions (or, equivalently, J polytomous attributes). We describe a new model/algorithm for implementing this objective. The method's objective function incorporates a modified Rand measure, both in initial cluster selection and in subsequent refinement of the starting partition. The method is applied to both synthetic and real data. The performance of the proposed model is compared to latent class analysis of the same data set.
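The model/algorithm itself is not reproduced in the abstract. As a small illustration of the kind of objective involved, the sketch below scores a candidate consensus partition by its average adjusted Rand index against J given partitions, using scikit-learn's adjusted_rand_score as a stand-in for the modified Rand measure mentioned above; all data here are hypothetical.

```python
# Sketch: score candidate consensus partitions by their average (adjusted)
# Rand agreement with J contributory partitions of the same objects.
import numpy as np
from sklearn.metrics import adjusted_rand_score

# J = 3 hypothetical partitions (polytomous attributes) of 8 objects.
partitions = np.array([
    [0, 0, 0, 1, 1, 1, 2, 2],
    [0, 0, 1, 1, 1, 2, 2, 2],
    [0, 0, 0, 0, 1, 1, 2, 2],
])

def consensus_score(candidate, partitions):
    """Average adjusted Rand index between a candidate and all partitions."""
    return np.mean([adjusted_rand_score(p, candidate) for p in partitions])

candidates = {
    "A": [0, 0, 0, 1, 1, 1, 2, 2],
    "B": [0, 1, 2, 0, 1, 2, 0, 1],
}
for name, cand in candidates.items():
    print(name, round(consensus_score(cand, partitions), 3))
```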
10.
11.
In this study, we consider the type of interval data that summarize original samples (individuals) given as classical point data. This type of interval data is termed interval symbolic data in a new research domain called symbolic data analysis. Most existing research, such as the (centre, radius) and [lower boundary, upper boundary] representations, represents an interval using only its boundaries. However, these representations hold true only under the assumption that the individuals contained in the interval follow a uniform distribution. In practice, such representations may result not only in inconsistency with the facts, since in many applications the individuals are not uniformly distributed, but also in information loss, because the point data within the intervals are not considered during the calculation. In this study, we propose a new representation of interval symbolic data that takes into account the point data contained in the intervals. We then apply the city-block distance metric to the new representation and propose a dynamic clustering approach for interval symbolic data. A simulation experiment is conducted to evaluate the performance of our method. The results show that, when the individuals contained in the interval do not follow a uniform distribution, the proposed method significantly outperforms the Hausdorff and city-block distances based on the traditional representation in the context of dynamic clustering. Finally, we give an application example using the automobile data set.
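The new representation proposed in the paper is not spelled out in the abstract, so the sketch below only illustrates the traditional baseline it is compared against: dynamic clustering (a k-means-style alternation) of intervals represented by their [lower, upper] boundaries under the city-block distance, where the L1-optimal prototypes are coordinate-wise medians. The data are made up for illustration.

```python
# Sketch: dynamic clustering of interval data [lower, upper] with the
# city-block (L1) distance; prototypes are coordinate-wise medians.
import numpy as np

def dynamic_cluster_intervals(I, k, n_iter=50, seed=0):
    """I: array of shape (n, 2) holding [lower, upper] bounds per object."""
    rng = np.random.default_rng(seed)
    protos = I[rng.choice(len(I), size=k, replace=False)]
    labels = np.zeros(len(I), dtype=int)
    for it in range(n_iter):
        # Allocation step: nearest prototype under the city-block distance.
        d = np.abs(I[:, None, :] - protos[None]).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Representation step: L1-optimal prototypes are medians.
        for j in range(k):
            if np.any(labels == j):
                protos[j] = np.median(I[labels == j], axis=0)
    return labels, protos

rng = np.random.default_rng(4)
lows = np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)])
widths = rng.uniform(0.5, 2.0, 200)
I = np.column_stack([lows, lows + widths])
labels, protos = dynamic_cluster_intervals(I, k=2)
print("cluster sizes:", np.bincount(labels), "\nprototypes:\n", protos)
```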
12.
As data sets continue to grow in size and complexity, effective and efficient techniques are needed to target important features in the variable space. Many of the variable selection techniques that are commonly used alongside clustering algorithms are based upon determining the best variable subspace according to model fitting in a stepwise manner. These techniques are often computationally intensive and can require extended periods of time to run; in fact, some are prohibitively computationally expensive for high-dimensional data. In this paper, a novel variable selection technique is introduced for use in clustering and classification analyses that is both intuitive and computationally efficient. We focus largely on applications in mixture model-based learning, but the technique could be adapted for use with various other clustering/classification methods. Our approach is illustrated on both simulated and real data, highlighted by contrasting its performance with that of other comparable variable selection techniques on the real data sets.
13.
A column generation based approach is proposed for solving the cluster-wise regression problem. The proposed strategy relies first on several efficient heuristics to insert columns into the restricted master problem. If these heuristics fail to identify an improving column, an exhaustive search is performed starting with incrementally larger ending subsets, all the while iteratively performing heuristic optimization to ensure a proper balance of exact and heuristic optimization. Additionally, observations are sequenced by their dual variables and by their inclusion in joint pair branching rules. The proposed strategy is shown to outperform the best known alternative (BBHSE) when the number of clusters is greater than three. The current work further demonstrates and expands the successful use of the new paradigm of using incrementally larger ending subsets to strengthen the lower bounds of a branch and bound search, as pioneered by Brusco's Repetitive Branch and Bound Algorithm (RBBA).
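Column generation itself is beyond the scope of a short sketch. To make the underlying cluster-wise regression objective concrete, the following Python code uses a simple alternating heuristic instead: assign each observation to the regression line with the smallest squared residual, then refit each line by least squares. This is explicitly not the exact method of the paper, and the data and cluster count are assumptions.

```python
# Sketch: cluster-wise regression by alternating residual-based assignment
# and per-cluster least-squares refitting (a heuristic, not column generation).
import numpy as np

def clusterwise_regression(x, y, k, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(y))
    X = np.column_stack([np.ones_like(x), x])        # intercept + slope design
    coefs = np.zeros((k, 2))
    for _ in range(n_iter):
        for j in range(k):                           # refit each cluster's line
            if np.sum(labels == j) >= 2:
                coefs[j], *_ = np.linalg.lstsq(X[labels == j], y[labels == j],
                                               rcond=None)
        resid = (y[:, None] - X @ coefs.T) ** 2      # squared residual per line
        new_labels = resid.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, coefs

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 300)
true = rng.integers(0, 2, 300)
y = np.where(true == 0, 2 * x + 1, -1.5 * x + 20) + rng.normal(0, 1, 300)
labels, coefs = clusterwise_regression(x, y, k=2)
print("fitted lines (intercept, slope):\n", np.round(coefs, 2))
```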
14.
Clustering criteria for discrete data and latent class models (total citations: 1; self-citations: 0; citations by others: 1)
We show that a well-known clustering criterion for discrete data, the information criterion, is closely related to the classification maximum likelihood criterion for the latent class model. This relation can be derived from the Bryant-Windham construction. Emphasis is placed on binary clustering criteria, which are analyzed under the maximum likelihood approach for different multivariate Bernoulli mixtures. This alternative form of the criterion reveals aspects of clustering techniques that are not otherwise apparent. All the criteria discussed can be optimized with the alternating optimization algorithm. Some illustrative applications are included.
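A compact illustration of the classification maximum likelihood idea for binary data: a classification EM (CEM) loop for a mixture of multivariate Bernoulli distributions, alternating a hard assignment with per-cluster parameter updates. The smoothing constant and the simulated data are illustrative assumptions, and the Bryant-Windham construction itself is not reproduced here.

```python
# Sketch: classification maximum likelihood for binary data via CEM on a
# multivariate Bernoulli mixture (hard assignment + per-cluster updates).
import numpy as np

def cem_bernoulli(X, k, n_iter=50, eps=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z = rng.integers(0, k, size=n)                   # initial hard partition
    for _ in range(n_iter):
        pi = np.array([(z == j).mean() for j in range(k)]) + eps
        theta = np.array([X[z == j].mean(axis=0) if np.any(z == j)
                          else np.full(d, 0.5) for j in range(k)])
        theta = np.clip(theta, eps, 1 - eps)         # avoid log(0)
        # Classification step: assign each row to its most likely component.
        loglik = (np.log(pi)[None, :]
                  + X @ np.log(theta).T
                  + (1 - X) @ np.log(1 - theta).T)
        z_new = loglik.argmax(axis=1)
        if np.array_equal(z_new, z):
            break
        z = z_new
    return z, theta

rng = np.random.default_rng(6)
proto = np.array([[0.9, 0.9, 0.1, 0.1], [0.1, 0.1, 0.9, 0.9]])
X = (rng.random((400, 4)) < proto[rng.integers(0, 2, 400)]).astype(float)
z, theta = cem_bernoulli(X, k=2)
print("estimated Bernoulli parameters:\n", np.round(theta, 2))
```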
15.
We examine the problem of aggregating several partitions of a finite set into a single consensus partition. We note that the dual concepts of clustering and isolation are especially significant in this connection. The hypothesis that a consensus partition should respect unanimity with respect to either concept leads us to stress a consensus interval rather than a single partition. The extremes of this interval are characterized axiomatically. If a sufficient totality of traits has been measured, and if measurement errors are independent, then a true classifying partition can be expected to lie in the consensus interval. The structure of the partitions in the interval lends itself to partial solutions of the consensus problem. Conditional entropy may be used to quantify the uncertainty inherent in the interval as a whole.
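To make the final remark concrete, here is a small sketch computing the conditional entropy H(Q | P) of one partition given another from their joint label distribution. The partitions used are hypothetical, and how the paper aggregates this quantity over the consensus interval is not shown in the abstract.

```python
# Sketch: conditional entropy H(Q | P) of partition Q given partition P,
# computed from the joint distribution of cluster labels.
import numpy as np

def conditional_entropy(p_labels, q_labels):
    p_labels, q_labels = np.asarray(p_labels), np.asarray(q_labels)
    n = len(p_labels)
    joint = np.zeros((p_labels.max() + 1, q_labels.max() + 1))
    for a, b in zip(p_labels, q_labels):
        joint[a, b] += 1.0 / n                       # joint label distribution
    p_marg = joint.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        cond = np.where(joint > 0, joint * np.log(joint / p_marg), 0.0)
    return -cond.sum()

P = [0, 0, 0, 1, 1, 1, 2, 2]
Q = [0, 0, 1, 1, 1, 1, 2, 2]
print("H(Q | P) =", round(conditional_entropy(P, Q), 3))   # in nats
```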
16.
Christian Hennig. Journal of Classification, 2002, 19(2): 249-276
In this paper an algorithm is developed which aims to find all fixed point clusters (FPCs) of a dataset corresponding to well-separated linear regression subpopulations. Its ability to find such subpopulations in the presence of outliers is compared to methods based on ML estimation of mixture models by means of a simulation study. Furthermore, FPC analysis is applied to a real dataset.
17.
Seong Keon Lee. Journal of Classification, 2006, 23(1): 123-141
In many application fields, multivariate approaches that simultaneously consider the correlation between responses are needed. The tree method can be extended to multivariate responses, such as repeated measures and longitudinal data, by modifying the split function so as to accommodate multiple responses. Recently, researchers have constructed decision trees for multiple continuous longitudinal responses and multiple binary responses using the Mahalanobis distance and a generalized entropy index. However, these methods are limited by the type of response they can handle, that is, responses that are only continuous or only binary. In this paper, we modify the univariate-response tree procedure and suggest a new tree-based method that can analyze any type of multiple responses by using generalized estimating equations (GEE) techniques. To compare the performance of the trees, simulation studies on the selection probability of the true split variable are presented. Finally, applications using epileptic seizure data and WWW data are introduced.
18.
Decision-making is at the core of management and runs through the entire management process. We therefore need to understand the needs of front-line scientists, study the problems facing the science fund's international cooperation, and adopt more scientific analysis methods to improve the quality of decision-making on that cooperation. This paper applies data mining techniques to historical data on the science fund's international cooperation and to information on scientists' needs, identifies some noteworthy phenomena and several problems, and offers a number of policy recommendations aimed at improving the decision-making and management of the fund's international cooperation.
19.
Optimal Variable Weighting for Ultrametric and Additive Trees and K-means Partitioning: Methods and Software (total citations: 1; self-citations: 0; citations by others: 1)
This paper deals with optimal variable weighting for ultrametric and additive tree clustering and for K-means partitioning. We also describe some new features and improvements to the algorithm proposed by De Soete. Monte Carlo simulations have been conducted using different error conditions. In all cases (i.e., ultrametric or additive trees, or K-means partitioning), the simulation results indicate that the optimal weighting procedure should be used for analyzing data containing noisy variables that do not contribute relevant information to the classification structure. However, if the data involve error-perturbed variables that are relevant to the classification, or outliers, it seems better to cluster or partition the entities by using variables with equal weights. A new computer program, OVW, which is available to researchers as freeware, implements improved algorithms for optimal variable weighting for ultrametric and additive tree clustering, and includes a new algorithm for optimal variable weighting for K-means partitioning.
20.
Building a concept system and carrying out terminology work are the foundation of developing any standard. In multidisciplinary, multi-domain settings, this work faces the major challenge that different stakeholders have differing needs, making consensus hard to reach. This article reviews the experience of ITU-T FG-DPM in building consensus on common concepts among different stakeholders and project-group members and in establishing unified terms and definitions: by standardizing the process of constructing the concept system and adopting a multi-dimensional, collaborative view of terminology, a unified...