Similar Documents
20 similar documents found (search time: 736 ms)
1.
In this paper we show how biplot methodology can be combined with various forms of discriminant analysis, leading to highly informative visual displays of the respective class separations. It is demonstrated that the concept of distance, as applied to discriminant analysis, provides a unified approach to a wide variety of discriminant analysis procedures, which can be accommodated simply by changing to an appropriate distance metric. These changes in the distance metric are crucial for the construction of appropriate biplots. Several new types of biplots, viz. quadratic discriminant analysis biplots for use with heteroscedastic stratified data, discriminant subspace biplots, and flexible discriminant analysis biplots, are derived and their use is illustrated. Advantages of the proposed procedures are pointed out. Although biplot methodology is particularly well suited to complementing discrimination problems with J > 2 classes, its use in 2-class problems is also illustrated.

2.
We consider applying a functional logistic discriminant procedure to the analysis of handwritten character data. Time-course trajectories corresponding to the X and Y coordinate values of handwritten characters written in the air with one finger are converted into a functional data set via regularized basis expansion. We then apply functional logistic modeling to classify the functions into several classes. In order to select the values of the tuning parameters involved in the functional logistic model, we derive a model selection criterion for evaluating models estimated by the method of regularization. Results indicate the effectiveness of our modeling strategy in terms of prediction accuracy.

3.
I consider a new problem of classification into n (n ≥ 2) disjoint classes based on features of unclassified data. It is assumed that the data are grouped into m (m ≥ n) disjoint sets, and within each set the distribution of features is a mixture of the distributions corresponding to the particular classes. Moreover, the mixing proportions are assumed known and form a matrix of rank n. The idea of the solution is first to estimate the feature densities in all the groups and then to solve the linear system for the component densities. The proposed classification method is asymptotically optimal, provided a consistent method of density estimation is used. For illustration, the method is applied to determining perfusion status in myocardial infarction patients, using creatine kinase measurements.
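The core allocation step can be sketched as follows, assuming (as the abstract states) a known m × n mixing matrix of full column rank; the group densities, the data values, and the resulting label are hypothetical illustrations, not taken from the paper:

```python
import numpy as np

# Hypothetical setup: m = 3 observed groups, n = 2 latent classes.
# Each row of A holds the known mixing proportions of one group.
A = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])          # rank 2, so the system is solvable

# Suppose density estimation in each group produced these values at a
# point x whose class we want to predict.
f_groups = np.array([0.30, 0.22, 0.10])

# Solve A @ f_classes ≈ f_groups (least squares) for the component
# densities of the two classes at x.
f_classes, *_ = np.linalg.lstsq(A, f_groups, rcond=None)

# Allocate x to the class with the largest estimated density
# (uniform class priors assumed in this sketch).
label = int(np.argmax(f_classes))
```

In practice the group densities would come from a consistent density estimator evaluated at x, as the abstract requires for asymptotic optimality.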

4.
Spectral analysis of phylogenetic data
The spectral analysis of sequence and distance data is a new approach to phylogenetic analysis. For two-state character sequences, the character values at a given site split the set of taxa into two subsets, a bipartition of the taxa set. The vector which counts the relative numbers of each of these bipartitions over all sites is called a sequence spectrum. Applying a transformation called a Hadamard conjugation, the sequence spectrum is transformed to the conjugate spectrum. This conjugation corrects for unobserved changes in the data, independently of the choice of phylogenetic tree. For any given phylogenetic tree with edge weights (probabilities of state change), we define a corresponding tree spectrum. The selection of a weighted phylogenetic tree from the given sequence data is made by matching the conjugate spectrum with a tree spectrum. We develop an optimality selection procedure using a least squares best fit to find the phylogenetic tree whose tree spectrum most closely matches the conjugate spectrum. An inferred sequence spectrum can be derived from the selected tree spectrum using the inverse Hadamard conjugation, allowing a comparison with the original sequence spectrum. A possible adaptation for the analysis of four-state character sequences with unequal frequencies is considered. A corresponding spectral analysis for distance data is also introduced. These analyses are illustrated with biological examples for both distance and sequence data. Spectral analysis using the fast Hadamard transform allows optimal trees to be found for at least 20 taxa and perhaps for up to 30 taxa. The development presented here is self-contained, although some mathematical proofs available elsewhere have been omitted. The analysis of sequence data is based on methods reported earlier, but the terminology and the application to distance data are new.
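For two-state characters the Hadamard conjugation described above takes the explicit form γ = H⁻¹ ln(H s); a minimal sketch with a hypothetical spectrum for three taxa (the frequencies are invented for illustration):

```python
import numpy as np

# Hypothetical sequence spectrum for 3 taxa: 2^(3-1) = 4 bipartitions.
# Entry 0 is the proportion of constant sites; the rest are observed
# bipartition frequencies.
s = np.array([0.70, 0.15, 0.10, 0.05])

H2 = np.array([[1, 1], [1, -1]])
H = np.kron(H2, H2)                 # 4x4 Sylvester-type Hadamard matrix

# Conjugate spectrum: gamma = H^{-1} ln(H s), correcting for
# unobserved changes independently of any tree.
gamma = np.linalg.solve(H, np.log(H @ s))

# The inverse conjugation s = H^{-1} exp(H gamma) maps a spectrum back,
# so round-tripping recovers the original sequence spectrum.
s_back = np.linalg.solve(H, np.exp(H @ gamma))
```

Tree selection would then compare gamma against candidate tree spectra by least squares, as the abstract describes.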

5.
This paper develops a new procedure for simultaneously performing multidimensional scaling and cluster analysis on two-way compositional data of proportions. The objective of the proposed procedure is to delineate patterns of variability in compositions across subjects by simultaneously clustering subjects into latent classes or groups and estimating a joint space of stimulus coordinates and class-specific vectors in a multidimensional space. We use a conditional mixture, maximum likelihood framework with an E-M algorithm for parameter estimation. The proposed procedure is illustrated using a compositional data set reflecting proportions of viewing time across television networks for an area sample of households.

6.
Scientific term extraction is a key step in the automatic processing of scientific terminology, and it matters for downstream tasks such as machine translation, information retrieval, and question answering. Traditional manual term extraction is costly in human effort. One automatic approach casts term extraction as a sequence labeling problem and trains a labeling model by supervised learning, but it faces the lack of a large-scale annotated corpus of scientific terms. This article introduces distant supervision to generate large-scale annotated training …

7.
In two-class discriminant problems, objects are allocated to one of the two classes by means of threshold rules based on discriminant functions. In this paper we propose to examine the quality of a discriminant function g in terms of its performance curve. This curve is the plot of the two misclassification probabilities as the threshold t assumes various real values. The role of such performance curves in evaluating and ordering discriminant functions and solving discriminant problems is presented. In particular, it is shown that: (i) the convexity of such a curve is a sufficient condition for optimal use of the information contained in the data reduced by g, and (ii) a g with a non-convex performance curve should be corrected by an explicitly obtained transformation.
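A performance curve in this sense is straightforward to compute from discriminant scores; a small sketch with invented scores (the function name and data are illustrative, not from the paper):

```python
import numpy as np

def performance_curve(g0, g1, thresholds):
    """For each threshold t, return the pair of misclassification
    probabilities (P(g > t | class 0), P(g <= t | class 1))."""
    return [(float(np.mean(g0 > t)), float(np.mean(g1 <= t)))
            for t in thresholds]

# Hypothetical discriminant scores: class 0 should score low.
g0 = np.array([0.1, 0.4, 0.6, 0.9])
g1 = np.array([0.5, 0.7, 0.8, 1.2])

# Moving the threshold trades one error probability for the other.
curve = performance_curve(g0, g1, thresholds=[0.3, 0.55, 0.95])
```

Plotting the resulting pairs against each other gives the performance curve whose convexity the paper analyzes.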

8.
A probabilistic DEDICOM model was proposed for mobility tables. The model attempts to explain observed transition probabilities by a latent mobility table and a set of transition probabilities from latent classes to observed classes. The model captures asymmetry in observed mobility tables by asymmetric latent mobility tables. It may be viewed as a special case of both the latent class model and DEDICOM with special constraints. A maximum penalized likelihood (MPL) method was developed for parameter estimation. The EM algorithm was adapted for the MPL estimation. Two examples were given to illustrate the proposed method. The work reported in this paper has been supported by grant A6394 to the first author from the Natural Sciences and Engineering Research Council of Canada and by a fellowship of the Royal Netherlands Academy of Arts and Sciences to the second author. We would like to thank anonymous reviewers for their insightful comments.

9.
The Self-Organizing Feature Map (SOFM; Kohonen 1984) algorithm is a well-known example of unsupervised learning in connectionism and is a clustering method closely related to k-means. Generally the data set is available before running the algorithm, and the clustering problem can be approached by optimizing an inertia criterion. In this paper we consider a probabilistic approach to this problem. We propose a new algorithm based on the Expectation-Maximization principle (EM; Dempster, Laird, and Rubin 1977). The new method can be viewed as a Kohonen-type EM and gives better insight into the SOFM in terms of constrained clustering. We perform numerical experiments and compare our results with the standard Kohonen approach.
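The probabilistic reading of this family of methods can be illustrated with a bare EM loop for an isotropic Gaussian mixture with equal weights and fixed variance; the SOFM neighborhood constraint of the paper's algorithm is deliberately omitted here, and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 1-D clusters (synthetic data).
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])

sigma2 = 1.0
mu = np.array([-1.0, 1.0])              # initial prototype positions
for _ in range(50):
    # E-step: responsibilities under equal-weight Gaussian components.
    d2 = (x[:, None] - mu[None, :]) ** 2
    r = np.exp(-d2 / (2 * sigma2))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: move each prototype to its responsibility-weighted mean
    # (the soft analogue of the k-means / Kohonen update).
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
```

A Kohonen-type EM would additionally smooth the responsibilities over a neighborhood structure on the prototypes.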

10.
MCLUST is a software package for model-based clustering, density estimation and discriminant analysis interfaced to the S-PLUS commercial software and the R language. It implements parameterized Gaussian hierarchical clustering algorithms and the EM algorithm for parameterized Gaussian mixture models with the possible addition of a Poisson noise term. Also included are functions that combine hierarchical clustering, EM and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering, density estimation, and discriminant analysis. MCLUST provides functionality for displaying and visualizing clustering and classification results. A web page with related links can be found at .

11.
Multiple imputation is one of the most highly recommended procedures for dealing with missing data. However, to date little attention has been paid to methods for combining the results of principal component analyses applied to a multiply imputed data set. In this paper we propose Generalized Procrustes analysis for this purpose, whose centroid solution can be used as a final estimate of the component loadings. Convex hulls based on the loadings of the imputed data sets can be used to represent the uncertainty due to the missing data. In two simulation studies, the performance of the Generalized Procrustes approach is evaluated and compared with other methods. More specifically, it is studied how these methods behave when order changes of components and sign reversals of component loadings occur, as in the case of near-equal eigenvalues, or of data having almost as many counterindicative items as indicative items. The simulations show that the other proposed methods either may run into serious problems or are unable to adequately assess the uncertainty due to the presence of missing data. However, when the above situations do not occur, all methods provide adequate estimates of the PCA loadings.
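The alignment step can be sketched with a plain orthogonal Procrustes rotation (full Generalized Procrustes analysis iterates this against an evolving centroid; here a single fixed target is used for brevity, and the loading matrices are invented to mimic a component swap plus a sign reversal):

```python
import numpy as np

def procrustes_rotate(A, B):
    """Orthogonal T minimizing ||A @ T - B||_F (SVD solution)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Hypothetical loadings from three imputed data sets: the second has
# its components swapped and one sign flipped, the third is perturbed.
L1 = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
L2 = L1[:, ::-1] * np.array([1, -1])
L3 = L1 + 0.01

target = L1                       # align every solution to the first
aligned = [L @ procrustes_rotate(L, target) for L in (L1, L2, L3)]
centroid = np.mean(aligned, axis=0)   # combined loading estimate
```

After alignment, the spread of the per-imputation loadings around the centroid (e.g. as convex hulls per variable) visualizes the missing-data uncertainty.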

12.
A natural extension of classical metric multidimensional scaling is proposed. The result is a new formulation of nonmetric multidimensional scaling in which the strain criterion is minimized subject to order constraints on the disparity variables. Innovative features of the new formulation include: the parametrization of the p-dimensional distance matrices by the positive semidefinite matrices of rank ≤ p; optimization of the (squared) disparity variables, rather than the configuration coordinate variables; and a new nondegeneracy constraint, which restricts the set of (squared) disparities rather than the set of distances. Solutions are obtained using an easily implemented gradient projection method for numerical optimization. The method is applied to two published data sets.
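For reference, the classical metric scaling that this formulation extends can be written in a few lines; the helper name and the toy squared-distance matrix are illustrative:

```python
import numpy as np

def classical_mds(D, p=2):
    """Classical (Torgerson) scaling of a squared-distance matrix D
    into p dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D @ J                  # double-centred Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:p]         # keep the p largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Squared distances of the collinear points 0, 1, 3; a 1-D embedding
# recovers their spacing exactly (up to sign and translation).
D = np.array([[0.0, 1.0, 9.0],
              [1.0, 0.0, 4.0],
              [9.0, 4.0, 0.0]])
Y = classical_mds(D, p=1)
```

The nonmetric formulation of the paper replaces the fixed squared distances with order-constrained disparity variables and optimizes over them.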

13.
In this study, we consider interval data that summarize original samples (individuals) of classical point data. Such intervals are termed interval symbolic data in a relatively new research domain called symbolic data analysis. Most existing research, such as the (centre, radius) and [lower boundary, upper boundary] representations, describes an interval using only its boundaries. However, these representations are valid only under the assumption that the individuals contained in the interval follow a uniform distribution. In practice, individuals are usually not uniformly distributed, so such representations may be inconsistent with the data and may lose information by ignoring the point data within the intervals. In this study, we propose a new representation of interval symbolic data that takes into account the point data contained in the intervals. We then apply the city-block distance metric to the new representation and propose a dynamic clustering approach for interval symbolic data. A simulation experiment is conducted to evaluate the performance of our method. The results show that, when the individuals contained in an interval do not follow a uniform distribution, the proposed method significantly outperforms dynamic clustering with the Hausdorff and city-block distances on the traditional representation. Finally, we give an application example on an automobile data set.
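The two boundary-based baseline distances mentioned above can be stated in a couple of lines (the paper's proposed representation, which also uses the point data inside each interval, is not reproduced here):

```python
def interval_cityblock(a, b):
    """City-block distance on the [lower, upper] representation."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def interval_hausdorff(a, b):
    """Hausdorff distance between two closed real intervals."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))
```

For the intervals [0, 2] and [1, 5] these give 4 and 3 respectively; both depend only on the boundaries, which is exactly the limitation the paper addresses.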

14.
T clusters, based on J distinct, contributory partitions (or, equivalently, J polytomous attributes). We describe a new model/algorithm for implementing this objective. The method's objective function incorporates a modified Rand measure, both in initial cluster selection and in subsequent refinement of the starting partition. The method is applied to both synthetic and real data. The performance of the proposed model is compared to latent class analysis of the same data set.

15.
Incremental Classification with Generalized Eigenvalues
Supervised learning techniques are widely accepted methods for analyzing data in scientific and real-world problems. Most of these problems require fast and continuous acquisition of data, which are to be used in training the learning system. Therefore, keeping such systems updated may become cumbersome. Various techniques have been devised in the field of machine learning to solve this problem. In this study, we propose an algorithm to reduce the training data to a substantially smaller subset of the original training data for training a generalized eigenvalue classifier. The proposed method provides a constructive way to understand the influence of new training data on an existing classification function. We show through numerical experiments that this technique prevents the overfitting problem of earlier generalized eigenvalue classifiers, while achieving classification performance comparable to state-of-the-art classification methods.

16.
We propose a new nonparametric family of oscillation heuristics for improving linear classifiers in the two-group discriminant problem. The heuristics are motivated by the intuition that the classification accuracy of a separating hyperplane can be improved through small perturbations to its slope and position, accomplished by substituting training observations near the hyperplane for those used to generate it. In an extensive simulation study, using data generated from multivariate normal distributions under a variety of conditions, the oscillation heuristics consistently improve upon the classical linear and logistic discriminant functions, as well as two published linear programming-based heuristics and a linear Support Vector Machine. Added to any of the methods above, they approach, and frequently attain, the best possible accuracy on the training samples, as determined by a mixed-integer programming (MIP) model, at a much smaller computational cost. They also improve expected accuracy on the overall populations when the populations overlap significantly and the heuristics are trained with large samples, at least in situations where the data conditions do not explicitly favor a particular classifier.

17.
The paper presents methodology for analyzing a set of partitions of the same set of objects, by dividing them into classes of partitions that are similar to one another. Two different definitions are given for the consensus partition which summarizes each class of partitions. The classes are obtained using either constrained or unconstrained clustering algorithms. Two applications of the methodology are described.

18.
Optimization Strategies for Two-Mode Partitioning
Two-mode partitioning is a relatively new form of clustering that clusters both rows and columns of a data matrix. In this paper, we consider deterministic two-mode partitioning methods in which a criterion similar to k-means is optimized. A variety of optimization methods have been proposed for this type of problem. However, it is still unclear which method should be used, as various methods may lead to non-global optima. This paper reviews and compares several optimization methods for two-mode partitioning. Several known methods are discussed, and a new fuzzy steps method is introduced. The fuzzy steps method is based on the fuzzy c-means algorithm of Bezdek (1981) and the fuzzy steps approach of Heiser and Groenen (1997) and Groenen and Jajuga (2001). The performances of all methods are compared in a large simulation study. In our simulations, a two-mode k-means optimization method most often gives the best results. Finally, an empirical data set is used to give a practical example of two-mode partitioning. We would like to thank two anonymous referees whose comments have improved the quality of this paper. We are also grateful to Peter Verhoef for providing the data set used in this paper.

19.
Analysis of between-group differences using canonical variates assumes equality of population covariance matrices. Sometimes these matrices are sufficiently different for the null hypothesis of equality to be rejected, but there exist some common features which should be exploited in any analysis. The common principal component model is often suitable in such circumstances, and this model is shown to be appropriate in a practical example. Two methods for between-group analysis are proposed when this model replaces the equal dispersion matrix assumption. One method is by extension of the two-stage approach to canonical variate analysis using sequential principal component analyses as described by Campbell and Atchley (1981). The second method is by definition of a distance function between populations satisfying the common principal component model, followed by metric scaling of the resulting between-populations distance matrix. The two methods are compared with each other and with ordinary canonical variate analysis on the previously introduced data set.

20.
Bilingual aligned term banks are an important resource in natural language processing and matter greatly for multilingual applications such as cross-lingual information retrieval and machine translation. Bilingual term pairs are usually obtained by manual translation or by automatic extraction from bilingual parallel corpora. However, manual translation requires specialized expertise and is labor- and time-intensive, while large domain-specific bilingual parallel corpora are hard to obtain. In contrast, monolingual term banks for various languages within the same domain are comparatively easy to acquire. We therefore propose a method that automatically aligns terms across two monolingual term banks in different languages to build a bilingual term correspondence table. The method first uses several online machine translation engines with a voting mechanism to generate a "pseudo" term on the target side; it then uses this pseudo term to retrieve a candidate set from the target-language term bank; finally, it re-ranks the candidates with an mBERT-based semantic matching algorithm to obtain the final bilingual term pairs. Experiments on Chinese-English term alignment in three domains (computer science, civil engineering, and medicine) show that the method improves the accuracy of bilingual term extraction.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号