20 similar records found (search time: 736 ms)
1.
In this paper we show how biplot methodology can be combined with
various forms of discriminant analyses leading to highly informative visual displays of
the respective class separations. It is demonstrated that the concept of distance as
applied to discriminant analysis provides a unified approach to a wide variety of
discriminant analysis procedures that can be accommodated by just changing to an
appropriate distance metric. These changes in the distance metric are crucial for the
construction of appropriate biplots. Several new types of biplots viz. quadratic
discriminant analysis biplots for use with heteroscedastic stratified data, discriminant
subspace biplots and flexible discriminant analysis biplots are derived and their use
illustrated. Advantages of the proposed procedures are pointed out. Although biplot
methodology is particularly well suited to complementing discrimination problems with
J > 2 classes, its use in two-class problems is also illustrated.
2.
Multiclass Functional Discriminant Analysis and Its Application to Gesture Recognition
We consider applying a functional logistic discriminant procedure to the analysis of handwritten character data. Time-course
trajectories corresponding to the X and Y coordinate values of handwritten characters written in the air with one finger are
converted into a functional data set via regularized basis expansion. We then apply functional logistic modeling to classify
the functions into several classes. In order to select the values of the tuning parameters involved in the functional logistic
model, we derive a model selection criterion for evaluating models estimated by the method of regularization. Results indicate
the effectiveness of our modeling strategy in terms of prediction accuracy.
3.
Marek Ancukiewicz 《Journal of Classification》1998,15(1):129-141
I consider a new problem of classification into n (n ≥ 2) disjoint classes based on features of unclassified data. It is assumed that the data are grouped into m (m ≥ n) disjoint sets and that within each set the distribution of features is a mixture of the distributions corresponding to the particular
classes. Moreover, the mixing proportions are assumed known and form a matrix of rank n. The idea of the solution is, first, to estimate the feature densities in all the groups, and then to solve the linear system for the component
densities. The proposed classification method is asymptotically optimal, provided a consistent method of density estimation
is used. For illustration, the method is applied to determining perfusion status in myocardial infarction patients, using
creatine kinase measurements.
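The linear-system step in this approach can be sketched numerically. The following is a minimal illustration, not the author's code: the mixing matrix and densities are invented. Group densities are known mixtures of the class densities, and the class densities are recovered by least squares.

```python
import numpy as np

# Known mixing proportions: row i gives the class composition of group i.
# m = 3 groups, n = 2 classes; the matrix has full column rank n.
P = np.array([[0.8, 0.2],
              [0.5, 0.5],
              [0.1, 0.9]])

# Class-conditional feature densities evaluated on a grid (toy normals).
x = np.linspace(-4.0, 4.0, 201)
f0 = np.exp(-0.5 * (x + 1.0) ** 2) / np.sqrt(2 * np.pi)  # class-0 density
f1 = np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)  # class-1 density

# Group densities: the mixtures we could actually estimate from the data.
G = P @ np.vstack([f0, f1])

# Recover the class densities by solving the (overdetermined) linear system.
F_hat, *_ = np.linalg.lstsq(P, G, rcond=None)

def classify(i, xi):
    """Assign an observation with feature xi from group i to the class with
    the larger posterior weight P[i, class] * class density at xi."""
    dens = np.array([np.interp(xi, x, F_hat[0]), np.interp(xi, x, F_hat[1])])
    return int(np.argmax(P[i] * dens))
```

With consistent density estimates in place of the exact group densities used here, the same solve yields the asymptotically optimal rule described in the abstract.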
4.
Spectral analysis of phylogenetic data
The spectral analysis of sequence and distance data is a new approach to phylogenetic analysis. For two-state character sequences,
the character values at a given site split the set of taxa into two subsets, a bipartition of the taxa set. The vector which
counts the relative numbers of each of these bipartitions over all sites is called a sequence spectrum. Applying a transformation
called a Hadamard conjugation, the sequence spectrum is transformed to the conjugate spectrum. This conjugation corrects for
unobserved changes in the data, independently from the choice of phylogenetic tree. For any given phylogenetic tree with edge
weights (probabilities of state change), we define a corresponding tree spectrum. The selection of a weighted phylogenetic
tree from the given sequence data is made by matching the conjugate spectrum with a tree spectrum. We develop an optimality
selection procedure using a least squares best fit, to find the phylogenetic tree whose tree spectrum most closely matches
the conjugate spectrum. An inferred sequence spectrum can be derived from the selected tree spectrum using the inverse Hadamard
conjugation to allow a comparison with the original sequence spectrum.
A possible adaptation for the analysis of four-state character sequences with unequal frequencies is considered. A corresponding
spectral analysis for distance data is also introduced. These analyses are illustrated with biological examples for both distance
and sequence data. Spectral analysis using the Fast Hadamard transform allows optimal trees to be found for at least 20 taxa
and perhaps for up to 30 taxa.
The development presented here is self-contained, although some mathematical proofs available elsewhere have been omitted.
The analysis of sequence data is based on methods reported earlier, but the terminology and the application to distance data
are new.
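The Hadamard conjugation at the heart of this approach can be sketched numerically. The sketch below uses the usual two-state formulation (conjugate spectrum γ = H⁻¹ ln(H s), with inverse s = H⁻¹ exp(H γ)); the tree-spectrum values are invented for illustration.

```python
import numpy as np
from scipy.linalg import hadamard

n = 8                      # 2^(t-1) spectrum components for t = 4 taxa
H = hadamard(n)            # symmetric {+1, -1} Hadamard matrix, H^-1 = H / n

def conjugate_spectrum(s):
    """Hadamard conjugation: corrects a sequence spectrum for unobserved
    changes, independently of any particular tree."""
    return H @ np.log(H @ s) / n

def expected_spectrum(gamma):
    """Inverse conjugation: the sequence spectrum a tree spectrum generates."""
    return H @ np.exp(H @ gamma) / n

# Toy tree spectrum: small positive edge weights; by convention the first
# entry balances the rest.
gamma = np.array([0.0, 0.05, 0.02, 0.03, 0.01, 0.04, 0.02, 0.03])
gamma[0] = -gamma[1:].sum()

s = expected_spectrum(gamma)        # spectrum the tree would generate
recovered = conjugate_spectrum(s)   # round trip recovers the tree spectrum
```

The round trip works because H² = nI, so the conjugation exactly undoes the inverse transformation; tree selection then amounts to least-squares matching of `recovered` against candidate tree spectra.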
5.
This paper develops a new procedure for simultaneously performing multidimensional scaling and cluster analysis on two-way
compositional data of proportions. The objective of the proposed procedure is to delineate patterns of variability in compositions
across subjects by simultaneously clustering subjects into latent classes or groups and estimating a joint space of stimulus
coordinates and class-specific vectors in a multidimensional space. We use a conditional mixture, maximum likelihood framework
with an EM algorithm for parameter estimation. The proposed procedure is illustrated using a compositional data set reflecting
proportions of viewing time across television networks for an area sample of households.
6.
7.
Wieslaw Szczesny 《Journal of Classification》1991,8(2):201-215
In two-class discriminant problems, objects are allocated to one of the two classes by means of threshold rules based on discriminant
functions. In this paper we propose to examine the quality of a discriminant function g in terms of its performance curve. This curve is the plot of the two misclassification probabilities as the threshold t assumes various real values. The role of such performance curves in evaluating and ordering discriminant functions and solving
discriminant problems is presented. In particular, it is shown that: (i) the convexity of such a curve is a sufficient condition
for optimal use of the information contained in the data reduced by g, and (ii) a g with a non-convex performance curve should be corrected by an explicitly obtained transformation.
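A performance curve of this kind is easy to compute empirically. The following sketch (toy data, not from the paper) traces the two misclassification probabilities of a threshold rule as the threshold varies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discriminant scores g(x) for two classes (toy data: shifted normals).
g_class1 = rng.normal(0.0, 1.0, 5000)   # class 1 should score below t
g_class2 = rng.normal(2.0, 1.0, 5000)   # class 2 should score above t

def performance_curve(g1, g2, thresholds):
    """For each threshold t, the two misclassification rates:
    alpha(t) = P(g > t | class 1), beta(t) = P(g <= t | class 2)."""
    alpha = np.array([(g1 > t).mean() for t in thresholds])
    beta = np.array([(g2 <= t).mean() for t in thresholds])
    return alpha, beta

ts = np.linspace(-3.0, 5.0, 81)
alpha, beta = performance_curve(g_class1, g_class2, ts)
```

Plotting `beta` against `alpha` gives the performance curve; checking its convexity is the diagnostic the paper proposes for whether g uses the data optimally.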
8.
A probabilistic DEDICOM model was proposed for mobility tables. The model attempts to explain observed transition probabilities
by a latent mobility table and a set of transition probabilities from latent classes to observed classes. The model captures
asymmetry in observed mobility tables by asymmetric latent mobility tables. It may be viewed as a special case of both the
latent class model and DEDICOM with special constraints. A maximum penalized likelihood (MPL) method was developed for parameter
estimation. The EM algorithm was adapted for the MPL estimation. Two examples were given to illustrate the proposed method.
The work reported in this paper has been supported by grant A6394 to the first author from the Natural Sciences and Engineering
Research Council of Canada and by a fellowship of the Royal Netherlands Academy of Arts and Sciences to the second author.
We would like to thank anonymous reviewers for their insightful comments.
9.
The Self-Organizing Feature Maps (SOFM; Kohonen 1984) algorithm is a well-known example of unsupervised learning in connectionism and is a clustering method closely related to k-means. Generally the data set is available before running the algorithm, and the clustering problem can be approached by optimizing an inertia criterion. In this paper we consider the probabilistic approach to this problem. We propose a new algorithm based on the Expectation Maximization principle (EM; Dempster, Laird, and Rubin 1977). The new method can be viewed as a Kohonen-type EM and gives better insight into the SOFM from the perspective of constrained clustering. We perform numerical experiments and compare our results with the standard Kohonen approach.
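As a minimal illustration of the EM principle underlying this approach (omitting the Kohonen neighborhood constraint, which is the paper's actual contribution), here is a "soft k-means" EM for an isotropic Gaussian mixture on toy one-dimensional data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two well-separated 1-D clusters
X = np.concatenate([rng.normal(-3.0, 0.5, 300), rng.normal(3.0, 0.5, 300)])

def em_isotropic(X, K, n_iter=50, sigma=0.5):
    """EM for a K-component Gaussian mixture with fixed, equal variances:
    the probabilistic counterpart of k-means. Returns the sorted means."""
    mu = np.quantile(X, np.linspace(0.1, 0.9, K))   # crude initialization
    for _ in range(n_iter):
        # E-step: responsibilities proportional to component densities
        d2 = (X[:, None] - mu[None, :]) ** 2
        r = np.exp(-d2 / (2 * sigma ** 2))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: each mean is the responsibility-weighted average
        mu = (r * X[:, None]).sum(axis=0) / r.sum(axis=0)
    return np.sort(mu)

means = em_isotropic(X, K=2)
```

A Kohonen-type EM would additionally couple the responsibilities of neighboring units on the map grid, which is what ties this formulation back to the SOFM.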
10.
MCLUST is a software package for model-based clustering, density estimation
and discriminant analysis interfaced to the S-PLUS commercial software and the R language.
It implements parameterized Gaussian hierarchical clustering algorithms and the
EM algorithm for parameterized Gaussian mixture models with the possible addition of a
Poisson noise term. Also included are functions that combine hierarchical clustering, EM
and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering,
density estimation, and discriminant analysis. MCLUST provides functionality for displaying
and visualizing clustering and classification results. A web page with related links can
be found at .
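MCLUST itself is an S-PLUS/R package, but its core strategy (fit Gaussian mixture models over a range of component counts and select among them with BIC) can be sketched in Python with scikit-learn, assuming that library is available:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Toy data: two well-separated 2-D clusters
X = np.vstack([rng.normal(0.0, 0.4, (150, 2)),
               rng.normal(3.0, 0.4, (150, 2))])

# MCLUST-style strategy: fit mixtures for several numbers of components
# and keep the model with the best BIC (lowest, in scikit-learn's sign
# convention).
models = [GaussianMixture(n_components=k, random_state=0).fit(X)
          for k in range(1, 5)]
bics = [m.bic(X) for m in models]
best = models[int(np.argmin(bics))]
```

MCLUST additionally varies the covariance parameterization and uses hierarchical clustering to initialize EM; the sketch above only reproduces the BIC-selection step.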
11.
Multiple imputation is one of the most highly recommended procedures for dealing with missing data. However, to date little attention has been paid to methods for combining the results of principal component analyses applied to a multiply imputed data set. In this paper we propose Generalized Procrustes analysis for this purpose, whose centroid solution can be used as a final estimate of the component loadings. Convex hulls based on the loadings of the imputed data sets can be used to represent the uncertainty due to the missing data. In two simulation studies, the performance of the Generalized Procrustes approach is evaluated and compared with that of other methods. More specifically, we study how these methods behave when order changes of components and sign reversals of component loadings occur, as in the case of near-equal eigenvalues, or of data having almost as many counterindicative items as indicative items. The simulations show that the other proposed methods either may run into serious problems or are unable to adequately assess the accuracy in the presence of missing data. When the above situations do not occur, however, all methods provide adequate estimates of the PCA loadings.
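The centroid idea can be sketched with SciPy's orthogonal Procrustes routine. The loading matrices below are simulated stand-ins for those obtained from multiply imputed data sets, with random rotations/sign flips playing the role of the order changes and reversals the paper discusses:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(7)

# Loadings from 5 "imputed" data sets: the same underlying loadings, mixed
# up by a random orthogonal transformation plus a little noise.
L_true = rng.normal(size=(10, 3))
loadings = []
for _ in range(5):
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal mix-up
    loadings.append(L_true @ Q + rng.normal(scale=0.01, size=(10, 3)))

def gpa_centroid(mats, n_iter=20):
    """Generalized Procrustes analysis: rotate each matrix onto the current
    centroid, then update the centroid, until it stabilizes."""
    centroid = mats[0].copy()
    for _ in range(n_iter):
        aligned = []
        for M in mats:
            R, _ = orthogonal_procrustes(M, centroid)  # best rotation M -> centroid
            aligned.append(M @ R)
        centroid = np.mean(aligned, axis=0)
    return centroid

centroid = gpa_centroid(loadings)
```

The spread of the aligned matrices around `centroid` is what the paper summarizes with convex hulls to visualize missing-data uncertainty.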
12.
Michael W. Trosset 《Journal of Classification》1998,15(1):15-35
A natural extension of classical metric multidimensional scaling is proposed. The result is a new formulation of nonmetric
multidimensional scaling in which the strain criterion is minimized subject to order constraints on the disparity variables.
Innovative features of the new formulation include: the parametrization of the p-dimensional distance matrices by the positive semidefinite matrices of rank ≤ p; optimization of the (squared) disparity variables, rather than the configuration coordinate variables; and a new nondegeneracy
constraint, which restricts the set of (squared) disparities rather than the set of distances. Solutions are obtained using
an easily implemented gradient projection method for numerical optimization. The method is applied to two published data
sets.
13.
In this study, we consider the type of interval data that summarizes original samples (individuals) given as classical point data. This type of interval data is termed interval symbolic data in a new research domain called symbolic data analysis. Most existing representations, such as the (centre, radius) and [lower boundary, upper boundary] forms, describe an interval using only its boundaries. However, these representations hold true only under the assumption that the individuals contained in the interval follow a uniform distribution. In practice, individuals are usually not uniformly distributed, so such representations may be inconsistent with the facts and may lose information by ignoring the point data within the intervals. In this study, we propose a new representation of interval symbolic data that takes into account the point data contained in the intervals. We then apply the city-block distance metric to the new representation and propose a dynamic clustering approach for interval symbolic data. A simulation experiment is conducted to evaluate the performance of our method. The results show that, when the individuals contained in the interval do not follow a uniform distribution, the proposed method significantly outperforms the Hausdorff and city-block distances based on the traditional representation in the context of dynamic clustering. Finally, we give an application example on an automobile data set.
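The paper's exact representation is not reproduced here, so the sketch below is illustrative only: it contrasts the boundary-based Hausdorff and (centre, radius) city-block distances with a simple extension that also compares the points inside the intervals (via their mean), so that equal-boundary intervals with differently distributed contents are no longer indistinguishable:

```python
import numpy as np

def hausdorff(a, b):
    """Hausdorff distance between intervals a = [a1, a2], b = [b1, b2]."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def city_block_cr(a, b):
    """City-block distance on the traditional (centre, radius) form."""
    ca, ra = (a[0] + a[1]) / 2, (a[1] - a[0]) / 2
    cb, rb = (b[0] + b[1]) / 2, (b[1] - b[0]) / 2
    return abs(ca - cb) + abs(ra - rb)

def city_block_with_points(a, pts_a, b, pts_b):
    """Illustrative extension: add the within-interval means, so intervals
    with identical boundaries but different contents differ in distance."""
    return city_block_cr(a, b) + abs(np.mean(pts_a) - np.mean(pts_b))

# Same boundaries, very different contents
a, b = (0.0, 4.0), (0.0, 4.0)
pts_a = [0.1, 0.2, 0.3, 3.9]      # mass near the lower boundary
pts_b = [3.7, 3.8, 3.9, 0.1]      # mass near the upper boundary
```

The boundary-only distances are zero for this pair, while the point-aware distance is not; this is the failure of the uniformity assumption that motivates the paper.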
14.
T clusters, based on J distinct, contributory
partitions (or, equivalently, J polytomous attributes). We describe
a new model/algorithm for implementing this objective. The method's objective
function incorporates a modified Rand measure, both in initial cluster selection
and in subsequent refinement of the starting partition. The method is applied to
both synthetic and real data. The performance of the proposed model is compared
to latent class analysis of the same data set.
15.
Incremental Classification with Generalized Eigenvalues
Claudio Cifarelli, Mario R. Guarracino, Onur Seref, Salvatore Cuciniello, Panos M. Pardalos 《Journal of Classification》2007,24(2):205-219
Supervised learning techniques are widely accepted methods to analyze data for scientific and real world problems. Most of
these problems require fast and continuous acquisition of data, which are to be used in training the learning system. Therefore,
maintaining such systems updated may become cumbersome. Various techniques have been devised in the field of machine learning
to solve this problem. In this study, we propose an algorithm that reduces the training data to a substantially smaller subset
of the original training data to train a generalized eigenvalue classifier. The proposed method provides a constructive way
to understand the influence of new training data on an existing classification function. We show through numerical experiments
that this technique prevents the overfitting problem of earlier generalized eigenvalue classifiers, while promising
classification performance comparable to that of state-of-the-art methods.
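The flavor of a generalized eigenvalue classifier can be sketched as follows. This is a simplified two-class version in the spirit of such methods, not the paper's algorithm, and the small Tikhonov term `delta` is an ad-hoc assumption: each class is summarized by the hyperplane closest to it and farthest from the other class, obtained from a symmetric generalized eigenproblem, and new points go to the class of the nearer plane.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
A = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2))   # class 0
B = rng.normal(loc=(4.0, 4.0), scale=0.5, size=(100, 2))   # class 1

def class_plane(near, far, delta=1e-6):
    """Hyperplane u = (w, gamma) minimizing ||Z_near u||^2 / ||Z_far u||^2,
    i.e. the eigenvector for the smallest generalized eigenvalue."""
    Zn = np.hstack([near, -np.ones((len(near), 1))])
    Zf = np.hstack([far, -np.ones((len(far), 1))])
    G = Zn.T @ Zn + delta * np.eye(3)   # Tikhonov term keeps both matrices regular
    H = Zf.T @ Zf + delta * np.eye(3)
    vals, vecs = eigh(G, H)             # ascending generalized eigenvalues
    return vecs[:, 0]

u0 = class_plane(A, B)                  # plane close to class 0, far from class 1
u1 = class_plane(B, A)

def predict(x):
    z = np.append(x, -1.0)
    d0 = abs(z @ u0) / np.linalg.norm(u0[:2])   # distance to class-0 plane
    d1 = abs(z @ u1) / np.linalg.norm(u1[:2])   # distance to class-1 plane
    return 0 if d0 <= d1 else 1
```

The incremental idea in the paper then amounts to monitoring how new points change these eigenproblems and retaining only a small influential subset of the training data.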
16.
We propose a new nonparametric family of oscillation heuristics for improving
linear classifiers in the two-group discriminant problem. The heuristics are motivated by
the intuition that the classification accuracy of a separating hyperplane can be improved
through small perturbations to its slope and position, accomplished by substituting training
observations near the hyperplane for those used to generate it. In an extensive simulation
study, using data generated from multivariate normal distributions under a variety of conditions,
the oscillation heuristics consistently improve upon the classical linear and logistic
discriminant functions, as well as two published linear programming-based heuristics and
a linear Support Vector Machine. Added to any of the methods above, they approach, and
frequently attain, the best possible accuracy on the training samples, as determined by a
mixed-integer programming (MIP) model, at a much smaller computational cost. They
also improve expected accuracy on the overall populations when the populations overlap
significantly and the heuristics are trained with large samples, at least in situations where
the data conditions do not explicitly favor a particular classifier.
17.
The paper presents methodology for analyzing a set of partitions of the same set of objects, by dividing them into classes
of partitions that are similar to one another. Two different definitions are given for the consensus partition which summarizes
each class of partitions. The classes are obtained using either constrained or unconstrained clustering algorithms. Two
applications of the methodology are described.
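The first step in such an analysis (quantifying similarity between partitions of the same objects) can be sketched with a plain Rand index; the partitions and the choice of index here are illustrative, not the paper's:

```python
import numpy as np
from itertools import combinations

def rand_index(p, q):
    """Rand index between two partitions given as label lists: the fraction
    of object pairs on which the partitions agree (together / apart)."""
    pairs = list(combinations(range(len(p)), 2))
    agree = sum((p[i] == p[j]) == (q[i] == q[j]) for i, j in pairs)
    return agree / len(pairs)

# Four partitions of six objects; the first two agree, the last two agree.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1],
]

# Pairwise similarity matrix between the partitions themselves; clustering
# this matrix (with a constrained or unconstrained algorithm) groups similar
# partitions, and each group can then be summarized by a consensus partition.
S = np.array([[rand_index(p, q) for q in partitions] for p in partitions])
```

Any clustering method that accepts a similarity matrix can then be applied to `S`, which is the "partitions of partitions" structure the paper analyzes.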
18.
Optimization Strategies for Two-Mode Partitioning
Joost van Rosmalen, Patrick J. F. Groenen, Javier Trejos, William Castillo 《Journal of Classification》2009,26(2):155-181
Two-mode partitioning is a relatively new form of clustering that clusters both rows and columns of a data matrix. In this
paper, we consider deterministic two-mode partitioning methods in which a criterion similar to k-means is optimized. A variety of optimization methods have been proposed for this type of problem. However, it is still unclear
which method should be used, as various methods may lead to non-global optima. This paper reviews and compares several optimization
methods for two-mode partitioning. Several known methods are discussed, and a new fuzzy steps method is introduced. The fuzzy
steps method is based on the fuzzy c-means algorithm of Bezdek (1981) and the fuzzy steps approach of Heiser and Groenen (1997) and Groenen and Jajuga (2001). The performances of all methods are compared in a large simulation study. In our simulations, a two-mode k-means optimization method most often gives the best results. Finally, an empirical data set is used to give a practical example
of two-mode partitioning.
We would like to thank two anonymous referees whose comments have improved the quality of this paper. We are also grateful
to Peter Verhoef for providing the data set used in this paper.
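A minimal version of the k-means-type criterion discussed above can be sketched as alternating reassignment of rows and columns to block means, with random restarts to dodge some of the local optima the paper is concerned with. This is an illustrative sketch, not the authors' algorithms:

```python
import numpy as np

def block_means(X, r, c, P, Q):
    """Mean of each (row-cluster, column-cluster) block; 0 for empty blocks."""
    M = np.zeros((P, Q))
    for p in range(P):
        for q in range(Q):
            block = X[np.ix_(r == p, c == q)]
            M[p, q] = block.mean() if block.size else 0.0
    return M

def fit_once(X, P, Q, rng, n_iter=20):
    r = rng.integers(0, P, X.shape[0])          # random starting partition
    c = rng.integers(0, Q, X.shape[1])
    for _ in range(n_iter):
        M = block_means(X, r, c, P, Q)
        # reassign each row to the row cluster with the smallest squared error
        r = np.array([((X[i] - M[:, c]) ** 2).sum(axis=1).argmin()
                      for i in range(X.shape[0])])
        # reassign each column symmetrically
        c = np.array([((X[:, j][:, None] - M[r]) ** 2).sum(axis=0).argmin()
                      for j in range(X.shape[1])])
    M = block_means(X, r, c, P, Q)
    sse = ((X - M[r][:, c]) ** 2).sum()         # k-means-type criterion
    return r, c, sse

def two_mode_kmeans(X, P, Q, n_starts=5, seed=0):
    """Best of several random starts: different starts can reach different
    local optima, which is exactly what the comparison study examines."""
    rng = np.random.default_rng(seed)
    return min((fit_once(X, P, Q, rng) for _ in range(n_starts)),
               key=lambda fit: fit[2])

# Toy data with a clear 2 x 2 block structure plus noise
rng = np.random.default_rng(5)
means = np.array([[0.0, 5.0], [5.0, 0.0]])
rows_true = np.repeat([0, 1], 10)
cols_true = np.repeat([0, 1], 8)
X = means[np.ix_(rows_true, cols_true)] + rng.normal(scale=0.3, size=(20, 16))

r, c, sse = two_mode_kmeans(X, 2, 2)
```

The fuzzy-steps method in the paper replaces these hard reassignments with gradually sharpened fuzzy memberships; the criterion being minimized is the same within-block sum of squares.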
19.
Between-group analysis with heterogeneous covariance matrices: The common principal component model
W. J. Krzanowski 《Journal of Classification》1990,7(1):81-98
Analysis of between-group differences using canonical variates assumes equality of population covariance matrices. Sometimes these matrices are sufficiently different for the null hypothesis of equality to be rejected, but there exist some common features which should be exploited in any analysis. The common principal component model is often suitable in such circumstances, and this model is shown to be appropriate in a practical example. Two methods for between-group analysis are proposed when this model replaces the equal dispersion matrix assumption. One method is by extension of the two-stage approach to canonical variate analysis using sequential principal component analyses as described by Campbell and Atchley (1981). The second method is by definition of a distance function between populations satisfying the common principal component model, followed by metric scaling of the resulting between-populations distance matrix. The two methods are compared with each other and with ordinary canonical variate analysis on the previously introduced data set.
20.
Bilingual term-alignment databases are an important resource in natural language processing and are of great significance for cross-lingual information retrieval, machine translation, and other multilingual applications. Bilingual term pairs are usually obtained through manual translation or by automatic extraction from bilingual parallel corpora. However, manual translation requires domain expertise and is time- and labor-intensive, while large-scale domain-specific bilingual parallel corpora are hard to obtain. Monolingual term banks in the same domain, by contrast, are relatively easy to acquire for each language. This paper therefore proposes a method that automatically aligns terms across two monolingual term banks in different languages to build a bilingual term table. The method first uses several online machine translation engines with a voting mechanism to generate a "pseudo" target-side term, then uses this pseudo term to retrieve a candidate set from the target-side term bank, and finally reranks the candidates with an mBERT-based semantic matching algorithm to obtain the final bilingual term pairs. Experiments on Chinese-English term alignment in three domains (computer science, civil engineering, and medicine) show that the method improves the accuracy of bilingual term extraction.