Similar Literature
20 similar documents found (search time: 392 ms)
1.
Classifiers serve as tools for assigning data to classes. They directly or indirectly take the distribution of data points around a given query point into account. To express the distribution of points in terms of their distances from a given point, a probability distribution mapping function is introduced here. An approximation of this function by a suitable power of the distance is presented, and a method for determining this power, the distribution mapping exponent, is described. This exponent is used for probability density estimation in high-dimensional spaces and for classification. A close relation of the exponent to a singularity exponent is discussed. It is also shown that the resulting classifier achieves better classification accuracy than other kinds of classifiers on some tasks.
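The exponent-fitting idea can be sketched as follows. This is an illustrative reconstruction, not the paper's exact estimator: it assumes the number of neighbours within distance r of the query grows roughly as r**q, so q can be read off as the slope of log(rank) versus log(distance) over the nearest neighbours. The function name is ours.

```python
import numpy as np

def distribution_mapping_exponent(points, query, k=20):
    """Estimate a distribution-mapping-style exponent q at a query point.

    Assumes the cumulative neighbour count grows like r**q, so q is the
    slope of log(rank) vs. log(distance) over the k nearest neighbours.
    (Illustrative sketch, not the paper's exact method.)
    """
    d = np.sort(np.linalg.norm(points - query, axis=1))
    d = d[d > 0][:k]                      # drop the query itself if present
    ranks = np.arange(1, len(d) + 1)
    slope, _intercept = np.polyfit(np.log(d), np.log(ranks), 1)
    return slope

# In a locally uniform 2-D cloud the exponent should be close to the
# intrinsic dimension, here 2.
rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(5000, 2))
q = distribution_mapping_exponent(pts, np.zeros(2), k=50)
```

For uniform data the estimate recovers the local dimension, which is what links the exponent to density estimation in high-dimensional spaces.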

2.
Parameters of the distributions of three similarity coefficients between pairs (dyads) of operational taxonomic units are derived for multivariate binary data (presence/absence of attributes) under statistical independence. These are applied to test independence for dyadic data. Association among attributes within operational taxonomic units is allowed, and the two units in a dyad may be drawn from different populations with different attribute presence probabilities. The variance of the similarity coefficients under statistical independence is shown to be relatively large in many empirical situations, which implies that the practical interpretation of these coefficients requires much care. An application using the Jaccard index is given for the assessment of consensus between psychotherapists and their clients.
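A minimal sketch of the setting: the Jaccard index between two presence/absence vectors, with its null distribution under independence approximated by Monte Carlo permutation. The permutation scheme is a simple stand-in for the analytic moments derived in the paper; it illustrates why the null variance can be large enough to make interpretation delicate.

```python
import numpy as np

def jaccard(u, v):
    """Jaccard similarity for binary presence/absence vectors."""
    u, v = np.asarray(u, bool), np.asarray(v, bool)
    union = np.logical_or(u, v).sum()
    return np.logical_and(u, v).sum() / union if union else 0.0

# Two units of a dyad, drawn from populations with different presence
# probabilities of attributes (0.4 vs. 0.6), as the paper permits.
rng = np.random.default_rng(1)
u = rng.random(40) < 0.4
v = rng.random(40) < 0.6

# Null distribution under statistical independence, approximated by
# permuting one member of the dyad.
null = np.array([jaccard(rng.permutation(u), v) for _ in range(2000)])
observed = jaccard(u, v)
p_value = float(np.mean(null >= observed))
```

The spread of `null` is the practical point of the paper: an observed coefficient must be judged against this variance, not taken at face value.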

3.
Cross-sectional approach for clustering time varying data   (cited 2 times in total: 0 self-citations, 2 by others)
Cluster analysis is to be performed on a three-mode data matrix of type: units, variables, time. A general model for calculating the distance between two units varying in time is proposed. One particular model is developed and used in an example concerned with clustering 23 European countries according to the similarity of their energy consumption in the years 1976–1982. (Supported in part by the Research Council of Slovenia, Yugoslavia.)
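One simple special case of such a cross-sectional distance model can be sketched as a weighted sum of per-occasion distances between two units; this is an illustrative reconstruction, not necessarily the particular model developed in the paper, and the names are ours.

```python
import numpy as np

def cross_sectional_distance(X_i, X_j, weights=None):
    """Distance between two units observed over T occasions.

    X_i, X_j : arrays of shape (T, p), one row of variable values per
    occasion (e.g. yearly energy-consumption indicators).  The dissimilarity
    is a weighted sum of per-occasion (cross-sectional) Euclidean distances;
    uniform weights give one simple special case of the general model.
    """
    X_i, X_j = np.asarray(X_i, float), np.asarray(X_j, float)
    per_occasion = np.linalg.norm(X_i - X_j, axis=1)
    if weights is None:
        weights = np.full(len(per_occasion), 1.0 / len(per_occasion))
    return float(np.dot(weights, per_occasion))

# Two units described by 3 variables over 7 occasions (synthetic data).
rng = np.random.default_rng(2)
unit_a = rng.random((7, 3))
unit_b = rng.random((7, 3))
d_ab = cross_sectional_distance(unit_a, unit_b)
```

The pairwise distance matrix produced this way can feed any standard clustering algorithm over the units.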

4.
Multiple-choice items on tests and Likert items on surveys are ubiquitous in educational, social, and behavioral science research; however, methods for analyzing such data can be problematic. Multidimensional item response theory models are proposed that yield structured Poisson regression models for the joint distribution of responses to items. The methodology presented here extends the approach described in Anderson, Verkuilen, and Peyton (2010), which used fully conditionally specified multinomial logistic regression models as item response functions. In this paper, covariates are added as predictors of the latent variables, along with covariates as predictors of location parameters. Furthermore, the models presented here incorporate ordinal information about the response options, allowing an empirical examination of assumptions regarding their ordering and the estimation of optimal scoring of the response options. To illustrate the methodology and flexibility of the models, data from a study on aggression in middle school (Espelage, Holt, and Henkel 2004) are analyzed. The models are fit to the data using SAS.

5.
This paper aims at combining the properties of factorial subspaces with those of the Minimum Spanning Tree (MST) algorithm to obtain a reference structure (the Maximum Path) in which the statistical units are in a reduced subordering. The coordinates (factor scores) of the statistical units in a multi-factorial subspace obtained through Principal Component Analysis are the basis for the Minimum Spanning Tree. In the MST, we single out a path of maximum length. On the Maximum Path, each unit's position can be used as a Synthetic Index of the phenomenon analyzed. Two distinct strategies lead to the choice of the subspace in which we have the best representation of the units on the Maximum Path. The validity of the method is confirmed by results achieved in various applications to real data.

6.
This paper analyzes the distribution of single-word terms and phrase (multi-word) terms in the terminology database GLOT-C, and attempts to explain theoretically the important terminological phenomenon that phrase terms make up the majority of a terminological system. On this basis, an "economic law of term formation" is proposed, and the FEL formula is used to describe this law.

8.
A two-level data set consists of entities of a higher level (say populations), each one being composed of several units of the lower level (say individuals). Observations are made at the individual level, whereas population characteristics are aggregated from individual data. Cluster analysis with subsampling of populations is a cluster analysis based on individual data that aims at clustering populations rather than individuals. In this article, we extend existing optimality criteria for cluster analysis with subsampling of populations to deal with situations where population characteristics are not the mean of individual data. A new criterion that depends on the Mahalanobis distance is also defined. The criteria are compared using simulated examples and an ecological data set of tree species in a tropical rain forest.

9.
We present an alternative approach to Multiple Correspondence Analysis (MCA) that is appropriate when the data consist of ordered categorical variables. MCA displays objects (individuals, units) and variables as individual points and sets of category points in a low-dimensional space. We propose a hybrid decomposition on the basis of the classical indicator super-matrix, using the singular value decomposition, and the bivariate moment decomposition by orthogonal polynomials. When compared to standard MCA, the hybrid decomposition will give the same representation of the categories of the variables, but additionally, we obtain a clear association interpretation among the categories in terms of linear, quadratic and higher order components. Moreover, the graphical display of the individual units will show an automatic clustering.

10.
The distribution of lengths of phylogenetic trees under the taxonomic principle of parsimony is compared with the distribution obtained by randomizing the characters of the sequence data. This comparison allows us to define a measure of the extent to which sequence data contain significant hierarchical information. We show how to calculate this measure exactly for up to 10 taxa, and provide a good approximation for larger sets of taxa. The measure is applied to test sequences on 10 and 15 taxa.
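The randomization idea can be sketched in a few lines: compute the parsimony length of an alignment on a tree (here via the classic Fitch algorithm, which the paper does not necessarily use), then shuffle each character column across taxa and recompute, yielding the randomized length distribution against which the observed length is compared. The tiny four-taxon alignment is our own toy example.

```python
import random

def fitch_length(tree, site):
    """Minimum number of state changes for one character on a rooted
    binary tree (Fitch parsimony).  `tree` is a nested tuple of taxon
    names; `site` maps taxon -> character state.
    """
    def post(node):
        if isinstance(node, str):
            return {site[node]}, 0
        (sl, cl), (sr, cr) = post(node[0]), post(node[1])
        inter = sl & sr
        return (inter, cl + cr) if inter else (sl | sr, cl + cr + 1)
    return post(tree)[1]

def tree_length(tree, alignment):
    taxa = list(alignment)
    return sum(fitch_length(tree, dict(zip(taxa, col)))
               for col in zip(*(alignment[t] for t in taxa)))

def shuffle_columns(alignment):
    """Randomize each character column independently across taxa."""
    taxa = list(alignment)
    cols = []
    for col in zip(*(alignment[t] for t in taxa)):
        col = list(col)
        random.shuffle(col)
        cols.append(col)
    return {t: "".join(row) for t, row in zip(taxa, zip(*cols))}

random.seed(6)
tree = (("a", "b"), ("c", "d"))
alignment = {"a": "AAAA", "b": "AAGA", "c": "GGGA", "d": "GGAA"}
observed = tree_length(tree, alignment)
null = [tree_length(tree, shuffle_columns(alignment)) for _ in range(500)]
fraction_as_short = sum(l <= observed for l in null) / len(null)
```

A small `fraction_as_short` indicates the data are far shorter on the tree than randomized characters, i.e. they carry significant hierarchical signal.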

11.
OMC is the automatic hierarchical classification of one mode (units, variables, or occasions) of a three-mode data set X on the basis of the other two. In this paper the case of OMC of units according to variables and occasions is discussed. OMC is the synthesis of a set of hierarchical classifications Delta obtained from X; e.g., the OMC of units is the consensus (synthesis) among the set of dendrograms individually defined by clustering units on the basis of variables, separately for each given occasion of X. However, because Delta is often formed by a large number of classifications, it may be unrealistic to expect a single synthesis to be representative of the entire set. In this case, subsets of similar (homogeneous) dendrograms may be found in Delta so that a consensus representative of each subset may be identified. This paper proposes PARtition and Least Squares Consensus cLassifications Analysis (PARLSCLA) of a set of r hierarchical classifications Delta. PARLSCLA identifies the best least-squares partition of Delta into m (1 <= m <= r) subsets of homogeneous dendrograms and simultaneously detects the closest consensus classification (a median classification called the Least Squares Consensus Dendrogram, LSCD) for each subset. PARLSCLA is a generalization of the problem of finding a single least-squares consensus dendrogram for Delta. It is formalized as a mixed-integer programming problem and solved with an iterative, two-step algorithm. The method proposed is applied to an empirical data set.

12.
We describe a new wavelet transform, for use on hierarchies or binary rooted trees. The theoretical framework of this approach to data analysis is described. Case studies are used to further exemplify this approach. A first set of application studies deals with data array smoothing, or filtering. A second set of application studies relates to hierarchical tree condensation. Finally, a third study explores the wavelet decomposition, and the reproducibility of data sets such as text, including a new perspective on the generation or computability of such data objects.

13.
The location model is a useful tool in parametric analysis of mixed continuous and categorical variables. In this model, the continuous variables are assumed to follow different multivariate normal distributions for each possible combination of categorical variable values. Using this model, a distance between two populations involving mixed variables can be defined. To date, however, no distributional results have been available, against which to assess the outcomes of practical applications of this distance. The null distribution of estimated distance is therefore considered in this paper, for a range of possible situations. No explicit analytical expressions are derived for this distribution, but easily implementable Monte Carlo schemes are described. These are then applied to previously cited examples.
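The flavour of such a Monte Carlo scheme can be shown with a deliberately simplified, continuous-only stand-in for the mixed-variable location model: draw both samples from the same multivariate normal (so the true distance is zero) and record the estimated Mahalanobis distance, building up the null distribution against which an observed distance could be assessed. This is our illustration of the general idea, not the paper's specific scheme.

```python
import numpy as np

def null_distances(n1, n2, p, n_sim=500, rng=None):
    """Monte Carlo null distribution of the estimated Mahalanobis
    distance between two samples drawn from the *same* p-variate normal.
    Under the null the true distance is zero, so the simulated values
    show how large the estimate can be by chance alone.
    """
    if rng is None:
        rng = np.random.default_rng(2)
    out = np.empty(n_sim)
    for s in range(n_sim):
        x = rng.standard_normal((n1, p))
        y = rng.standard_normal((n2, p))
        pooled = np.cov(np.vstack([x, y]).T)          # pooled covariance
        diff = x.mean(axis=0) - y.mean(axis=0)
        out[s] = np.sqrt(diff @ np.linalg.solve(pooled, diff))
    return out

null = null_distances(n1=30, n2=30, p=4)
critical_95 = float(np.quantile(null, 0.95))
```

An observed between-population distance exceeding `critical_95` would then be judged unlikely under the null of identical populations.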

14.
In this study, we consider interval data that summarize original samples (individuals) given as classical point data. Such intervals are termed interval symbolic data in the research domain known as symbolic data analysis. Most existing representations, such as (centre, radius) and [lower boundary, upper boundary], describe an interval using only its boundaries. These representations, however, hold true only under the assumption that the individuals contained in the interval follow a uniform distribution. In practice, individuals are often not uniformly distributed, so such representations may both misrepresent the data and lose information by ignoring the point data within the intervals. In this study, we propose a new representation of interval symbolic data that takes the point data contained in the intervals into account. We then apply the city-block distance metric to the new representation and propose a dynamic clustering approach for interval symbolic data. A simulation experiment is conducted to evaluate the performance of our method. The results show that, when the individuals contained in an interval do not follow a uniform distribution, the proposed method significantly outperforms the Hausdorff and city-block distances based on the traditional representation in the context of dynamic clustering. Finally, we give an application example on an automobile data set.
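One way to make an interval representation sensitive to the points it summarizes, sketched here as an assumption about the paper's idea rather than its exact construction, is to augment the boundaries with inner quantiles of the contained point data, and then compare representations with the city-block (L1) metric.

```python
import numpy as np

def interval_representation(points, n_inner=3):
    """Represent an interval by its boundaries plus n_inner inner
    quantiles of the point data it summarises, so that non-uniformly
    distributed individuals are reflected in the representation.
    (A sketch of the idea; the paper's exact construction may differ.)
    """
    qs = np.linspace(0.0, 1.0, n_inner + 2)   # includes min and max
    return np.quantile(np.asarray(points, float), qs)

def city_block(rep_a, rep_b):
    """City-block (L1) distance between two interval representations."""
    return float(np.abs(np.asarray(rep_a) - np.asarray(rep_b)).sum())

# Two intervals whose boundary descriptions could look similar, but whose
# inner point data are distributed very differently.
rng = np.random.default_rng(3)
rep_skewed = interval_representation(rng.exponential(1.0, 200))
rep_uniform = interval_representation(rng.uniform(0.0, 1.0, 200))
d = city_block(rep_skewed, rep_uniform)
```

Because the inner quantiles differ for skewed versus uniform point clouds, the distance separates intervals that a boundaries-only representation would treat as nearly identical.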

15.
The 26th General Conference on Weights and Measures approved a revision of the International System of Units (SI) in which all units are defined in terms of natural constants. The new definitions no longer depend on any physical artifact standards, making the SI more stable and more universal. This article reviews the evolution of the SI and describes the content of the redefined system.

16.
Based on Chinese patent data from the United States Patent Full-Text Database (US-PTO), this paper studies the temporal distribution of scientific journal papers cited in patents, adopting a lognormal distribution model to quantitatively describe the age distribution of journal citations in patents. The fit is good. Two parameters, the maximum citation age and the mean citation age, are introduced to characterize the temporal relationship between patents and the journal papers they cite.
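Fitting a lognormal to citation ages reduces to estimating the mean and standard deviation of the log ages; the sketch below does this on synthetic ages standing in for the US-PTO citation data, and also reports the two summary parameters the abstract mentions.

```python
import numpy as np

def fit_lognormal(ages):
    """Fit a lognormal model to citation ages (years between a patent and
    the journal papers it cites) by the method of moments on log ages.
    Returns (mu, sigma), the parameters of the underlying normal.
    """
    logs = np.log(np.asarray(ages, float))
    return float(logs.mean()), float(logs.std(ddof=1))

# Synthetic citation ages (the real study uses US-PTO citation data).
rng = np.random.default_rng(4)
ages = rng.lognormal(mean=1.5, sigma=0.6, size=1000)
mu, sigma = fit_lognormal(ages)

# The two summary parameters from the abstract.
mean_citation_age = float(ages.mean())
max_citation_age = float(ages.max())
```

With a good lognormal fit, `mu` and `sigma` fully describe the citation-age curve, while the mean and maximum ages give the interpretable time lags.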

17.
Data in many different fields come to practitioners through a process naturally described as functional. We propose a classification procedure for oxidation curves. Our algorithm is based on two stages: fitting the functional data by linear splines with free knots, and classifying the estimated knots, which estimate useful oxidation parameters. A real data set of 57 oxidation curves is used to illustrate our approach.

18.
We describe a simple time series transformation to detect differences between series that can be accurately modelled as stationary autoregressive (AR) processes. The transformation forms a histogram of the lengths of runs above and below the mean. The run length (RL) transformation has the benefits of being very fast, compact, and updatable for new data in constant time. Furthermore, it can be generated directly from data that has already been highly compressed. We first establish the theoretical asymptotic relationship between run length distributions and AR models through consideration of the zero-crossing probability and the distribution of runs. We benchmark our transformation against two alternatives: the truncated autocorrelation function (ACF) transform and the AR transformation, which involves the standard method of fitting the partial autocorrelation coefficients with the Durbin-Levinson recursions and the Akaike Information Criterion stopping procedure. Whilst optimal in the idealized scenario, representing the data in these ways is time consuming and the representation cannot be updated online for new data. We show that for classification problems the accuracy obtained using the run length distribution tends towards that obtained from the full fitted models. We then propose three alternative distance measures for run length distributions based on Gower's general similarity coefficient, the likelihood ratio, and dynamic time warping (DTW). Through simulated classification experiments we show that a nearest-neighbour distance based on DTW converges to the optimal faster than classifiers based on Euclidean distance, Gower's coefficient, and the likelihood ratio.
We experiment with a variety of classifiers and demonstrate that although the RL transform requires more data than the best performing classifier to achieve the same accuracy as AR or ACF, this factor is at worst non-increasing with the series length, m, whereas the relative time taken to fit AR and ACF increases with m. We conclude that if the data is stationary and can be suitably modelled by an AR series, and if time is an important factor in reaching a discriminatory decision, then the run length distribution transform is a simple and effective transformation to use.
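The RL transform itself is only a few lines: mark each observation as above or below the series mean, collect the lengths of consecutive runs, and histogram them (with a final shared bin for long runs so the representation has fixed size). The capping choice and bin count here are our assumptions.

```python
import numpy as np
from itertools import groupby

def run_length_histogram(series, max_run=10):
    """Run-length (RL) transform: normalised histogram of the lengths of
    consecutive runs above or below the series mean.  Runs longer than
    max_run share the final bin so the output has fixed size.
    """
    above = np.asarray(series, float) > np.mean(series)
    lengths = [min(len(list(g)), max_run) for _, g in groupby(above)]
    hist = np.bincount(lengths, minlength=max_run + 1)[1:]
    return hist / hist.sum()

# A positively autocorrelated AR(1) series has longer runs than noise.
rng = np.random.default_rng(5)
noise = rng.standard_normal(2000)
ar = np.empty(2000)
ar[0] = 0.0
for t in range(1, 2000):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()

h_noise = run_length_histogram(noise)
h_ar = run_length_histogram(ar)
```

The two histograms differ sharply in the short-run bins, which is the signal the distance measures (Gower, likelihood ratio, DTW) exploit for classification.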

19.
Two algorithms for pyramidal classification (a generalization of hierarchical classification) are presented that can work with incomplete dissimilarity data. These approaches, a modification of the pyramidal ascending classification algorithm and a least-squares based penalty method, are described and compared using two different types of complete dissimilarity data in which randomly chosen dissimilarities are assumed missing and the non-missing ones are subjected to random error. We also consider relationships between hierarchical and pyramidal classification solutions when both are based on incomplete dissimilarity data.

20.
The framework of this paper is statistical data editing: specifically, how to edit or impute missing or contradictory data, and how to merge two independent data sets, each with partially missing information. Assuming a missing-at-random mechanism, this paper provides an accurate tree-based methodology for both missing-data imputation and data fusion that is justified within the Statistical Learning Theory of Vapnik. It considers an incremental variable imputation method to improve computational efficiency, as well as boosted trees to gain prediction accuracy with respect to other methods. As a result, the best approximation of the structural risk (also known as the irreducible error) is reached, thus minimizing the generalization (prediction) error of imputation. Moreover, the method is distribution-free: it holds independently of the underlying probability law generating the missing data values. Performance is analyzed on simulation case studies and real-world applications.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号