We construct a weighted Euclidean distance that approximates any distance or dissimilarity measure between individuals that is based on a rectangular cases-by-variables data matrix. In contrast to regular multidimensional scaling methods for dissimilarity data, our approach leads to biplots of individuals and variables while preserving all the good properties of dimension-reduction methods that are based on the singular-value decomposition. The main benefits are the decomposition of variance into components along principal axes, which provide the numerical diagnostics known as contributions, and the estimation of nonnegative weights for each variable. The idea is inspired by the distance functions used in correspondence analysis and in principal component analysis of standardized data, where the normalizations inherent in the distances can be considered as differential weighting of the variables. In weighted Euclidean biplots, we allow these weights to be unknown parameters, which are estimated from the data to maximize the fit to the chosen distances or dissimilarities. These weights are estimated using a majorization algorithm. Once this extra weight-estimation step is accomplished, the procedure follows the classical path in decomposing the matrix and displaying its rows and columns in biplots.
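As a rough illustration of the distance family this abstract describes, the sketch below computes a weighted Euclidean distance with nonnegative per-variable weights. The function names are hypothetical, and the weights here are fixed by hand: the inverse-variance choice corresponds to the standardized-PCA case the abstract mentions, and is not the majorization-based weight estimation, which the abstract does not spell out.

```python
import math

def weighted_euclidean(x, y, weights):
    """Weighted Euclidean distance: sqrt(sum_j w_j * (x_j - y_j)^2)."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def inverse_variance_weights(rows):
    """Weights 1 / s_j^2 (sample variance), i.e. the implicit weighting
    of Euclidean distance on standardized variables. Illustrative only."""
    n = len(rows)
    weights = []
    for col in zip(*rows):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        weights.append(1.0 / var)
    return weights

# Unit weights recover the ordinary Euclidean distance.
d_plain = weighted_euclidean([0, 0], [3, 4], [1, 1])
# A zero weight removes a variable from the distance entirely.
d_first_only = weighted_euclidean([0, 0], [3, 4], [4, 0])
```

Setting a weight to zero drops that variable from the distance, which is why nonnegative weights can act as a soft form of variable selection.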

The class of Schoenberg transformations, embedding Euclidean distances into higher dimensional Euclidean spaces, is presented, and derived from theorems on positive definite and conditionally negative definite matrices. Original results on the arc lengths, angles and curvature of the transformations are proposed, and visualized on artificial data sets by classical multidimensional scaling. A distance-based discriminant algorithm and a robust multidimensional centroid estimate illustrate the theory, closely connected to the Gaussian kernels of Machine Learning.

A validation study of a variable weighting algorithm for cluster analysis
De Soete (1986, 1988) proposed a variable weighting procedure for use when Euclidean distance is the dissimilarity measure in an ultrametric hierarchical clustering method. The algorithm produces weighted distances which approximate ultrametric distances as closely as possible in a least squares sense. The present simulation study examined the effectiveness of the De Soete procedure for an applications problem for which it was not originally intended: determining whether the algorithm can be used to reduce the influence of variables which are irrelevant to the clustering present in the data. The simulation study examined the ability of the procedure to recover a variety of known underlying cluster structures. The results indicate that the algorithm is effective in identifying extraneous variables which do not contribute information about the true cluster structure. Weights near 0.0 were typically assigned to such extraneous variables. Furthermore, the variable weighting procedure was not adversely affected by the presence of other forms of error in the data. In general, it is recommended that the variable weighting procedure be used for applied analyses when Euclidean distance is employed with ultrametric hierarchical clustering methods.

In this study, we consider interval data that summarize original samples (individuals) recorded as classical point data. This type of interval data is termed interval symbolic data in a new research domain called symbolic data analysis. Most existing representations, such as the (centre, radius) and [lower boundary, upper boundary] representations, describe an interval using only its boundaries. However, these representations hold true only under the assumption that the individuals contained in the interval follow a uniform distribution. In practice, such representations may be inconsistent with the facts, since the individuals are usually not uniformly distributed in many applications, and they also lose information because the point data within the intervals are ignored during the calculation. In this study, we propose a new representation of interval symbolic data that takes into account the point data contained in the intervals. We then apply the city-block distance metric to the new representation and propose a dynamic clustering approach for interval symbolic data. A simulation experiment is conducted to evaluate the performance of our method. The results show that, when the individuals contained in the interval do not follow a uniform distribution, the proposed method significantly outperforms the Hausdorff and city-block distances based on the traditional representation in the context of dynamic clustering. Finally, we give an application example on the automobile data set.

Probabilistic D-Clustering
We present a new iterative method for probabilistic clustering of data. Given clusters, their centers and the distances of data points from these centers, the probability of cluster membership at any point is assumed inversely proportional to the distance from (the center of) the cluster in question. This assumption is our working principle. The method is a generalization, to several centers, of the Weiszfeld method for solving the Fermat–Weber location problem. At each iteration, the distances (Euclidean, Mahalanobis, etc.) from the cluster centers are computed for all data points, and the centers are updated as convex combinations of these points, with weights determined by the above principle. Computations stop when the centers stop moving.
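The iteration described in this abstract can be sketched as follows, using Euclidean distances. The membership rule (probability proportional to 1/distance) comes from the abstract; the specific center weights `p**2 / d`, the `eps` guard against zero distances, and all function names are assumptions made for this sketch, since the abstract only says the weights are "determined by the above principle". With a single center, the update reduces to a Weiszfeld-style step with weights 1/d.

```python
import math

def membership_probabilities(distances, eps=1e-12):
    """Probabilities inversely proportional to the distances; they sum to 1."""
    inv = [1.0 / max(d, eps) for d in distances]
    total = sum(inv)
    return [v / total for v in inv]

def update_centers(points, centers, eps=1e-12):
    """One iteration: each new center is a convex combination of the points.
    The weight p^2 / d is an assumed choice for this sketch."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in centers]
    totals = [0.0] * len(centers)
    for x in points:
        dists = [math.dist(x, c) for c in centers]
        probs = membership_probabilities(dists, eps)
        for k, (p, d) in enumerate(zip(probs, dists)):
            u = p * p / max(d, eps)
            totals[k] += u
            for j in range(dim):
                sums[k][j] += u * x[j]
    return [[s / t for s in row] for row, t in zip(sums, totals)]

def d_cluster(points, centers, tol=1e-8, max_iter=500):
    """Iterate until the centers stop moving (largest shift below tol)."""
    for _ in range(max_iter):
        new_centers = update_centers(points, centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers = d_cluster(points, [[1.0, 1.0], [9.0, 9.0]])
```

The stopping rule mirrors the abstract's "computations stop when the centers stop moving"; the tolerance and iteration cap are arbitrary illustrative values.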

The majorization method for multidimensional scaling with Kruskal's STRESS has been limited to Euclidean distances only. Here we extend the majorization algorithm to deal with Minkowski distances with 1 ≤ p ≤ 2 and suggest an algorithm that is partially based on majorization for p outside this range. We give some convergence proofs and extend the zero distance theorem of De Leeuw (1984) to Minkowski distances with p > 1.

In this paper, dissimilarity relations are defined on triples rather than on dyads. We give a definition of a three-way distance analogous to that of the ordinary two-way distance. It is shown, as a straightforward generalization, that it is possible to define three-way ultrametric, three-way star, and three-way Euclidean distances. Special attention is paid to a model called the semi-perimeter model. We construct new methods analogous to the existing ones for ordinary distances, for example principal coordinates analysis, the generalized Prim (1957) algorithm, and hierarchical cluster analysis.

Multidimensional scaling in the city-block metric: A combinatorial approach
We present an approach, independent of the common gradient-based necessary conditions for obtaining a (locally) optimal solution, to multidimensional scaling using the city-block distance function, and implementable in either a metric or nonmetric context. The difficulties encountered in relying on a gradient-based strategy are first reviewed: the general weakness in indicating a good solution that is implied by the satisfaction of the necessary condition of a zero gradient, and the possibility of actual nonconvergence of the associated optimization strategy. To avoid the dependence on gradients for guiding the optimization technique, an alternative iterative procedure is proposed that incorporates (a) combinatorial optimization to construct good object orders along the chosen number of dimensions and (b) nonnegative least-squares to re-estimate the coordinates for the objects based on the object orders. The re-estimated coordinates are used to improve upon the given object orders, which may in turn lead to better coordinates, and so on until convergence of the entire process occurs to a (locally) optimal solution. The approach is illustrated through several data sets on the perception of similarity of rectangles and compared to the results obtained with a gradient-based method.

A study of standardization of variables in cluster analysis
A methodological problem in applied clustering involves the decision of whether or not to standardize the input variables prior to the computation of a Euclidean distance dissimilarity measure. Existing results have been mixed, with some studies recommending standardization and others suggesting that it may not be desirable. The existence of numerous approaches to standardization complicates the decision process. The present simulation study examined the standardization problem. A variety of data structures were generated which varied the intercluster spacing and the scales for the variables. The data sets were examined in four different types of error environments. These involved error free data, error perturbed distances, inclusion of outliers, and the addition of random noise dimensions. Recovery of true cluster structure as found by four clustering methods was measured at the correct partition level and at reduced levels of coverage. Results for eight standardization strategies are presented. It was found that those approaches which standardize by division by the range of the variable gave consistently superior recovery of the underlying cluster structure. The result held over different error conditions, separation distances, clustering methods, and coverage levels. The traditional z-score transformation was found to be less effective in several situations.
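To make the two kinds of standardization contrasted in this abstract concrete, here is a minimal sketch of a z-score transformation and two range-based variants for a single variable. The abstract does not list the eight strategies studied, so the exact formulas and function names below are illustrative assumptions rather than the study's definitions.

```python
import statistics

def zscore(column):
    """Center by the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    if sd == 0:
        raise ValueError("constant variable cannot be standardized")
    return [(v - mean) / sd for v in column]

def divide_by_range(column):
    """Divide each value by the range (max - min) of the variable."""
    spread = max(column) - min(column)
    if spread == 0:
        raise ValueError("constant variable cannot be standardized")
    return [v / spread for v in column]

def min_max(column):
    """Subtract the minimum, then divide by the range; result lies in [0, 1]."""
    lo = min(column)
    spread = max(column) - lo
    if spread == 0:
        raise ValueError("constant variable cannot be standardized")
    return [(v - lo) / spread for v in column]

z = zscore([2, 4, 6])
r = divide_by_range([2, 4, 6])
m = min_max([2, 4, 6])
```

Both range-based variants give every variable the same spread (a range of 1), so Euclidean distances between cases are identical under either; they differ only in where the values are located.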

Classical unidimensional scaling poses a difficult combinatorial task. A procedure formulated as a nonlinear programming (NLP) model is proposed to solve this problem. The new method can be implemented with standard mathematical programming software. Unlike the traditional procedures that minimize either the sum of squared errors (L2 norm) or the sum of absolute errors (L1 norm), the proposed method can minimize the error based on any Lp norm for 1 ≤ p < ∞. Extensions of the NLP formulation to address a multidimensional scaling problem under the city-block model are also discussed.

An approach is presented for analyzing a heterogeneous set of categorical variables assumed to form a limited number of homogeneous subsets. The variables generate a particular set of proximities between the objects in the data matrix, and the objective of the analysis is to represent the objects in low-dimensional Euclidean spaces, where the distances approximate these proximities. A least squares loss function is minimized that involves three major components: (a) the partitioning of the heterogeneous variables into homogeneous subsets; (b) the optimal quantification of the categories of the variables; and (c) the representation of the objects through multiple multidimensional scaling tasks performed simultaneously. An important aspect from an algorithmic point of view is the use of majorization. The use of the procedure is demonstrated by a typical example of possible application, i.e., the analysis of categorical data obtained in a free-sort task. The results of points of view analysis are contrasted with a standard homogeneity analysis, and the stability is studied through a jackknife analysis.

This paper presents the development of a new methodology which simultaneously estimates in a least-squares fashion both an ultrametric tree and respective variable weightings for profile data that have been converted into (weighted) Euclidean distances. We first review the relevant classification literature on this topic. The new methodology is presented including the alternating least-squares algorithm used to estimate the parameters. The method is applied to a synthetic data set with known structure as a test of its operation. An application of this new methodology to ethnic group rating data is also discussed. Finally, extensions of the procedure to model additive, multiple, and three-way trees are mentioned. The first author is supported as Bevoegdverklaard Navorser of the Belgian Nationaal Fonds voor Wetenschappelijk Onderzoek.

By associating a whole distance matrix with a single point in a parameter space, a family of matrices (e.g., all those obeying the triangle inequality) can be shown as a cloud of points. Pictures of the cloud form a family portrait, and its characteristic shape and interrelationship with the portraits of other families can be explored. Critchley (unpublished) used this approach to illustrate, for distances between three points, algebraic results on the nesting relations between various metrics. In this paper, these diagrams are further investigated and then generalized. In the first generalization, projective geometry is used to allow the geometric representation of Additive Mixture, Additive Constant, and Missing Data problems. Then the six-dimensional portraits of four-point distance matrices are studied, revealing differences between the Euclidean, Additive Tree, and General Metric families. The paper concludes with caveats and insights concerning families of general n-point metric matrices.

Metric and Euclidean properties of dissimilarity coefficients
We assemble here properties of certain dissimilarity coefficients and are specially concerned with their metric and Euclidean status. No attempt is made to be exhaustive as far as coefficients are concerned, but certain mathematical results that we have found useful are presented and should help establish similar properties for other coefficients. The response to different types of data is investigated, leading to guidance on the choice of an appropriate coefficient. The authors wish to thank the referees, one of whom did a magnificent job in painstakingly checking the detailed algebra and detecting several slips.

When a dissimilarity matrix cannot be represented in a Euclidean space, it is possible to make it Euclidean by means of suitable transformations of the original dissimilarity values. In this paper we discuss some interesting properties of a class of transformations based on adding a specific squared Euclidean distance to the initial dissimilarity. An erratum to this article is available at .

One key point in cluster analysis is to determine a similarity or dissimilarity measure between data objects. When working with time series, the concept of similarity can be established in different ways. In this paper, several non-parametric statistics originally designed to test the equality of the log-spectra of two stochastic processes are proposed as dissimilarity measures between time series data. Their behavior in time series clustering is analyzed through a simulation study, and compared with the performance of several model-free and model-based dissimilarity measures. Three different classification settings were considered: (i) to distinguish between stationary and non-stationary time series, (ii) to classify different ARMA processes and (iii) to classify several non-linear time series models. As expected, the performance of a particular dissimilarity metric strongly depended on the type of processes subjected to clustering. Among all the measures studied, the nonparametric distances showed the most robust behavior.

We propose a development stemming from Roux (1988). The principle is progressively to modify the dissimilarities so that every quadruple satisfies not only the additive inequality, as in Roux's method, but also all triangle inequalities. Our method thus ensures that the results are tree distances even when the observed dissimilarities are nonmetric. The method relies on the analytic solution of the least-squares projection onto a tree distance of the dissimilarities attached to a single quadruple. This goal is achieved by using geometric reasoning which also enables an easy proof of the algorithm's convergence. This proof is simpler and more complete than that of Roux (1988) and applies to other similar reduction methods based on local least-squares projection. The method is illustrated using Case's (1978) data. Finally, we provide a comparative study with simulated data and show that our method compares favorably with that of Studier and Keppler (1988) which follows in the ADDTREE tradition (Sattath and Tversky 1977). Moreover, this study seems to indicate that our method's results are generally close to the global optimum according to variance accounted for. We offer sincere thanks to Gilles Caraux, Bernard Fichet, Alain Guénoche, and Maurice Roux for helpful discussions, advice, and for reading the preliminary versions of this paper. We are grateful to three anonymous referees and to the editor for many insightful comments. This research was supported in part by the GREG and the IA2 network.

A natural extension of classical metric multidimensional scaling is proposed. The result is a new formulation of nonmetric multidimensional scaling in which the strain criterion is minimized subject to order constraints on the disparity variables. Innovative features of the new formulation include: the parametrization of the p-dimensional distance matrices by the positive semidefinite matrices of rank ≤ p; optimization of the (squared) disparity variables, rather than the configuration coordinate variables; and a new nondegeneracy constraint, which restricts the set of (squared) disparities rather than the set of distances. Solutions are obtained using an easily implemented gradient projection method for numerical optimization. The method is applied to two published data sets.

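Kant's doctrine of space is the cornerstone of his entire critique of theoretical philosophy. The possibility of experience in general, the establishment of the doctrine of synthetic a priori propositions, the twofold division of the constitution of knowledge, and the demarcation between appearances and things in themselves all clearly show the foundational role of the doctrine of space in Kant's theoretical philosophy. Moritz Schlick, leader and theoretical founder of the Vienna Circle, mounted a fierce critique of Kant's doctrine of space in two papers, "Ideality of Space, Introjection and the Psycho-Physical Problem" (1916) and "Space and Time in Contemporary Physics" (1917), and in works including Philosophy of Nature, General Theory of Knowledge, and The Problems of Philosophy in Their Interconnection; the critical passages are scattered across papers and books written between 1910 and 1936. These critical and constructive arguments of Schlick not only dismantled Kant's old dogmas linking space with Euclidean geometry and knowledge with intuition, but also offered new insight into the relation between science and philosophy. They form a noteworthy episode in twentieth-century epistemology and philosophy of mind.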

This paper proposes a measure of spatial homogeneity for sets of d-dimensional points based on nearest neighbor distances. Tests for spatial uniformity are examined which assess the tendency of the entire data set to aggregate and evaluate the character of individual clusters. The sizes and powers of three statistical tests of uniformity against aggregation, regularity, and unimodality are studied to determine robustness. The paper also studies the effects of normalization and incorrect prior information. A percentile frame sampling procedure is proposed that does not require a sampling window but is superior to a toroidal frame and to buffer zone sampling in particular situations. Examples test two data sets for homogeneity and search the results of a hierarchical clustering for homogeneous clusters. This work was partially supported by NSF Grant ECS-8300204.
