Similar literature
1.
Multidimensional scaling in the city-block metric: A combinatorial approach   (total citations: 1; self-citations: 1; citations by others: 0)
We present an approach, independent of the common gradient-based necessary conditions for obtaining a (locally) optimal solution, to multidimensional scaling using the city-block distance function, and implementable in either a metric or nonmetric context. The difficulties encountered in relying on a gradient-based strategy are first reviewed: the general weakness in indicating a good solution that is implied by the satisfaction of the necessary condition of a zero gradient, and the possibility of actual nonconvergence of the associated optimization strategy. To avoid the dependence on gradients for guiding the optimization technique, an alternative iterative procedure is proposed that incorporates (a) combinatorial optimization to construct good object orders along the chosen number of dimensions and (b) nonnegative least-squares to re-estimate the coordinates for the objects based on the object orders. The re-estimated coordinates are used to improve upon the given object orders, which may in turn lead to better coordinates, and so on until convergence of the entire process occurs to a (locally) optimal solution. The approach is illustrated through several data sets on the perception of similarity of rectangles and compared to the results obtained with a gradient-based method.
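As a reading aid for the abstract above, here is a minimal sketch of the two quantities it refers to: the city-block distance between object coordinates and a least-squares loss against given proximities. It does not implement the combinatorial ordering step or the nonnegative least-squares step. The function names and the toy data are assumptions made for illustration, not taken from the paper.

```python
from itertools import combinations


def city_block(x, y):
    """City-block (L1) distance between two coordinate tuples."""
    return sum(abs(a - b) for a, b in zip(x, y))


def least_squares_loss(coords, proximities):
    """Sum of squared differences between the given proximities and the
    city-block distances, taken over all object pairs i < j."""
    loss = 0.0
    for i, j in combinations(range(len(coords)), 2):
        loss += (proximities[i][j] - city_block(coords[i], coords[j])) ** 2
    return loss


# Hypothetical toy data: three objects placed in two dimensions.
coords = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
proximities = [[0, 3, 4], [3, 0, 3], [4, 3, 0]]
perfect_fit = least_squares_loss(coords, proximities)
```

Once the order of the objects along each dimension is fixed, every absolute difference is a sum of nonnegative gaps between adjacent objects, which is presumably why a nonnegative least-squares step is a natural fit for re-estimating coordinates.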

2.
This paper presents a general approach for fitting the ADCLUS (Shepard and Arabie 1979; Arabie, Carroll, DeSarbo, and Wind 1981), INDCLUS (Carroll and Arabie 1983), and potentially a special case of the GENNCLUS (DeSarbo 1982) models. The proposed approach, based largely on a separability property observed for the least squares loss function being optimized, offers increased efficiency and other advantages over existing approaches like MAPCLUS (Arabie and Carroll 1980) for fitting the ADCLUS model, and the INDCLUS method for fitting the INDCLUS model. The new procedure (called SINDCLUS) is applied to three sets of empirical data to demonstrate the effectiveness of the SINDCLUS methodology. Finally, some potentially useful extensions are discussed.

3.
In this paper we propose the concept of structural similarity as a relaxation of blockmodeling in social network analysis. Most previous approaches attempt to relax the constraints on partitions, for instance, that of being a structural or regular equivalence to being approximately structural or regular, respectively. In contrast, our approach is to relax the partitions themselves: structural similarities yield similarity values instead of equivalence or non-equivalence of actors, while strictly obeying the requirement made for exact regular equivalences. Structural similarities are based on a vector space interpretation and yield efficient spectral methods that, in a more restrictive manner, have been successfully applied to difficult combinatorial problems such as graph coloring. While traditional blockmodeling approaches have to rely on local search heuristics, our framework yields algorithms that are provably optimal for specific data-generation models. Furthermore, the stability of structural similarities can be well characterized, making them suitable for the analysis of noisy or dynamically changing network data.

4.
We describe a new wavelet transform, for use on hierarchies or binary rooted trees. The theoretical framework of this approach to data analysis is described. Case studies are used to further exemplify this approach. A first set of application studies deals with data array smoothing, or filtering. A second set of application studies relates to hierarchical tree condensation. Finally, a third study explores the wavelet decomposition, and the reproducibility of data sets such as text, including a new perspective on the generation or computability of such data objects.

5.
L2-norm: (1) dynamic programming; (2) an iterative quadratic assignment improvement heuristic; (3) the Guttman update strategy as modified by Pliner's technique of smoothing; (4) a nonlinear programming reformulation by Lau, Leung, and Tse. The methods are all implemented through (freely downloadable) MATLAB m-files; their use is illustrated by a common data set carried throughout. For the computationally intensive dynamic programming formulation that can guarantee a globally optimal solution, several possible computational improvements are discussed and evaluated using (a) a transformation of a given m-function with the MATLAB Compiler into C code and compiling the latter; (b) rewriting an m-function and a mandatory MATLAB gateway directly in Fortran and compiling into a MATLAB callable file; (c) comparisons of the acceleration of raw m-files implemented under the most recent release of MATLAB Version 6.5 (and compared to the absence of such acceleration under the previous MATLAB Version 6.1). Finally, and in contrast to the combinatorial optimization task of identifying a best unidimensional scaling for a given proximity matrix, an approach is given for the confirmatory fitting of a given unidimensional scaling based only on a fixed object ordering, and to nonmetric unidimensional scaling that incorporates an additional optimal monotonic transformation of the proximities.

6.
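Frame-based Terminology reflects on and critiques the General Theory of Terminology, and is one of the newest schools of descriptive terminology. This article introduces the school's three research foci. First, it advocates organizing concepts around events, which brings the syntactic and combinatorial features of terms into the scope of study. Second, it examines the multidimensionality of terminological concepts and highlights the important role of contextual factors in representing them. Third, it treats specialized corpora as the main source for extracting conceptual knowledge and follows a primarily bottom-up research path.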

7.
Data in many different fields come to practitioners through a process naturally described as functional. We propose a classification procedure of oxidation curves. Our algorithm is based on two stages: fitting the functional data by linear splines with free knots and classifying the estimated knots which estimate useful oxidation parameters. A real data set on 57 oxidation curves is used to illustrate our approach.

8.
The Self-Organizing Feature Maps (SOFM; Kohonen 1984) algorithm is a well-known example of unsupervised learning in connectionism and is a clustering method closely related to the k-means. Generally the data set is available before running the algorithm and the clustering problem can be approached by an inertia criterion optimization. In this paper we consider the probabilistic approach to this problem. We propose a new algorithm based on the Expectation Maximization principle (EM; Dempster, Laird, and Rubin 1977). The new method can be viewed as a Kohonen type of EM and gives a better insight into the SOFM according to constrained clustering. We perform numerical experiments and compare our results with the standard Kohonen approach.

9.
The main aim of this work is the study of clustering dependent data by means of copula functions. Copulas are popular multivariate tools whose importance within clustering methods has not yet been investigated in detail. We propose a new algorithm (CoClust for short) that clusters dependent data according to the multivariate structure of the generating process without any assumption on the margins. Moreover, the approach requires neither a starting classification nor an a priori number of clusters; in fact, CoClust selects them using a criterion based on the log-likelihood of a copula fit. We test our proposal on simulated data for different dependence scenarios and compare it with a model-based clustering technique. Finally, we show applications of CoClust to real microarray data of breast-cancer patients.

10.
Variable Selection for Clustering and Classification   (total citations: 2; self-citations: 2; citations by others: 0)
As data sets continue to grow in size and complexity, effective and efficient techniques are needed to target important features in the variable space. Many of the variable selection techniques that are commonly used alongside clustering algorithms are based upon determining the best variable subspace according to model fitting in a stepwise manner. These techniques are often computationally intensive and can require extended periods of time to run; in fact, some are prohibitively computationally expensive for high-dimensional data. In this paper, a novel variable selection technique is introduced for use in clustering and classification analyses that is both intuitive and computationally efficient. We focus largely on applications in mixture model-based learning, but the technique could be adapted for use with various other clustering/classification methods. Our approach is illustrated on both simulated and real data, highlighted by contrasting its performance with that of other comparable variable selection techniques on the real data sets.

11.
This paper introduces a novel mixture model-based approach to the simultaneous clustering and optimal segmentation of functional data, which are curves presenting regime changes. The proposed model consists of a finite mixture of piecewise polynomial regression models. Each piecewise polynomial regression model is associated with a cluster, and within each cluster, each piecewise polynomial component is associated with a regime (i.e., a segment). We derive two approaches to learning the model parameters: the first is an estimation approach which maximizes the observed-data likelihood via a dedicated expectation-maximization (EM) algorithm, then yielding a fuzzy partition of the curves into K clusters obtained at convergence by maximizing the posterior cluster probabilities. The second is a classification approach and optimizes a specific classification likelihood criterion through a dedicated classification expectation-maximization (CEM) algorithm. The optimal curve segmentation is performed by using dynamic programming. In the classification approach, both the curve clustering and the optimal segmentation are performed simultaneously as the CEM learning proceeds. We show that the classification approach is a probabilistic version generalizing the deterministic K-means-like algorithm proposed in Hébrail, Hugueney, Lechevallier, and Rossi (2010). The proposed approach is evaluated using simulated curves and real-world curves. Comparisons with alternatives including regression mixture models and the K-means-like algorithm for piecewise regression demonstrate the effectiveness of the proposed approach.

12.
There have been many comparative studies of classification methods in which real datasets are used as a gauge to assess the relative performance of the methods. Since these comparisons often yield inconclusive or limited results on how methods perform, it is often believed that a broader approach combining these studies would shed some light on this difficult question. This paper describes such an attempt: we have sampled the available literature and created a dataset of 5807 classification results. We show that one of the possible ways to analyze the resulting data is an overall assessment of the classification methods, and we present methods for that particular aim. The merits and demerits of such an approach are discussed, and conclusions are drawn which may assist future research: we argue that the current state of the literature hardly allows large-scale investigations. This work was sponsored by the MOD Corporate Research Programme, CISP, as part of a larger project on technology assessment. We would like to express our appreciation to Andrew Webb for his support throughout the entire project, and to Wojtek Krzanowski for valuable comments on a draft of this paper. We would also like to thank the anonymous referees for some very interesting comments, some of which we hope to pursue in future work.

13.
This report extends earlier work by Brailovsky on regression theory and methodology, giving particular emphasis to function approximation for incompletely specified models. The interest here is with situations where the form of the regression relation is not known in advance. We discuss several difficulties that arise in using local approximation and linear regression methods, and propose ways to overcome these problems. To aid the data analyst in developing a suitable model, an illustrative table is derived for determining the number of initial explanatory functions justifiable for a given prespecified confidence level. The general approach formulated here is illustrated with an application to medical data. Relevance to classification and possible extensions are discussed.

14.
We discuss the use of orthogonal wavelet transforms in preprocessing multivariate data for subsequent analysis, e.g., by clustering or dimensionality reduction. Wavelet transforms allow us to introduce multiresolution approximation, and multiscale nonparametric regression or smoothing, in a natural and integrated way into the data analysis. As will be explained in the first part of the paper, this approach is of greatest interest for multivariate data analysis when we use (i) datasets with ordered variables, e.g., time series, and (ii) object dimensionalities which are not too small, e.g., 16 and upwards. In the second part of the paper, a different type of wavelet decomposition is used. Applications illustrate the power of this new perspective on data analysis.
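To make the idea of an orthogonal wavelet transform on ordered variables concrete, the sketch below applies an orthonormal Haar transform to a 16-point series. The abstract does not say which wavelet the authors use, so the choice of Haar, the function name `haar_transform`, and the sample series are all assumptions made for illustration.

```python
import math


def haar_transform(signal):
    """Orthonormal Haar wavelet transform of a sequence whose length is a
    power of two. Returns the overall smooth coefficient followed by the
    detail coefficients, ordered from the coarsest to the finest scale."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    smooth = [float(v) for v in signal]
    details = []
    s = 1 / math.sqrt(2)
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        # Prepend this level's details so coarser scales end up first.
        details = [(a - b) * s for a, b in pairs] + details
        smooth = [(a + b) * s for a, b in pairs]
    return smooth + details


# Hypothetical ordered observation with 16 variables (e.g. a short time series).
series = [2, 4, 6, 8, 10, 12, 14, 16, 15, 13, 11, 9, 7, 5, 3, 1]
coeffs = haar_transform(series)
```

Because the transform is orthogonal, squared Euclidean distances between objects are unchanged by it, so a clustering run on all coefficients matches one run on the raw data; any gain would come from discarding or shrinking fine-scale detail coefficients before the analysis.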

15.
The use of Candecomp to fit scalar products in the context of Indscal is based on the assumption that, due to the symmetry of the data matrices involved, two component matrices will become equal when Candecomp converges. Bennani Dosse and Ten Berge (2008) have shown that, in the single component case, the assumption can only be violated at saddle points in the case of Gramian matrices. This paper again considers Candecomp applied to symmetric matrices, but with an orthonormality constraint on the components. This constrained version of Candecomp, when applied to symmetric matrices, has long been known under the acronym Indort. When the data matrices are positive definite, or have become positive semidefinite due to double centering, and the saliences are nonnegative (by chance or by constraint), the component matrices resulting from Indort are shown to be equal. Because Indort is also free from so-called degeneracy problems, it is a highly attractive alternative to Candecomp in the present context. We also consider a well-known successive approach to the orthogonally constrained Indscal problem and we compare, from simulated and real data sets, its results with those given by the simultaneous (Indort) approach.

16.
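Within the framework of the capability approach, the China portion of the World Values Survey data is analyzed and seven representative capabilities are identified. Subjective well-being is taken as the measure of a happy life and the capability indicators as explanatory variables, and the relationship is tested with descriptive statistics and regression analysis. The results show that the representative capabilities have a significant effect on the subjective well-being of Chinese citizens. Economic satisfaction, health status, and freedom of choice contribute the most and should be given priority. Education plays its role only once income is assured.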

17.
The rapid increase in the size of data sets makes clustering all the more important to capture and summarize the information, at the same time making clustering more difficult to accomplish. If model-based clustering is applied directly to a large data set, it can be too slow for practical application. A simple and common approach is to first cluster a random sample of moderate size, and then use the clustering model found in this way to classify the remainder of the objects. We show that, in its simplest form, this method may lead to unstable results. Our experiments suggest that a stable method with better performance can be obtained with two straightforward modifications to the simple sampling method: several tentative models are identified from the sample instead of just one, and several EM steps are used rather than just one E step to classify the full data set. We find that there are significant gains from increasing the size of the sample up to about 2,000, but not from further increases. These conclusions are based on the application of several alternative strategies to the segmentation of three different multispectral images, and to several simulated data sets.

18.
A mathematical programming approach to fitting general graphs   (total citations: 1; self-citations: 1; citations by others: 0)
We present an algorithm for fitting general graphs to proximity data. The algorithm utilizes a mathematical programming procedure based on a penalty function approach to impose additivity constraints upon parameters. For a user-specified number of links, the algorithm seeks to provide the connected network that gives the least-squares approximation to the proximity data with the specified number of links, allowing for linear transformations of the data. The network distance is the minimum-path-length metric for connected graphs. As a limiting case, the algorithm provides a tree where each node corresponds to an object, if the number of links is set equal to the number of objects minus one. A Monte Carlo investigation indicates that the resulting networks tend to fall within one percentage point of the least-squares solution in terms of the variance accounted for, but do not always attain this global optimum. The network model is discussed in relation to ordinal network representations (Klauer 1989) and NETSCAL (Hutchinson 1989), and applied to several well-known data sets.
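The minimum-path-length metric mentioned in the abstract above can be computed for any connected network with a standard all-pairs shortest-path routine. The sketch below uses the Floyd-Warshall algorithm, which is a generic choice and not the paper's fitting procedure. The function name `min_path_lengths` and the four-node network are hypothetical.

```python
import math


def min_path_lengths(n, links):
    """All-pairs minimum path lengths (Floyd-Warshall) for an undirected
    graph on nodes 0..n-1; `links` maps (i, j) pairs to positive lengths."""
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), w in links.items():
        d[i][j] = d[j][i] = min(d[i][j], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Route i -> k -> j replaces the current entry if it is shorter.
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


# Hypothetical connected network: 4 objects joined by 4 links.
links = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 1.5, (0, 3): 5.0}
dist = min_path_lengths(4, links)
```

In this toy network the direct link between objects 0 and 3 has length 5.0, but the path through objects 1 and 2 is shorter, so the network distance between them is 4.5.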

19.
We present an alternative approach to Multiple Correspondence Analysis (MCA) that is appropriate when the data consist of ordered categorical variables. MCA displays objects (individuals, units) and variables as individual points and sets of category points in a low-dimensional space. We propose a hybrid decomposition on the basis of the classical indicator super-matrix, using the singular value decomposition, and the bivariate moment decomposition by orthogonal polynomials. When compared to standard MCA, the hybrid decomposition will give the same representation of the categories of the variables, but additionally, we obtain a clear association interpretation among the categories in terms of linear, quadratic and higher order components. Moreover, the graphical display of the individual units will show an automatic clustering.

20.
The Neighbor-Joining (NJ) method of Saitou and Nei is the most widely used distance-based method in phylogenetic analysis. Central to the method is the selection criterion, the formula used to choose which pair of objects to amalgamate next. Here we analyze the NJ selection criterion using an axiomatic approach. We show that any selection criterion that is linear, permutation equivariant, statistically consistent and based solely on distance data will give the same trees as those created by NJ.
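For readers unfamiliar with the selection criterion discussed above, the sketch below evaluates the usual NJ formula, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), and returns the pair with the smallest value. It covers only the selection step, not the full tree construction. The function name `nj_select` and the five-object distance matrix are illustrative assumptions.

```python
from itertools import combinations


def nj_select(d):
    """Return the pair (i, j) minimizing the NJ criterion
    Q(i, j) = (n - 2) * d[i][j] - sum_k d[i][k] - sum_k d[j][k],
    together with the minimal Q value. Ties go to the first pair found."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    best_pair, best_q = None, None
    for i, j in combinations(range(n), 2):
        q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
        if best_q is None or q < best_q:
            best_pair, best_q = (i, j), q
    return best_pair, best_q


# Illustrative symmetric distance matrix on five objects.
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, q_value = nj_select(d)  # pair == (0, 1), q_value == -50
```

Objects 3 and 4 are the closest pair in raw distance (3), yet the criterion selects objects 0 and 1, because Q also subtracts each object's total distance to all the others.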
