Found 20 similar documents (search time: 421 ms)
1.
We present an approach, independent of the common gradient-based necessary conditions for obtaining a (locally) optimal solution,
to multidimensional scaling using the city-block distance function, and implementable in either a metric or nonmetric context.
The difficulties encountered in relying on a gradient-based strategy are first reviewed: the general weakness in indicating
a good solution that is implied by the satisfaction of the necessary condition of a zero gradient, and the possibility of
actual nonconvergence of the associated optimization strategy. To avoid the dependence on gradients for guiding the optimization
technique, an alternative iterative procedure is proposed that incorporates (a) combinatorial optimization to construct good
object orders along the chosen number of dimensions and (b) nonnegative least-squares to re-estimate the coordinates for the
objects based on the object orders. The re-estimated coordinates are used to improve upon the given object orders, which may
in turn lead to better coordinates, and so on until convergence of the entire process occurs to a (locally) optimal solution.
The approach is illustrated through several data sets on the perception of similarity of rectangles and compared to the results
obtained with a gradient-based method.
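The quantities this alternating procedure works with, city-block distances and the least-squares loss between model distances and proximities, can be illustrated with a minimal sketch (illustrative helper names, not the authors' code):

```python
def cityblock(p, q):
    """City-block (L1) distance between two coordinate vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

def ls_loss(coords, prox):
    """Least-squares loss between model city-block distances and proximities.

    coords: list of coordinate vectors, one per object
    prox:   dict mapping (i, j) with i < j to the observed proximity
    """
    return sum((prox[(i, j)] - cityblock(coords[i], coords[j])) ** 2
               for (i, j) in prox)

# Toy example: three objects in two dimensions, fit perfectly.
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
prox = {(0, 1): 1.0, (0, 2): 3.0, (1, 2): 2.0}
```

In the full method, the combinatorial step permutes objects along each dimension and the nonnegative least-squares step re-estimates `coords` so as to decrease this loss.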
2.
An alternating combinatorial optimization approach to fitting the INDCLUS and generalized INDCLUS models
This paper presents a general approach for fitting the ADCLUS (Shepard and Arabie 1979; Arabie, Carroll, DeSarbo, and Wind 1981), INDCLUS (Carroll and Arabie 1983), and potentially a special case of the GENNCLUS (DeSarbo 1982) models. The proposed approach, based largely on a separability property observed for the least squares loss function being optimized, offers increased efficiency and other advantages over existing approaches like MAPCLUS (Arabie and Carroll 1980) for fitting the ADCLUS model, and the INDCLUS method for fitting the INDCLUS model. The new procedure (called SINDCLUS) is applied to three sets of empirical data to demonstrate the effectiveness of the SINDCLUS methodology. Finally, some potentially useful extensions are discussed.
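For orientation, the ADCLUS model expresses a similarity as a weighted sum over shared cluster memberships plus an additive constant, s_ij = Σ_k w_k p_ik p_jk + c. A minimal sketch of the model-implied similarities (a reconstruction of the model equation, not the SINDCLUS fitting procedure):

```python
def adclus_similarity(P, w, c):
    """Model-implied similarities under the ADCLUS model:
        s_ij = sum_k w_k * p_ik * p_jk + c,
    where P is a 0/1 membership matrix (objects x clusters),
    w the nonnegative cluster weights, and c an additive constant."""
    n = len(P)
    S = [[c] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k, wk in enumerate(w):
                S[i][j] += wk * P[i][k] * P[j][k]
    return S

# Two overlapping clusters over three objects.
P = [[1, 0], [1, 1], [0, 1]]
w = [2.0, 3.0]
S = adclus_similarity(P, w, 0.5)
```

Fitting reverses this computation: given observed similarities, the algorithm estimates P, w, and c in a least-squares sense, and the separability property mentioned above lets the memberships be updated efficiently.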
3.
In this paper we propose the concept of structural similarity as a relaxation of blockmodeling in social network analysis. Most previous approaches attempt to relax the constraints on partitions, for instance, that of being a structural or regular equivalence to being approximately structural or regular, respectively. In contrast, our approach is to relax the partitions themselves: structural similarities yield similarity values instead of equivalence or non-equivalence of actors, while strictly obeying the requirement made for exact regular equivalences. Structural similarities are based on a vector space interpretation and yield efficient spectral methods that, in a more restrictive manner, have been successfully applied to difficult combinatorial problems such as graph coloring. While traditional blockmodeling approaches have to rely on local search heuristics, our framework yields algorithms that are provably optimal for specific data-generation models. Furthermore, the stability of structural similarities can be well characterized making them suitable for the analysis of noisy or dynamically changing network data.
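To give a flavor of the spectral viewpoint (this is only an illustration of spectral position scores, not the paper's structural-similarity construction), power iteration recovers a dominant eigenvector of a network matrix, and structurally interchangeable actors receive identical scores:

```python
def power_iteration(M, steps=200):
    """Dominant eigenvector of a nonnegative symmetric matrix M,
    by repeated multiplication and max-norm rescaling."""
    n = len(M)
    v = [1.0] * n
    for _ in range(steps):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = max(abs(x) for x in w) or 1.0
        v = [x / norm for x in w]
    return v

# Path graph 0-1-2, with self-loops added (M = A + I) so the
# iteration converges; the two end actors play identical roles
# and get identical eigenvector scores.
M = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
v = power_iteration(M)
```

Here `v[0] == v[2]`, reflecting that the end actors of the path are equivalent, while the middle actor scores higher.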
4.
Fionn Murtagh 《Journal of Classification》2007,24(1):3-32
We describe a new wavelet transform, for use on hierarchies or binary rooted trees. The theoretical framework of this approach
to data analysis is described. Case studies are used to further exemplify this approach. A first set of application studies
deals with data array smoothing, or filtering. A second set of application studies relates to hierarchical tree condensation.
Finally, a third study explores the wavelet decomposition, and the reproducibility of data sets such as text, including a
new perspective on the generation or computability of such data objects.
5.
L2-norm: (1)
dynamic programming; (2) an iterative quadratic assignment improvement
heuristic; (3) the Guttman update strategy as modified by Pliner's technique
of smoothing; (4) a nonlinear programming reformulation by Lau, Leung, and
Tse. The methods are all implemented through (freely downloadable) MATLAB
m-files; their use is illustrated by a common data set carried throughout. For
the computationally intensive dynamic programming formulation that can guarantee a
globally optimal solution, several possible computational improvements are
discussed and evaluated using (a) a transformation of a given m-function with
the MATLAB Compiler into C code and compiling the latter; (b) rewriting an
m-function and a mandatory MATLAB gateway directly in Fortran and compiling
into a MATLAB callable file; (c) comparisons of the acceleration of raw
m-files implemented under the most recent release of MATLAB Version 6.5 (and compared to the absence of such
acceleration under the previous MATLAB Version 6.1). Finally, and in contrast
to the combinatorial optimization task of identifying a best unidimensional
scaling for a given proximity matrix, an approach is given for the
confirmatory fitting of a given unidimensional scaling based only on a fixed
object ordering, and to nonmetric unidimensional scaling that incorporates an
additional optimal monotonic transformation of the proximities.
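The confirmatory case has a well-known closed form (usually attributed to Defays, 1978): for a fixed object order, the least-squares coordinates are each object's summed dissimilarities to the objects placed before it, minus those to the objects placed after it, divided by n. A sketch under that assumption (illustrative names, not the authors' m-files):

```python
def coords_for_order(D, order):
    """Closed-form coordinates for L2 unidimensional scaling given a
    fixed object order (coordinates come out centered at zero):
        x_i = ( sum of d_ij over j placed before i
              - sum of d_ij over j placed after  i ) / n
    D is a symmetric dissimilarity matrix; order permutes 0..n-1."""
    n = len(order)
    pos = {obj: k for k, obj in enumerate(order)}
    x = [0.0] * n
    for i in range(n):
        before = sum(D[i][j] for j in range(n) if j != i and pos[j] < pos[i])
        after = sum(D[i][j] for j in range(n) if j != i and pos[j] > pos[i])
        x[i] = (before - after) / n
    return x

# Perfectly unidimensional data: objects at positions 0, 1, 3.
D = [[0, 1, 3],
     [1, 0, 2],
     [3, 2, 0]]
x = coords_for_order(D, [0, 1, 2])
```

For these data the recovered gaps are exactly 1 and 2, reproducing D; the combinatorial task the paper addresses is finding the order that makes this fit best.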
6.
Frame-based Terminology, one of the newest schools of descriptive terminology, reflects on and critiques general terminology theory. This article introduces the school's three research foci: it advocates organizing concepts around events, thereby bringing the syntactic and combinatorial features of terms into the scope of research; it examines the multidimensionality of terminological concepts, highlighting the important role of contextual factors in the representation of those concepts; and it treats specialized corpora as the main source for extracting conceptual knowledge, adopting a primarily bottom-up research path.
7.
Nicolas Molinari 《Journal of Classification》2007,24(2):221-234
Data in many different fields come to practitioners through a process naturally described as functional. We propose a classification
procedure for oxidation curves. Our algorithm is based on two stages: fitting the functional data by linear splines with free
knots, and then classifying the estimated knots, which estimate useful oxidation parameters. A real data set of 57 oxidation curves
is used to illustrate our approach.
8.
The Self-Organizing Feature Maps (SOFM; Kohonen 1984) algorithm is a well-known example of unsupervised learning in connectionism and is a clustering method closely related to k-means. Generally the data set is available before running the algorithm, and the clustering problem can be approached by optimizing an inertia criterion. In this paper we consider the probabilistic approach to this problem. We propose a new algorithm based on the Expectation Maximization principle (EM; Dempster, Laird, and Rubin 1977). The new method can be viewed as a Kohonen-type EM and gives better insight into the SOFM with respect to constrained clustering. We perform numerical experiments and compare our results with the standard Kohonen approach.
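As a point of reference for the abstract's remark that SOFM is closely related to k-means, here is a minimal k-means sketch on scalars (the classical relative, not the paper's EM algorithm):

```python
def kmeans_1d(xs, centers, iters=20):
    """Plain k-means on scalars: assign each point to its nearest
    center, then move each center to the mean of its points."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in xs:
            k = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
            groups[k].append(x)
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers

centers = kmeans_1d([0.0, 0.2, 0.4, 9.6, 9.8, 10.0], [0.0, 10.0])
```

SOFM differs by updating neighboring map units together, and the paper's EM variant replaces the hard assignment with posterior probabilities under a mixture model.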
9.
The main aim of this work is the study of clustering dependent data by means of copula functions. Copulas are popular multivariate
tools whose importance within clustering methods has not yet been investigated in detail. We propose a new algorithm (CoClust
in brief) that clusters dependent data according to the multivariate structure of the generating process, without
any assumption on the margins. Moreover, the approach requires neither a starting classification nor an a priori choice of
the number of clusters; in fact, the CoClust selects both by using a criterion based on the log-likelihood of a copula fit.
We test our proposal on simulated data for different dependence scenarios and compare it with a model-based clustering technique.
Finally, we show applications of the CoClust to real microarray data of breast-cancer patients.
10.
Variable Selection for Clustering and Classification
As data sets continue to grow in size and complexity, effective and efficient techniques are needed to target important features in the variable space. Many of the variable selection techniques that are commonly used alongside clustering algorithms are based upon determining the best variable subspace according to model fitting in a stepwise manner. These techniques are often computationally intensive and can require extended periods of time to run; in fact, some are prohibitively computationally expensive for high-dimensional data. In this paper, a novel variable selection technique is introduced for use in clustering and classification analyses that is both intuitive and computationally efficient. We focus largely on applications in mixture model-based learning, but the technique could be adapted for use with various other clustering/classification methods. Our approach is illustrated on both simulated and real data, highlighted by contrasting its performance with that of other comparable variable selection techniques on the real data sets.
11.
Faicel Chamroukhi 《Journal of Classification》2016,33(3):374-411
This paper introduces a novel mixture model-based approach to the simultaneous clustering and optimal segmentation of functional data, which are curves presenting regime changes. The proposed model consists of a finite mixture of piecewise polynomial regression models. Each piecewise polynomial regression model is associated with a cluster, and within each cluster, each piecewise polynomial component is associated with a regime (i.e., a segment). We derive two approaches to learning the model parameters: the first is an estimation approach which maximizes the observed-data likelihood via a dedicated expectation-maximization (EM) algorithm, then yielding a fuzzy partition of the curves into K clusters obtained at convergence by maximizing the posterior cluster probabilities. The second is a classification approach and optimizes a specific classification likelihood criterion through a dedicated classification expectation-maximization (CEM) algorithm. The optimal curve segmentation is performed by using dynamic programming. In the classification approach, both the curve clustering and the optimal segmentation are performed simultaneously as the CEM learning proceeds. We show that the classification approach is a probabilistic version generalizing the deterministic K-means-like algorithm proposed in Hébrail, Hugueney, Lechevallier, and Rossi (2010). The proposed approach is evaluated using simulated curves and real-world curves. Comparisons with alternatives including regression mixture models and the K-means-like algorithm for piecewise regression demonstrate the effectiveness of the proposed approach.
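The dynamic-programming segmentation step can be sketched in its simplest form, piecewise-constant segments scored by within-segment sum of squares (the paper fits piecewise polynomials, but the DP recursion is the same shape):

```python
def optimal_segmentation(y, n_segments):
    """Optimal split of a series into contiguous segments by dynamic
    programming, minimizing the within-segment sum of squared
    deviations from each segment mean (piecewise-constant sketch)."""
    n = len(y)

    def sse(a, b):  # cost of the segment y[a:b]
        seg = y[a:b]
        m = sum(seg) / len(seg)
        return sum((v - m) ** 2 for v in seg)

    INF = float("inf")
    # cost[k][b]: best cost of covering y[0:b] with k segments.
    cost = [[INF] * (n + 1) for _ in range(n_segments + 1)]
    cut = [[0] * (n + 1) for _ in range(n_segments + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for b in range(k, n + 1):
            for a in range(k - 1, b):
                c = cost[k - 1][a] + sse(a, b)
                if c < cost[k][b]:
                    cost[k][b], cut[k][b] = c, a
    # Recover the segment boundaries by backtracking.
    bounds, b = [n], n
    for k in range(n_segments, 0, -1):
        b = cut[k][b]
        bounds.append(b)
    return cost[n_segments][n], bounds[::-1]

c, bounds = optimal_segmentation([0, 0, 0, 5, 5, 5], 2)
```

For this toy series the DP finds the regime change exactly, splitting at index 3 with zero cost; in the paper this recursion runs inside each EM/CEM iteration, once per cluster.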
12.
There have been many comparative studies of classification methods in which real datasets are used as a gauge to assess the
relative performance of the methods. Since these comparisons often yield inconclusive or limited results on how methods perform,
it is often believed that a broader approach combining these studies would shed some light on this difficult question. This
paper describes such an attempt: we have sampled the available literature and created a dataset of 5807 classification results.
We show that one of the possible ways to analyze the resulting data is an overall assessment of the classification methods,
and we present methods for that particular aim. The merits and demerits of such an approach are discussed, and conclusions
are drawn which may assist future research: we argue that the current state of the literature hardly allows large-scale investigations.
This work was sponsored by the MOD Corporate Research Programme, CISP, as part of a larger project on technology assessment.
We would like to express our appreciation to Andrew Webb for his support throughout the entire project, and to Wojtek Krzanowski
for valuable comments on a draft of this paper. We would also like to thank the anonymous referees for some very interesting
comments, some of which we hope to pursue in future work.
13.
Viktor Brailovsky 《Journal of Classification》1988,5(1):89-99
This report extends earlier work by Brailovsky on regression theory and methodology, giving particular emphasis to function approximation for incompletely specified models. The interest here is with situations where the form of the regression relation is not known in advance. We discuss several difficulties that arise in using local approximation and linear regression methods, and propose ways to overcome these problems. To aid the data analyst in developing a suitable model, an illustrative table is derived for determining the number of initial explanatory functions justifiable for a given prespecified confidence level. The general approach formulated here is illustrated with an application to medical data. Relevance to classification and possible extensions are discussed.
14.
Fionn Murtagh 《Journal of Classification》1998,15(2):161-183
We discuss the use of orthogonal wavelet transforms in preprocessing multivariate data for subsequent analysis, e.g., by
clustering or dimensionality reduction. Wavelet transforms allow us to introduce multiresolution approximation, and multiscale
nonparametric regression or smoothing, in a natural and integrated way into the data analysis. As will be explained in the
first part of the paper, this approach is of greatest interest for multivariate data analysis when we use (i) datasets with
ordered variables, e.g., time series, and (ii) object dimensionalities which are not too small, e.g., 16 and upwards. In
the second part of the paper, a different type of wavelet decomposition is used. Applications illustrate the power
of this new perspective on data analysis.
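A single level of the Haar transform, the simplest orthogonal wavelet, shows the smooth/detail split that such preprocessing relies on (a generic illustration, not the specific transforms studied in the paper):

```python
def haar_step(v):
    """One level of the orthonormal Haar wavelet transform on a
    vector of even length: pairwise sums (smooth part) and pairwise
    differences (detail part), scaled by 1/sqrt(2) so that the
    transform preserves the sum of squares."""
    s = 2 ** -0.5
    smooth = [s * (v[i] + v[i + 1]) for i in range(0, len(v), 2)]
    detail = [s * (v[i] - v[i + 1]) for i in range(0, len(v), 2)]
    return smooth, detail

smooth, detail = haar_step([4.0, 4.0, 2.0, 0.0])
```

Smoothing or filtering then amounts to shrinking small `detail` coefficients before inverting the transform; because the transform is orthogonal, the energy of the data is exactly partitioned between the two parts.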
15.
Mohammed Bennani Dosse Jos M.F. ten Berge Jorge N. Tendeiro 《Journal of Classification》2011,28(2):144-155
The use of Candecomp to fit scalar products in the context of Indscal is based on the assumption that, due to the symmetry
of the data matrices involved, two component matrices will become equal when Candecomp converges. Bennani Dosse and Ten Berge
(2008) have shown that, in the single component case, the assumption can only be violated at saddle points in the case of
Gramian matrices. This paper again considers Candecomp applied to symmetric matrices, but with an orthonormality constraint
on the components. This constrained version of Candecomp, when applied to symmetric matrices, has long been known under the
acronym Indort. When the data matrices are positive definite, or have become positive semidefinite due to double centering,
and the saliences are nonnegative, by chance or by constraint, the component matrices resulting from Indort are shown to
be equal. Because Indort is also free from so-called degeneracy problems, it is a highly attractive alternative to Candecomp
in the present context. We also consider a well-known successive approach to the orthogonally constrained Indscal problem
and we compare, using simulated and real data sets, its results with those given by the simultaneous (Indort) approach.
16.
17.
Ron Wehrens Lutgarde M.C. Buydens Chris Fraley Adrian E. Raftery 《Journal of Classification》2004,21(2):231-253
The rapid increase in the size of data sets makes clustering all the more important
to capture and summarize the information, at the same time making clustering more
difficult to accomplish. If model-based clustering is applied directly to a large data set, it
can be too slow for practical application. A simple and common approach is to first cluster
a random sample of moderate size, and then use the clustering model found in this way
to classify the remainder of the objects. We show that, in its simplest form, this method
may lead to unstable results. Our experiments suggest that a stable method with better performance can be obtained with two straightforward modifications to the simple sampling
method: several tentative models are identified from the sample instead of just one, and
several EM steps are used rather than just one E step to classify the full data set. We find
that there are significant gains from increasing the size of the sample up to about 2,000,
but not from further increases. These conclusions are based on the application of several
alternative strategies to the segmentation of three different multispectral images, and to
several simulated data sets.
18.
A mathematical programming approach to fitting general graphs
We present an algorithm for fitting general graphs to proximity data. The algorithm utilizes a mathematical programming procedure based on a penalty function approach to impose additivity constraints upon parameters. For a user-specified number of links, the algorithm seeks to provide the connected network that gives the least-squares approximation to the proximity data with the specified number of links, allowing for linear transformations of the data. The network distance is the minimum-path-length metric for connected graphs. As a limiting case, the algorithm provides a tree where each node corresponds to an object, if the number of links is set equal to the number of objects minus one. A Monte Carlo investigation indicates that the resulting networks tend to fall within one percentage point of the least-squares solution in terms of the variance accounted for, but do not always attain this global optimum. The network model is discussed in relation to ordinal network representations (Klauer 1989) and NETSCAL (Hutchinson 1989), and applied to several well-known data sets.
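The minimum-path-length metric referred to above is just all-pairs shortest-path distance over the fitted links; a standard Floyd-Warshall sketch computes it (illustrative code, not the paper's fitting algorithm):

```python
def minimum_path_lengths(n, links):
    """All-pairs minimum-path-length distances in an undirected
    weighted graph, via Floyd-Warshall. links is a list of
    (i, j, weight) tuples; unreachable pairs stay at infinity."""
    INF = float("inf")
    d = [[0.0 if i == j else INF for j in range(n)] for i in range(n)]
    for i, j, w in links:
        d[i][j] = d[j][i] = min(d[i][j], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

# Four nodes: a path 0-1-2-3 plus a direct shortcut link 0-3.
d = minimum_path_lengths(4, [(0, 1, 1.0), (1, 2, 1.0),
                             (2, 3, 1.0), (0, 3, 2.0)])
```

Fitting then adjusts the link set and weights so that these path lengths approximate the observed proximities in the least-squares sense.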
19.
We present an alternative approach to Multiple Correspondence Analysis (MCA) that is appropriate when the data consist of
ordered categorical variables. MCA displays objects (individuals, units) and variables as individual points and sets of category
points in a low-dimensional space. We propose a hybrid decomposition on the basis of the classical indicator super-matrix,
using the singular value decomposition, and the bivariate moment decomposition by orthogonal polynomials. When compared to
standard MCA, the hybrid decomposition will give the same representation of the categories of the variables, but additionally,
we obtain a clear association interpretation among the categories in terms of linear, quadratic and higher order components.
Moreover, the graphical display of the individual units will show an automatic clustering.
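The indicator super-matrix that the decomposition starts from is simple to construct: each categorical variable becomes a block of 0/1 dummy columns, one per category (a minimal sketch with illustrative names; the paper's contribution lies in how this matrix is decomposed):

```python
def indicator_matrix(data):
    """Indicator ("complete disjunctive") super-matrix used as the
    starting point of MCA: one 0/1 column per category of each
    variable. data is a list of rows of category labels."""
    n_vars = len(data[0])
    # Ordered category lists, one per variable.
    cats = [sorted({row[v] for row in data}) for v in range(n_vars)]
    Z = []
    for row in data:
        z = []
        for v in range(n_vars):
            z.extend(1 if row[v] == c else 0 for c in cats[v])
        Z.append(z)
    return Z, cats

# Three units, two categorical variables.
Z, cats = indicator_matrix([("low", "no"), ("high", "yes"), ("low", "yes")])
```

Each row of Z sums to the number of variables; the hybrid decomposition described above replaces the plain singular value decomposition of this matrix with one whose category scores come from orthogonal polynomials.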
20.
David Bryant 《Journal of Classification》2005,22(1):3-15
The Neighbor-Joining (NJ) method of Saitou and Nei is the most widely used
distance based method in phylogenetic analysis. Central to the method is the selection
criterion, the formula used to choose which pair of objects to amalgamate next. Here
we analyze the NJ selection criterion using an axiomatic approach. We show that any
selection criterion that is linear, permutation equivariant, statistically consistent and based
solely on distance data will give the same trees as those created by NJ.
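The NJ selection criterion itself is compact enough to state in code: join the pair (i, j) minimizing Q(i, j) = (n - 2) d(i, j) - R_i - R_j, where R_i is the sum of i's distances to all other taxa. A minimal sketch of one selection step (not a full NJ implementation):

```python
def nj_selection(D):
    """Neighbor-Joining selection criterion (Saitou and Nei):
    return the pair (i, j) minimizing
        Q(i, j) = (n - 2) * d(i, j) - R_i - R_j,
    where R_i is the sum of distances from taxon i to all others."""
    n = len(D)
    R = [sum(row) for row in D]
    best, pair = float("inf"), None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - R[i] - R[j]
            if q < best:
                best, pair = q, (i, j)
    return pair

# Additive distances on four taxa with cherries {0,1} and {2,3};
# the criterion correctly selects a cherry to join first.
D = [[0, 2, 7, 7],
     [2, 0, 7, 7],
     [7, 7, 0, 2],
     [7, 7, 2, 0]]
pair = nj_selection(D)
```

The axiomatic result summarized above says that any linear, permutation-equivariant, statistically consistent criterion built from distances alone must reproduce the trees this rule yields.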