20 similar documents found.
1.
In this paper, we propose a bicriterion objective function for clustering a given set of N entities, which minimizes [λd – (1 – λ)s], where 0 ≤ λ ≤ 1, and d and s are the diameter and the split of the clustering, respectively. When λ = 1, the problem reduces to minimum diameter clustering, and when λ = 0, to maximum split clustering. We show that this objective provides an effective way to compromise between the two often conflicting criteria. While the problem is NP-hard in general, a polynomial algorithm with worst-case time complexity O(N²) is devised to solve the bipartition version. This algorithm actually gives all the Pareto optimal bipartitions with respect to diameter and split, and it can be extended to yield an efficient divisive hierarchical scheme. An extension of the approach to the objective [λ(d₁ + d₂) – 2(1 – λ)s] is also proposed, where d₁ and d₂ are the diameters of the two clusters of a bipartition.
This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (Grant OGP 0104900). The authors wish to thank two anonymous referees, whose detailed comments on earlier drafts improved the paper.
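As a reading aid, here is a minimal numpy sketch that evaluates this bicriterion objective for one candidate bipartition; the helper names are mine, it assumes d is the larger of the two cluster diameters and s the split between the two clusters, and it is not the authors' O(N²) algorithm for enumerating the Pareto optimal bipartitions.

```python
import numpy as np

def diameter(D, cluster):
    """Largest within-cluster dissimilarity (0 for singletons)."""
    return D[np.ix_(cluster, cluster)].max() if len(cluster) > 1 else 0.0

def split(D, c1, c2):
    """Smallest dissimilarity between the two clusters."""
    return D[np.ix_(c1, c2)].min()

def bicriterion(D, c1, c2, lam):
    """lam*d - (1 - lam)*s for the bipartition (c1, c2)."""
    d = max(diameter(D, c1), diameter(D, c2))
    return lam * d - (1 - lam) * split(D, c1, c2)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # Euclidean dissimilarities
print(bicriterion(D, [0, 1, 2], [3, 4, 5], lam=0.5))
```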
2.
Consider N entities to be classified, and a matrix of dissimilarities between pairs of them. The split of a cluster is the smallest dissimilarity between an entity of this cluster and an entity outside it. The single-linkage algorithm provides partitions into M clusters for which the smallest split is maximum. We study here the average split of the clusters or, equivalently, the sum of splits. An O(N²) algorithm is provided to determine maximum sum-of-splits partitions into M clusters for all M between N – 1 and 2, using the dual graph of the single-linkage dendrogram.
Résumé: Consider N objects to be classified and a matrix of dissimilarities between pairs of these objects. The split of a class is the smallest dissimilarity between an object of this class and an object outside it. The single-linkage algorithm provides partitions into M classes whose smallest split is maximum. We study the average split of the classes or, equivalently, the sum of the splits. An O(N²) algorithm is proposed for determining partitions into M classes whose sum of splits is maximum, for M ranging from N – 1 to 2, based on the dual graph of the single-linkage dendrogram.
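A brute-force sketch of the quantity studied above, the split of each cluster and the resulting sum of splits, for an arbitrary given partition; the O(N²) construction based on the dual graph of the single-linkage dendrogram is not reproduced, and the helper names are mine.

```python
import numpy as np

def cluster_split(D, cluster):
    """Smallest dissimilarity between an entity in the cluster and one outside it."""
    inside = np.zeros(D.shape[0], dtype=bool)
    inside[cluster] = True
    return D[np.ix_(inside, ~inside)].min()

def sum_of_splits(D, partition):
    return sum(cluster_split(D, c) for c in partition)

rng = np.random.default_rng(1)
X = rng.normal(size=(7, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(sum_of_splits(D, [[0, 1, 2], [3, 4], [5, 6]]))
```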
3.
The Self-Organizing Feature Maps (SOFM; Kohonen 1984) algorithm is a well-known example of unsupervised learning in connectionism and is a clustering method closely related to k-means. Generally the data set is available before running the algorithm, and the clustering problem can be approached by optimizing an inertia criterion. In this paper we consider the probabilistic approach to this problem. We propose a new algorithm based on the Expectation Maximization principle (EM; Dempster, Laird, and Rubin 1977). The new method can be viewed as a Kohonen-type EM and gives better insight into the SOFM in terms of constrained clustering. We perform numerical experiments and compare our results with the standard Kohonen approach.
4.
The problem of determining a partition of a given set of N entities into M clusters such that the sum of the diameters of these clusters is minimum has been studied by Brucker (1978). He proved that it is NP-complete for M ≥ 3 and mentioned that its complexity was unknown for M = 2. We provide an O(N³ log N) algorithm for this latter case. Moreover, we show that determining a partition into two clusters which minimizes any given function of the diameters can be done in O(N⁵) time.
Acknowledgments: This research was supported by the Air Force Office of Scientific Research Grant AFOSR 0271 to Rutgers University. We are grateful to Yves Crama for several insightful remarks and to an anonymous referee for detailed comments.
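For intuition only, a brute-force enumeration of the bipartition criterion discussed above (minimum sum of the two cluster diameters); it is exponential in N and merely illustrates the objective that the O(N³ log N) algorithm optimizes efficiently.

```python
from itertools import combinations
import numpy as np

def diameter(D, cluster):
    return max((D[i, j] for i, j in combinations(cluster, 2)), default=0.0)

def min_sum_of_diameters_bipartition(D):
    n = D.shape[0]
    best = None
    for size in range(1, n // 2 + 1):
        for c1 in combinations(range(n), size):
            c2 = tuple(i for i in range(n) if i not in c1)
            cost = diameter(D, c1) + diameter(D, c2)
            if best is None or cost < best[0]:
                best = (cost, c1, c2)
    return best

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(min_sum_of_diameters_bipartition(D))
```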
5.
Direct multicriteria clustering algorithms
In a multicriteria clustering problem, optimization over more than one criterion is required. The problem can be treated in different ways: by reduction to a clustering problem with a single criterion obtained as a combination of the given criteria; by constrained clustering algorithms, where a selected criterion is considered as the clustering criterion and all the others determine the constraints; or by direct algorithms. In this paper two types of direct algorithms for solving the multicriteria clustering problem are proposed: the modified relocation algorithm and the modified agglomerative algorithm. Different elaborations of these two types of algorithms are discussed and compared. Finally, two applications of the proposed algorithms are presented.
Elaborated version of the talks presented at the First Conference of the International Federation of Classification Societies, Aachen, 1987, at the International Conference on Social Science Methodology, Dubrovnik, 1988, and at the Second Conference of the International Federation of Classification Societies, Charlottesville, 1989. This work was supported in part by the Research Council of Slovenia.
6.
A mathematical programming algorithm is developed for fitting ultrametric or additive trees to proximity data where external constraints are imposed on the topology of the tree. The two procedures minimize a least squares loss function. The method is illustrated on both synthetic and real data. A constrained ultrametric tree analysis was performed on similarities between 32 subjects based on preferences for ten odors, while a constrained additive tree analysis was carried out on some proximity data between kinship terms. Finally, some extensions of the methodology to other tree fitting procedures are mentioned.
The first author is supported as Bevoegdverklaard Navorser of the Belgian Nationaal Fonds voor Wetenschappelijk Onderzoek.
7.
Assume that a dissimilarity measure between elements and subsets of the set being clustered is given. We define the transformation of the set of subsets under which each subset is transformed into the set of all elements whose dissimilarity to it is not greater than a given threshold. A cluster is then defined as a fixed point of this transformation. Three well-known clustering strategies are considered from this point of view: hierarchical clustering, graph-theoretic methods, and conceptual clustering. For hierarchical clustering, generalizations are obtained that allow for overlapping clusters and/or clusters not forming a cover. Three properties of dissimilarity are introduced which guarantee the existence of fixed points for each threshold. We develop the relation to the theory of quasi-concave set functions to help give an additional interpretation of clusters.
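A small sketch of the fixed-point notion described above, under an assumed element-to-subset dissimilarity (here simply the minimum distance to the subset; the framework allows other choices, and the function names are mine).

```python
import numpy as np

def transform(D, subset, t):
    """T(A): all elements whose dissimilarity to A is not greater than t.
    Element-to-subset dissimilarity is taken as the minimum distance (an assumption)."""
    members = sorted(subset)
    d_to_A = D[:, members].min(axis=1)
    return frozenset(np.flatnonzero(d_to_A <= t).tolist())

def is_cluster(D, subset, t):
    """A subset is a cluster when it is a fixed point of the transformation."""
    return transform(D, subset, t) == frozenset(subset)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (4, 2)), rng.normal(5, 0.3, (4, 2))])
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(is_cluster(D, {0, 1, 2, 3}, t=1.5))   # a well-separated group: expected fixed point
print(is_cluster(D, {0, 1}, t=1.5))         # T typically adds the nearby points 2 and 3
```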
8.
Piecewise hierarchical clustering
Gildas Brossier, Journal of Classification, 1990, 7(2): 197-216
We consider two or more ultrametric distance matrices defined over different, possibly overlapping, subsets. These matrices are merged into one ultrametric matrix defined over the whole set. Necessary and sufficient conditions for uniqueness of the merging are established. When these conditions are not satisfied, consistent algorithms are given.
The author thanks B. Monjardet and two anonymous referees for their helpful comments.
9.
The additive clustering approach is applied to the problem of two-mode clustering and compared with the recent error-variance approach of Eckes and Orlik (1993). Although the schemes of the computational algorithms look very similar in both approaches, additive clustering has been shown to have several advantages. Specifically, two technical limitations of the error-variance approach (see Eckes and Orlik 1993, p. 71) have been overcome within the additive clustering framework.
The research was supported by the Office of Naval Research under grant number N0014-93-1-0222 to Rutgers University. The authors are indebted both to Fionn Murtagh, who served as Acting Editor, and to anonymous Referees for thoughtful and constructive reviews.
10.
J. A. Hartigan, Journal of Classification, 1985, 2(1): 63-76
A number of statistical models for forming and evaluating clusters are reviewed. Hierarchical algorithms are evaluated by their ability to discover high density regions in a population, and complete linkage hopelessly fails; the others don't do too well either. Single linkage is at least of mathematical interest because it is related to the minimum spanning tree and percolation. Mixture methods are examined, related to k-means, and the failure of likelihood tests for the number of components is noted. The DIP test for estimating the number of modes in a univariate population measures the distance between the empirical distribution function and the closest unimodal distribution function (or k-modal distribution function when testing for k modes). Its properties are examined and multivariate extensions are proposed. Ultrametric and evolutionary distances on trees are considered briefly.
Research supported by the National Science Foundation Grant No. MCS-8102280.
11.
Whenever n objects are characterized by a matrix of pairwise dissimilarities, they may be clustered by any of a number of sequential, agglomerative, hierarchical, nonoverlapping (SAHN) clustering methods. These SAHN clustering methods are defined by a paradigmatic algorithm that usually requires O(n³) time, in the worst case, to cluster the objects. An improved algorithm (Anderberg 1973), while still requiring O(n³) worst-case time, can reasonably be expected to exhibit O(n²) expected behavior. By contrast, we describe a SAHN clustering algorithm that requires O(n² log n) time in the worst case. When SAHN clustering methods exhibit reasonable space distortion properties, further improvements are possible. We adapt a SAHN clustering algorithm, based on the efficient construction of nearest neighbor chains, to obtain a reasonably general SAHN clustering algorithm that requires in the worst case O(n²) time and space.
Whenever n objects are characterized by k-tuples of real numbers, they may be clustered by any of a family of centroid SAHN clustering methods. These methods are based on a geometric model in which clusters are represented by points in k-dimensional real space and points being agglomerated are replaced by a single (centroid) point. For this model, we have solved a class of special packing problems involving point-symmetric convex objects and have exploited it to design an efficient centroid clustering algorithm. Specifically, we describe a centroid SAHN clustering algorithm that requires O(n²) time, in the worst case, for fixed k and for a family of dissimilarity measures including the Manhattan, Euclidean, Chebychev and all other Minkowski metrics.
This work was partially supported by the Natural Sciences and Engineering Research Council of Canada and by the Austrian Fonds zur Förderung der wissenschaftlichen Forschung.
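For orientation, the SAHN family discussed above is what SciPy exposes through its hierarchical clustering routines, and, to my knowledge, nearest-neighbor-chain constructions underlie modern implementations of several of these methods; a minimal usage sketch with a centroid method on raw k-tuples:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(6, 1, (20, 3))])  # k-tuples, k = 3

Z = linkage(X, method='centroid')                 # centroid SAHN clustering
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into two clusters
print(labels)
```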
12.
Cross-sectional approach for clustering time varying data
Cluster analysis is to be performed on a three-mode data matrix of type: units, variables, time. A general model for calculating the distance between two units varying in time is proposed. One particular model is developed and used in an example concerned with clustering of 23 European countries according to the similarity of energy consumption in the years 1976–1982.
Supported in part by the Research Council of Slovenia, Yugoslavia.
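One simple instance of the general idea, stated as an assumption rather than as the particular model developed in the paper: take the distance between two units to be a (possibly weighted) sum over time points of their cross-sectional distances.

```python
import numpy as np

def time_varying_distance(xi, xj, weights=None):
    """xi, xj: arrays of shape (T, p), one row of variable values per time point."""
    per_time = np.linalg.norm(xi - xj, axis=1)   # cross-sectional distance at each time point
    w = np.ones(len(per_time)) if weights is None else np.asarray(weights)
    return float(np.sum(w * per_time))

rng = np.random.default_rng(5)
a, b = rng.normal(size=(7, 4)), rng.normal(size=(7, 4))   # 7 years, 4 variables
print(time_varying_distance(a, b))
```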
13.
Proportional link linkage (PLL) clustering methods are a parametric family of monotone invariant agglomerative hierarchical clustering methods. This family includes the single, minimedian, and complete linkage clustering methods as special cases; its members are used in psychological and ecological applications. Since the literature on clustering space distortion is oriented to quantitative input data, we adapt its basic concepts to input data with only ordinal significance and analyze the space distortion properties of PLL methods. To enable PLL methods to be used when the number n of objects being clustered is large, we describe an efficient PLL algorithm that operates in O(n² log n) time and O(n²) space.
This work was partially supported by the Natural Sciences and Engineering Research Council of Canada and by the Austrian Fonds zur Förderung der wissenschaftlichen Forschung.
14.
Variable selection in clustering
Standard clustering algorithms can completely fail to identify clear cluster structure if that structure is confined to a subset of the variables. A forward selection procedure for identifying the subset is proposed and studied in the context of complete linkage hierarchical clustering. The basic approach can be applied to other clustering methods, too.
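A hedged sketch of such a forward selection wrapper around complete-linkage hierarchical clustering; the scoring rule used here (average silhouette width) and the function name are placeholders of my own choosing, not necessarily the criterion studied in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

def forward_select(X, k, max_vars):
    selected, remaining, best_score = [], list(range(X.shape[1])), -np.inf
    while remaining and len(selected) < max_vars:
        scores = {}
        for v in remaining:
            cols = selected + [v]
            labels = fcluster(linkage(X[:, cols], method='complete'),
                              t=k, criterion='maxclust')
            scores[v] = silhouette_score(X[:, cols], labels)
        v = max(scores, key=scores.get)
        if scores[v] <= best_score:
            break                          # no candidate improves the score: stop
        best_score = scores[v]
        selected.append(v)
        remaining.remove(v)
    return selected

rng = np.random.default_rng(6)
informative = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
X = np.hstack([informative, rng.normal(0, 1, (60, 5))])   # 2 informative + 5 noise variables
print(forward_select(X, k=2, max_vars=3))                  # typically picks column 0 and/or 1 first
```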
15.
A permutation-based algorithm for block clustering
Hartigan (1972) discusses the direct clustering of a matrix of data into homogeneous blocks. He introduces a stepwise divisive method for block clustering within a certain class of block structures which induce clustering trees for both row and column margins. While this class of structures is appealing, the stopping criterion for his method, which is based on asymptotic theory and the assumption that the individual elements of the data matrix are normally distributed, is quite restrictive. In this paper we propose a permutation-based algorithm for block clustering within the same class of block structures. By using permutation arguments to decide where to split and when to stop, our algorithm becomes applicable in a wide variety of cases, including matrices of categorical data and matrices of small-to-moderate size. In addition, our algorithm offers considerable flexibility in how block homogeneity is defined. The algorithm is studied in a series of simulation experiments on matrices of known structure, and illustrated in examples drawn from the fields of taxonomy, political science, and data architecture.
16.
Michael P. Windham, Journal of Classification, 1987, 4(2): 191-214
The more ways there are of understanding a clustering technique, the more effectively the results can be analyzed and used. I will give a general procedure, called parameter modification, to obtain from a clustering criterion a variety of equivalent forms of the criterion. These alternative forms reveal aspects of the technique that are not necessarily apparent in the original formulation. This procedure is successful in improving the understanding of a significant number of clustering techniques. The insight obtained will be illustrated by applying parameter modification to partitioning, mixture and fuzzy clustering methods, resulting in a unified approach to the study of these methods and a general algorithm for optimizing them.
The author wishes to thank Professor Doctor Hans-Hermann Bock for many stimulating discussions.
17.
This paper presents a conditional mixture, maximum likelihood methodology for performing clusterwise linear regression. This new methodology simultaneously estimates separate regression functions and membership in K clusters or groups. A review of related procedures is discussed with an associated critique. The conditional mixture, maximum likelihood methodology is introduced together with the E-M algorithm utilized for parameter estimation. A Monte Carlo analysis is performed via a fractional factorial design to examine the performance of the procedure. Next, a marketing application is presented concerning the evaluations of trade show performance by senior marketing executives. Finally, other potential applications and directions for future research are identified.
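To make the conditional mixture idea concrete, a generic EM sketch for a mixture of K Gaussian linear regressions, written under textbook assumptions rather than as the authors' exact formulation; like any EM of this kind it may need several random restarts in practice.

```python
import numpy as np

def em_clusterwise_regression(X, y, K, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xd = np.hstack([np.ones((n, 1)), X])          # design matrix with intercept
    beta = rng.normal(size=(K, p + 1))            # random starting regression coefficients
    sigma2 = np.full(K, y.var())
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior probability that each observation belongs to each regression
        resid = y[:, None] - Xd @ beta.T
        dens = np.exp(-0.5 * resid**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted least squares, error variances, and mixing proportions
        for k in range(K):
            W = r[:, k]
            XtW = Xd.T * W
            beta[k] = np.linalg.solve(XtW @ Xd, XtW @ y)
            sigma2[k] = np.sum(W * (y - Xd @ beta[k])**2) / W.sum()
        pi = r.mean(axis=0)
    return beta, sigma2, pi, r

rng = np.random.default_rng(7)
x = rng.uniform(-3, 3, 200)
y = np.where(rng.random(200) < 0.5, 2 * x + 1, -x + 4) + rng.normal(0, 0.3, 200)
beta, _, _, _ = em_clusterwise_regression(x[:, None], y, K=2)
print(np.round(beta, 2))    # ideally close to (1, 2) and (4, -1), in some order
```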
18.
One of the main techniques embodied in many pattern recognition systems is cluster analysis — the identification of substructure in unlabeled data sets. The fuzzy c-means algorithms (FCM) have often been used to solve certain types of clustering problems. During the last two years several new local results concerning both numerical and stochastic convergence of FCM have been found. Numerical results describe how the algorithms behave when evaluated as optimization algorithms for finding minima of the corresponding family of fuzzy c-means functionals. Stochastic properties refer to the accuracy of minima of FCM functionals as approximations to parameters of statistical populations which are sometimes assumed to be associated with the data. The purpose of this paper is to collect the main global and local, numerical and stochastic, convergence results for FCM in a brief and unified way.
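For readers who have not seen it, the standard alternating-optimization form of the fuzzy c-means functional (with fuzzifier m) that the convergence results above concern; a minimal numpy sketch, not a production implementation.

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=X.shape[0])      # fuzzy memberships, rows sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # membership-weighted means
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + 1e-12
        inv = d2 ** (-1.0 / (m - 1.0))                  # membership update from squared distances
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
centers, U = fcm(X, c=2)
print(np.round(centers, 2))     # roughly the two group means
```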
19.
Divisive hierarchical clustering algorithms with the diameter criterion proceed by recursively selecting the cluster with largest diameter and partitioning it into two clusters whose largest diameter is smallest possible. We provide two such algorithms with complexities O(MN²) and O(N² log N) respectively, where M denotes the maximum number of clusters in a partition and N the number of entities to be clustered. The former algorithm, an efficient implementation of an algorithm of Hubert, allows one to find all partitions into at most M clusters and is in O(N²) for fixed M. Moreover, if in each partitioning the size of the largest cluster is bounded by p times the number of entities in the set to be partitioned, with 1/2 ≤ p < 1, it provides a complete hierarchy of partitions in O(N² log N) time. The latter algorithm, a refinement of an algorithm of Rao, allows one to build a complete hierarchy of partitions in O(N² log N) time without any restriction. Comparative computational experiments with both algorithms and with an agglomerative hierarchical algorithm of Benzécri are reported.
Résumé: Divisive hierarchical clustering algorithms using the diameter criterion recursively select the class of largest diameter and partition it into two classes whose largest diameter is as small as possible. We propose two such algorithms, with complexities O(MN²) and O(N² log N) respectively, where M denotes the maximum number of classes of a partition and N the number of objects to be classified. The first algorithm, an implementation of an algorithm of Hubert, builds partitions with at most M classes and runs in O(N²) for fixed M. Moreover, if in each bipartition the number of objects of the largest class is bounded by p times the number of objects of the set to be partitioned, where 1/2 ≤ p < 1, this algorithm builds a complete hierarchy of partitions in O(N² log N) time. The second algorithm, a refinement of an algorithm of Rao, builds a complete hierarchy of partitions in O(N² log N) time without any restriction. Comparative computational results are also presented for the two algorithms and for the agglomerative hierarchical clustering algorithm of Benzécri.
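A hedged sketch of the divisive scheme described above; the bipartition step here is a simple heuristic (seed the two clusters with the most distant pair, then assign the rest greedily), not the optimal largest-diameter bipartition computed by the algorithms analysed in the paper.

```python
import numpy as np

def diameter(D, cluster):
    return D[np.ix_(cluster, cluster)].max() if len(cluster) > 1 else 0.0

def heuristic_bipartition(D, cluster):
    sub = D[np.ix_(cluster, cluster)]
    i, j = np.unravel_index(np.argmax(sub), sub.shape)    # most distant pair as seeds
    c1, c2 = [cluster[i]], [cluster[j]]
    for k in cluster:
        if k not in (cluster[i], cluster[j]):
            (c1 if D[k, c1[0]] <= D[k, c2[0]] else c2).append(k)
    return c1, c2

def divisive_diameter(D, max_clusters):
    partition = [list(range(D.shape[0]))]
    while len(partition) < max_clusters:
        worst = max(partition, key=lambda c: diameter(D, c))   # cluster with largest diameter
        if len(worst) < 2:
            break
        partition.remove(worst)
        partition.extend(heuristic_bipartition(D, worst))
    return partition

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(m, 0.4, (8, 2)) for m in (0, 5, 10)])
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(divisive_diameter(D, max_clusters=3))
```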
20.
Recently, algorithms for optimally weighting variables in non-hierarchical and hierarchical clustering methods have been proposed. Preliminary Monte Carlo research has shown that at least one of these algorithms cross-validates extremely well. The present study applies a k-means, optimal weighting procedure to two empirical data sets and contrasts its cross-validation performance with that of unit (i.e., equal) weighting of the variables. We find that the optimal weighting procedure cross-validates better in one of the two data sets. In the second data set its comparative performance strongly depends on the approach used to find seed values for the initial k-means partitioning.
The authors would like to acknowledge the support of the Citibank Fellowship from the Sol C. Snider Entrepreneurial Center at the Wharton School. The authors would like to express their appreciation to J. Douglas Carroll and Abba M. Krieger for comments on an earlier version of the paper.
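A small illustration of the contrast studied here, with an arbitrarily chosen weight vector standing in for the output of an optimal-weighting algorithm (which is not reproduced): k-means on variables rescaled by the square roots of the weights versus unit weights.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def weighted_kmeans_labels(X, weights, k, seed=0):
    Xw = X * np.sqrt(np.asarray(weights, dtype=float))   # weighted Euclidean distances
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Xw)

rng = np.random.default_rng(10)
signal = np.vstack([rng.normal(0, 1, (40, 1)), rng.normal(6, 1, (40, 1))])   # one informative variable
X = np.hstack([signal, rng.normal(0, 1, (80, 3))])                           # plus three noise variables
truth = np.repeat([0, 1], 40)

unit = weighted_kmeans_labels(X, [1, 1, 1, 1], k=2)
opt = weighted_kmeans_labels(X, [4.0, 0.2, 0.2, 0.2], k=2)    # hypothetical "optimal" weights
print(adjusted_rand_score(truth, unit), adjusted_rand_score(truth, opt))
```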