共查询到20条相似文献,搜索用时 27 毫秒
1.
Maximum sum-of-splits clustering 总被引:1,自引:1,他引:0
ConsiderN entities to be classified, and a matrix of dissimilarities between pairs of them. The split of a cluster is the smallest dissimilarity between an entity of this cluster and an entity outside it. The single-linkage algorithm provides partitions intoM clusters for which the smallest split is maximum. We study here the average split of the clusters or, equivalently, the sum of splits. A (N
2) algorithm is provided to determine maximum sum-of-splits partitions intoM clusters for allM betweenN – 1 and 2, using the dual graph of the single-linkage dendrogram.
Résumé SoientN objets à classifier et une matrice de dissimilarit és entre paires de ces objets. L'écart d'une classe est la plus petite dissimilarité entre un objet de cette classe et un objet en dehors d'elle. L'algorithme du lien simple fournit des partitions enM classes dont le plus petit écart est maximum. On étudie l'écart moyen des classes, ou, ce qui est équivalent, la somme des écarts. On propose un algorithme en (N 2) pour déterminer des partitions enM classes dont la somme des écarts est maximum pourM allant deN – 1 à 2, basé sur le graphe dual du dendrogramme de la méthode du lien simple.相似文献
2.
Consider N entities to be classified (e.g., geographical areas), a matrix of dissimilarities
between pairs of entities, a graph H with vertices associated with these entities such that the edges join the vertices corresponding to contiguous entities. The split of a
cluster is the smallest dissimilarity between an entity of this cluster and an entity outside of
it. The single-linkage algorithm (ignoring contiguity between entities) provides partitions into M clusters for which the smallest split of the clusters, called split of the partition, is
maximum. We study here the partitioning of the set of entities into M connected clusters
for all M between N - 1 and 2 (i.e., clusters such that the subgraphs of H induced by their
corresponding sets of entities are connected) with maximum split subject to that condition.
We first provide an exact algorithm with a (N2) complexity for the particular case in which H is a tree. This algorithm suggests in turn a first heuristic algorithm for the general problem. Several variants of this heuristic are Also explored. We then present an exact
algorithm for the general case based on iterative determination of cocycles of subtrees and on the solution of auxiliary set covering problems. As solution of the latter problems is
time-consuming for large instances, we provide another heuristic in which the auxiliary
set covering problems are solved approximately. Computational results obtained with the
exact and heuristic algorithms are presented on test problems from the literature. 相似文献
3.
Dendrograms are widely used to represent graphically the clusters and partitions obtained with hierarchical clustering schemes. Espaliers are generalized dendrograms in which the length of horizontal lines is used in addition to their level in order to display the values of two characteristics of each cluster (e.g., the split and the diameter) instead of only one. An algorithm is first presented to transform a dendrogram into an espalier without rotation of any part of the former. This is done by stretching some of the horizontal lines to obtain a diagram with vertical and horizontal lines only, the cutting off by diagonal lines the parts of the horizontal lines exceeding their prescribed length. The problem of finding if, allowing rotations, no diagonal lines are needed is solved by anO(N
2) algorithm whereN is the number of entities to be classified. This algorithm is the generalized to obtain espaliers with minimum width and, possibly, some diagonal lines.Work of the first and second authors has been supported by FCAR (Fonds pour la Formation de Chercheurs et l'Aide à la Recherche) grant 92EQ1048, and grant N00014-92-J-1194 from the Office of Naval Research. Work of the first author has also been supported by NSERC (Natural Sciences and Engineering Research Council of Canada) grant to École des Hautes Études Commerciales, Montréal and by NSERC grant GP0105574. Work of the second author has been supported by NSERC grant GP0036426, by FCAR grant 90NC0305, and by an NSF Professorship for Women in Science at Princeton University from September 1990 until December 1991. Work of the third author was done in part during a visit to GERAD, Montréal. 相似文献
4.
Minimum sum of diameters clustering 总被引:1,自引:1,他引:0
The problem of determining a partition of a given set ofN entities intoM clusters such that the sum of the diameters of these clusters is minimum has been studied by Brucker (1978). He proved that it is NP-complete forM3 and mentioned that its complexity was unknown forM=2. We provide anO(N
3 logN) algorithm for this latter case. Moreover, we show that determining a partition into two clusters which minimizes any given function of the diameters can be done inO(N
5) time.Acknowledgments: This research was supported by the Air Force Office of Scientific Research Grant AFOSR 0271 to Rutgers University. We are grateful to Yves Crama for several insightful remarks and to an anonymous referee for detailed comments. 相似文献
5.
Efficient algorithms for divisive hierarchical clustering with the diameter criterion 总被引:1,自引:1,他引:0
Divisive hierarchical clustering algorithms with the diameter criterion proceed by recursively selecting the cluster with
largest diameter and partitioning it into two clusters whose largest diameter is smallest possible. We provide two such algorithms
with complexitiesO(
N
2) andO(N
2logN) respectively, where
denotes the maximum number of clusters in a partition andN the number of entities to be clustered. The former algorithm, an efficient implementation of an algorithm of Hubert, allows
to find all partitions into at most
clusters and is inO(N
2) for fixed
. Moreover, if in each partitioning the size of the largest cluster is bounded byp times the number of entities in the set to be partitioned, with 1/2<=p<1, it provides a complete hierarchy of partitionsO(N
2 logN) time. The latter algorithm, a refinement of an algorithm of Rao allows to build a complete hierarchy of partitions inO(N
2 logN) time without any restriction. Comparative computational experiments with both algorithms and with an agglomerative hierarchical
algorithm of Benzécri are reported.
Résumé Les algorithmes de classification hiérarchique descendante utilisant le critère du diamètre, sélectionnent récursivement la classe de plus grand diamètre et la partitionnent en deux classes, dont le plus grand diamètre est le plus, petit possible. Nous proposons deux tels algorithmes, avec des complexités enO ( N2) etO(N 2 logN) respectivement, où désigne le nombre maximum de classes d'une partition etN le nombre d'objets à classifier. Le premier algorithme, une implantation d'un algorithme de Hubert, permet de construire des partitions avec au plus classes et est enO(N 2) pour fixé. De plus, si dans chaque bipartition le nombre d'objets de la plus grande classe, est borné parp fois le nombre d'objets de l'ensemble à partitionner, où 1/2≤p<1, cet algorithme permet de construire une hiérarchie complète de partitions en tempsO(N 2 logN). Le second algorithme, un raffinement d'un algorithme de Rao, permet de construire une hiérarchie complète de partitions en tempsO(N 2 logN) sans aucune restriction On présente également des résultats de calcul comparatifs pour les deux algorithmes et pour l'algorithme de classification hiérarchique ascendante de Benzécri.相似文献
6.
M. Anthony Wong 《Journal of Classification》1984,1(1):255-270
A random sample of sizeN is divided intok clusters that minimize the within clusters sum of squares locally. Some large sample properties of this k-means clustering method (ask approaches withN) are obtained. In one dimension, it is established that the sample k-means clusters are such that the within-cluster sums of squares are asymptotically equal, and that the sizes of the cluster intervals are inversely proportional to the one-third power of the underlying density at the midpoints of the intervals. The difficulty involved in generalizing the results to the multivariate case is mentioned.This research was supported in part by the National Science Foundation under Grant MCS75-08374. The author would like to thank John Hartigan and David Pollard for helpful discussions and comments. 相似文献
7.
In this paper, we propose a bicriterion objective function for clustering a given set ofN entities, which minimizes [d–(1–)s], where 01, andd ands are the diameter and the split of the clustering, respectively. When =1, the problem reduces to minimum diameter clustering, and when =0, maximum split clustering. We show that this objective provides an effective way to compromise between the two often conflicting criteria. While the problem is NP-hard in general, a polynomial algorithm with the worst-case time complexityO(N
2) is devised to solve the bipartition version. This algorithm actually gives all the Pareto optimal bipartitions with respect to diameter and split, and it can be extended to yield an efficient divisive hierarchical scheme. An extension of the approach to the objective [(d
1+d
2)–2(1–)s] is also proposed, whered
1 andd
2 are diameters of the two clusters of a bipartition.This research was supported in part by the National Science and Engineering Research Council of Canada (Grant OGP 0104900). The authors wish to thank two anonymous referees, whose detailed comments on earlier drafts improved the paper. 相似文献
8.
Clique optimization: A method to construct parsimonious ultrametric trees from similarity data 总被引:1,自引:1,他引:0
N. Sriram 《Journal of Classification》1990,7(1):33-52
9.
This research note focuses on a problem where the cluster sizes for two partitions of the same object set are assumed known;
however, the actual assignments of objects to clusters are unknown for one or both partitions. The objective is to find a
contingency table that produces maximum possible agreement between the two partitions, subject to constraints that the row
and column marginal frequencies for the table correspond exactly to the cluster sizes for the partitions. This problem was
described by H. Messatfa (Journal of Classification, 1992, pp. 5–15), who provided a heuristic procedure based on the linear transportation problem. We present an exact solution
procedure using binary integer programming. We demonstrate that our proposed method efficiently obtains optimal solutions
for problems of practical size.
We would like to thank the Editor, Willem Heiser, and an anonymous reviewer for helpful comments that resulted in improvements
of this article. 相似文献
10.
An error variance approach to two-mode hierarchical clustering 总被引:2,自引:2,他引:0
A new agglomerative method is proposed for the simultaneous hierarchical clustering of row and column elements of a two-mode
data matrix. The procedure yields a nested sequence of partitions of the union of two sets of entities (modes). A two-mode
cluster is defined as the union of subsets of the respective modes. At each step of the agglomerative process, the algorithm
merges those clusters whose fusion results in the smallest possible increase in an internal heterogeneity measure. This measure
takes into account both the variance within the respective cluster and its centroid effect defined as the squared deviation
of its mean from the maximum entry in the input matrix. The procedure optionally yields an overlapping cluster solution by
assigning further row and/or column elements to clusters existing at a preselected hierarchical level. Applications to real
data sets drawn from consumer research concerning brand-switching behavior and from personality research concerning the interaction
of behaviors and situations demonstrate the efficacy of the method at revealing the underlying two-mode similarity structure. 相似文献
11.
The Metric Cutpoint Partition Problem 总被引:1,自引:1,他引:0
Let G = (V, E,w) be a graph with vertex and edge sets V and E, respectively, and w: E → a function which assigns a positive weight or length to each edge of G. G is called a realization of a finite metric space (M, d), with M = {1, ..., n} if and only if {1, ..., n} ⊆ V and d(i, j) is equal to the length of the shortest chain linking i and j in G ∀i, j = 1, ..., n. A realization G of (M, d), is called optimal if the sum of its weights is minimal among all the realizations of (M, d). A cutpoint in a graph G is a vertex whose removal strictly increases the number of connected components of G. The Metric Cutpoint Partition Problem is to determine if a finite metric space (M, d) has an optimal realization containing a cutpoint. We prove in this paper that this problem is polynomially solvable. We
also describe an algorithm that constructs an optimal realization of (M, d) from optimal realizations of subspaces that do not contain any cutpoint.
Supported by grant PA002-104974/2 from the Swiss National Science Foundation.
Published online xx, xx, xxxx. 相似文献
12.
T clusters, based on J distinct, contributory
partitions (or, equivalently, J polytomous attributes). We describe
a new model/algorithm for implementing this objective. The method's objective
function incorporates a modified Rand measure, both in initial cluster selection
and in subsequent refinement of the starting partition. The method is applied to
both synthetic and real data. The performance of the proposed model is compared
to latent class analysis of the same data set. 相似文献
13.
k consisting of k clusters, with k > 2. Bottom-up agglomerative approaches are also commonly used to construct partitions,
and we discuss these in terms of worst-case performance for metric data sets. Our main contribution derives from a new restricted
partition formulation that requires each cluster to be an interval of a given ordering of the objects being clustered. Dynamic programming can optimally split such an ordering into a partition Pk for a large class of objectives that includes min-diameter. We explore a variety of ordering heuristics and show that our
algorithm, when combined with an appropriate ordering heuristic, outperforms traditional algorithms on both random and non-random
data sets. 相似文献
14.
The median procedure for n-trees 总被引:2,自引:2,他引:0
Let (X,d) be a metric space The functionM:X
k 2
x
defined by
is the minimum } is called themedian procedure and has been found useful in various applications involving the notion of consensus Here we present axioms that characterizeM whenX is a certain class of trees (hierarchical classifications), andd is the symmetric difference metricWe would like to thank the referees and Editor for helpful comments 相似文献
15.
16.
Single linkage clusters on a set of points are the maximal connected sets in a graph constructed by connecting all points
closer than a given threshold distance. The complete set of single linkage clusters is obtained from all the graphs constructed
using different threshold distances. The set of clusters forms a hierarchical tree, in which each non-singleton cluster divides
into two or more subclusters; the runt size for each single linkage cluster is the number of points in its smallest subcluster.
The maximum runt size over all single linkage clusters is our proposed test statistic for assessing multimodality. We give
significance levels of the test for two null hypotheses, and consider its power against some bimodal alternatives.
Research partially supported by NSF Grant No. DMS-8617919. 相似文献
17.
Elena Rubei 《Journal of Classification》2016,33(2):282-297
Let \( \mathcal{G} \) = (G,w) be a weighted simple finite connected graph, that is, let G be a simple finite connected graph endowed with a function w from the set of the edges of G to the set of real numbers. For any subgraph G′ of G, we define w(G′) to be the sum of the weights of the edges of G′. For any i, j vertices of G, we define D {i,j}(\( \mathcal{G} \)) to be the minimum of the weights of the simple paths of G joining i and j. The D {i,j}(\( \mathcal{G} \)) are called 2-weights of \( \mathcal{G} \). Weighted graphs and their reconstruction from 2-weights have applications in several disciplines, such as biology and psychology.Let \( {\left\{{m}_I\right\}}_{I\in \left(\frac{\left\{1,\dots, n\right\}}{2}\right)} \) and \( {\left\{{M}_I\right\}}_{I\in \left(\frac{\left\{1,\dots, n\right\}}{2}\right)} \) be two families of positive real numbers parametrized by the 2-subsets of {1, …, n} with m I ≤ M I for any I; we study when there exist a positive-weighted graph G and an n-subset {1, …, n} of the set of its vertices such that D I (\( \mathcal{G} \)) ∈ [m I ,M I ] for any \( I\in \left(\frac{\left\{1,\dots, n\right\}}{2}\right) \). Then we study the analogous problem for trees, both in the case of positive weights and in the case of general weights. 相似文献
18.
Aggregation of equivalence relations 总被引:1,自引:1,他引:0
Each ofn attributes partitions a set of items into equivalence classes. Aconsistent aggregator of then partitions is defined as an aggregate partition that satisfies an independence condition and a unanimity condition. It is shown that the class of consistent aggregators is precisely the class ofconjunctive aggregators. That is, for each consistent aggregator there is a nonempty subsetN of the attributes such that two items are equivalent in the aggregate partition if and only if they are equivalent with respect to each attribute inN. 相似文献
19.
Optimal Variable Weighting for Ultrametric and Additive Trees and K-means Partitioning: Methods and Software 总被引:1,自引:0,他引:1
K -means partitioning. We also describe some new features and improvements to the algorithm proposed by De Soete. Monte Carlo
simulations have been conducted using different error conditions. In all cases (i.e., ultrametric or additive trees, or K-means partitioning), the simulation results indicate that the optimal weighting procedure should be used for analyzing data
containing noisy variables that do not contribute relevant information to the classification structure. However, if the data
involve error-perturbed variables that are relevant to the classification or outliers, it seems better to cluster or partition
the entities by using variables with equal weights. A new computer program, OVW, which is available to researchers as freeware,
implements improved algorithms for optimal variable weighting for ultrametric and additive tree clustering, and includes a
new algorithm for optimal variable weighting for K-means partitioning. 相似文献
20.
In agglomerative hierarchical clustering, pair-group methods suffer from a problem of non-uniqueness when two or more distances
between different clusters coincide during the amalgamation process. The traditional approach for solving this drawback has
been to take any arbitrary criterion in order to break ties between distances, which results in different hierarchical classifications
depending on the criterion followed. In this article we propose a variable-group algorithm that consists in grouping more
than two clusters at the same time when ties occur. We give a tree representation for the results of the algorithm, which
we call a multidendrogram, as well as a generalization of the Lance andWilliams’ formula which enables the implementation of the algorithm in a recursive
way.
The authors thank A. Arenas for discussion and helpful comments. This work was partially supported by DGES of the Spanish
Government Project No. FIS2006–13321–C02–02 and by a grant of Universitat Rovira i Virgili. 相似文献