Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads |
Authors: | Mark Ming-Tso Chiang, Boris Mirkin |
Institution: | 1. Department of Computer Science & Information Systems, Birkbeck University of London, London, UK; 2. State University - Higher School of Economics, Moscow, Russia |
Abstract: | The issue of determining “the right number of clusters” in K-Means has attracted considerable interest, especially in
recent years. Cluster intermix appears to be the factor that most affects the clustering results. This paper proposes an experimental
setting for comparing different approaches on data generated from Gaussian clusters, with controlled parameters of
between- and within-cluster spread used to model cluster intermix. The setting allows centroid recovery to be evaluated alongside
the conventional evaluation of cluster recovery. The subjects of our interest are two versions of the “intelligent” K-Means method, iK-Means, which find the “right” number of clusters by extracting “anomalous patterns” from the data one by one. We compare them
with seven other methods, including Hartigan’s rule, averaged Silhouette width and the Gap statistic, under different between-
and within-cluster spread-shape conditions. The results of our experiments show several consistent patterns; for example,
Hartigan’s rule reproduces the right K best, but not the clusters or their centroids. This leads us to propose an adjusted version
of iK-Means, which performs well in the current experimental setting. |
This article has been indexed by SpringerLink and other databases. |