
KNN-based even sampling data preprocessing algorithm for clustering large-scale datasets

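Cite this article:	JI Chengheng, LEI Yongmei. KNN-based even sampling data preprocessing algorithm for clustering large-scale datasets[J]. Journal of Shanghai University (Natural Science), 2016, 22(1): 28-35.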

Authors:	JI Chengheng, LEI Yongmei

Affiliation:	School of Computer Engineering and Science, Shanghai University, Shanghai 200444

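Funding:	Key Discipline Project funded by the Shanghai Municipal Education Commission (12ZZ09); project funded by the Science and Technology Commission of Shanghai Municipality (13DZ118800)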

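Abstract:	To address the low efficiency and high storage overhead of density-based clustering algorithms on large-scale datasets, a sliced, spatially even sampling algorithm based on K-nearest-neighbor relations is proposed as a data preprocessing step for clustering applications. The dataset is split into slices, the K nearest neighbors of part of the samples are removed in descending order of density, and the remaining samples are taken as the sample set. This reduces the data size and improves computational efficiency while preserving accuracy. Experimental results show that, when the dataset is large and the accuracy of the clustering result is preserved, reducing the size of the data to be clustered effectively improves clustering efficiency.
Keywords:	K nearest neighbors; clustering; spatial even sampling; density descending order
Received:	2015-11-20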
KNN-based even sampling preprocessing algorithm for big dataset

JI Chengheng, LEI Yongmei. KNN-based even sampling preprocessing algorithm for big dataset[J]. Journal of Shanghai University (Natural Science), 2016, 22(1): 28-35.

Authors:	JI Chengheng, LEI Yongmei

Affiliation:	School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China

Abstract:	To address the low efficiency and high storage overhead of density-based clustering algorithms, an even data sampling algorithm based on K nearest neighbors (KNN) is proposed as a data preprocessing method for clustering applications. The algorithm slices the dataset and samples it evenly: after slicing, it proceeds in descending order of density and, for part of the samples, removes each sample's K nearest neighbors. The remaining samples are then used as the sample dataset. Experimental results show that, when the dataset is large and accuracy is preserved, the sampling algorithm effectively improves clustering efficiency by reducing the amount of data that has to be clustered.

Keywords:	K nearest neighbors (KNN); clustering; density descending order; spatial even sampling
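
The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how density is measured, how the dataset is sliced, or whether an already retained sample may later be removed. The sketch assumes that density is the inverse of the mean distance to the K nearest neighbors, that slices are equal-sized chunks along the first coordinate, and that retained samples are never removed afterwards. The names `sample_slice` and `knn_even_sample` are hypothetical.

```python
import math
import random


def sample_slice(points, k):
    """Return positions (within `points`) of the samples kept in one slice."""
    n = len(points)
    if n <= k:
        return list(range(n))  # too few points to thin out; keep them all
    neighbors, density = [], []
    for i, p in enumerate(points):
        # Brute-force K nearest neighbors of p, excluding p itself.
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        neighbors.append([j for _, j in ranked])
        mean_dist = sum(d for d, _ in ranked) / k
        # Assumed density measure: inverse mean distance to the K neighbors.
        density.append(1.0 / (mean_dist + 1e-12))
    kept, removed = set(), set()
    # Visit samples from densest to sparsest.
    for i in sorted(range(n), key=lambda idx: -density[idx]):
        if i in removed:
            continue  # already dropped as a neighbor of a denser sample
        kept.add(i)
        # Drop this sample's K neighbors, except ones already retained
        # (an assumption; the abstract does not specify this case).
        removed.update(j for j in neighbors[i] if j not in kept)
    return sorted(kept)


def knn_even_sample(points, k=3, n_slices=4):
    """Return indices of the retained samples, in ascending order."""
    # Assumed slicing rule: equal-sized chunks along the first coordinate.
    order = sorted(range(len(points)), key=lambda idx: points[idx][0])
    size = math.ceil(len(order) / n_slices) if order else 0
    kept = []
    for start in range(0, len(order), max(size, 1)):
        chunk = order[start:start + size]
        local = sample_slice([points[i] for i in chunk], k)
        kept.extend(chunk[pos] for pos in local)
    return sorted(kept)


# Usage: thin out 400 random 2-D points before handing them to a clusterer.
random.seed(0)
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(400)]
sample = knn_even_sample(data, k=3, n_slices=4)
print(len(data), "->", len(sample))
```

Because each slice is processed independently, the pairwise distance work in this sketch grows with the square of the slice size rather than the full dataset size, which is one plausible reading of why the abstract slices the data first.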
This article is indexed in CNKI, Wanfang Data, and other databases.