Training Sample Pruning Algorithm for KNN Classifiers on Imbalanced Data Sets
Citation: Gou Heping. Training sample pruning algorithm for KNN classifiers on imbalanced data sets[J]. Science Technology and Engineering, 2013, 13(16): 4720-4723
Author: Gou Heping
Affiliation: Qiongtai Teachers College
Funding: Key Project of Science and Technology Research of the Ministry of Education (208148); Hainan Provincial Natural Science Foundation (612136); Qiongtai Teachers College Project (qtkz201115)
Abstract: To address the high computational cost of sample-similarity calculation in KNN classification and the large classification error on minority classes when handling imbalanced data sets, a density-based training-sample pruning algorithm for imbalanced data sets is proposed. Each class in the training set is clustered, noise data are removed, and the average similarity and average sample density of each class are computed to derive a similarity threshold for pruning. Samples within a class whose similarity falls below the class threshold are then merged, reducing the total number of training samples. Experiments show that this pruning algorithm effectively reduces the computational cost of classification while keeping the classification performance of KNN essentially stable, and improves minority-class classification performance to some extent.

Keywords: KNN classification; clustering; sample pruning; density; similarity
Received: 2013-02-04
Revised: 2013-02-04

Algorithm for reducing training data on imbalanced data sets in KNN text classification
Gou Heping. Algorithm for reducing training data on imbalanced data sets in KNN text classification[J]. Science Technology and Engineering, 2013, 13(16): 4720-4723
Authors: Gou Heping
Affiliation: Department of Information Technology, Qiongtai Teachers College, Haikou 571100, P.R. China; College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, P.R. China
Abstract: The KNN classifier incurs a high computational overhead from similarity computation, and its prediction performance on minority classes is poor when it is applied to imbalanced data sets. To address this, an algorithm for reducing the training data of imbalanced data sets is presented. It partitions each class into several clusters, deletes noise data, and computes the average similarity and average density of each class. Samples of a class whose similarity is smaller than the resulting threshold are then merged to reduce the number of training samples. Experiments show that the method significantly reduces the computational overhead, improves the classification performance of the minority class, and maintains the classification stability of the KNN algorithm.
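The per-class pruning step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: cosine similarity to the class centroid stands in for the intra-class similarity measure, and the class's average similarity alone serves as the pruning threshold (the paper derives its threshold from both average similarity and average sample density, after a clustering and noise-removal pass). The function names `prune_class` and `prune_training_set` are hypothetical.

```python
import numpy as np

def prune_class(samples):
    """Reduce one class's training samples (simplified sketch).

    Assumption: similarity of a sample to its class is approximated by
    cosine similarity to the class centroid; the class's mean similarity
    is used as the pruning threshold.
    """
    X = np.asarray(samples, dtype=float)
    centroid = X.mean(axis=0)
    # Cosine similarity of every sample to the class centroid.
    sims = (X @ centroid) / (
        np.linalg.norm(X, axis=1) * np.linalg.norm(centroid) + 1e-12
    )
    threshold = sims.mean()
    keep = X[sims >= threshold]          # representative samples are kept
    low = X[sims < threshold]            # low-similarity samples are merged
    if len(low):
        keep = np.vstack([keep, low.mean(axis=0)])  # collapse to one sample
    return keep

def prune_training_set(X, y):
    """Apply per-class pruning; returns the reduced (X, y)."""
    Xs, ys = [], []
    for label in np.unique(y):
        reduced = prune_class(X[y == label])
        Xs.append(reduced)
        ys.append(np.full(len(reduced), label))
    return np.vstack(Xs), np.concatenate(ys)
```

Because each class is pruned independently, the majority class (with more redundant samples below its threshold) tends to shrink more than the minority class, which is consistent with the goal of easing the imbalance while cutting KNN's similarity-computation cost.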
Keywords: KNN text classification; clustering; sample reducing; density; similarity
This document is indexed in CNKI and other databases.