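An Improved KNN Text Classification Algorithm Based on DBSCAN Clustering
Cite this article: Gou Heping (苟和平). An Improved KNN Text Classification Algorithm Based on DBSCAN Clustering[J]. Science Technology and Engineering (科学技术与工程), 2013, 13(1): 219-222.
Author: Gou Heping (苟和平)
Affiliations: 1. Department of Information Technology, Qiongtai Teachers College, Haikou 571100
2. College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070
Funding: Key Project of Science and Technology Research, Ministry of Education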
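Abstract: When classifying, the K-nearest-neighbor (KNN) algorithm must compute the similarity between the sample to be classified and every sample in the training set. When the training set is large, this computation is expensive and classification efficiency drops. This paper therefore proposes an improved algorithm based on DBSCAN clustering. DBSCAN clustering is used to eliminate noisy data from the training samples. In addition, samples in the core sample set are pruned according to a sample-similarity threshold and their density, which reduces the number of training samples whose similarity to the sample being classified must be computed. Experiments show that the algorithm effectively reduces the classification computation while keeping classification ability essentially unchanged.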
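The cost described in the abstract can be made concrete with a minimal sketch of plain KNN. This is not the paper's code: the cosine similarity measure, the function names `cosine` and `knn_classify`, and the toy term-weight vectors are all assumptions chosen for illustration. The point is only that each query requires one similarity computation per training sample.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_classify(query, train_vectors, train_labels, k=3):
    """Label `query` by majority vote among its k most similar training samples."""
    # One similarity computation per training sample: this linear scan is
    # the per-query cost that grows with the size of the training set.
    scored = sorted(
        ((cosine(query, x), y) for x, y in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three-term document vectors in two categories.
train_vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
train_labels = ["sports", "sports", "finance", "finance"]
predicted = knn_classify([0.8, 0.3, 0.1], train_vectors, train_labels, k=3)
```

Because the scan touches every training vector, shrinking the training set directly shrinks the per-query work, which is the motivation for pruning.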

Keywords: K-nearest neighbor  text classification  sample pruning
Received: 8/24/2012 1:04:14 AM
Revised: 9/26/2012 9:13:40 PM

An Improved KNN Text Categorization Algorithm Based on DBSCAN
Gou Heping. An Improved KNN Text Categorization Algorithm Based on DBSCAN[J]. Science Technology and Engineering, 2013, 13(1): 219-222.
Authors: Gou Heping
Institution: 1. Department of Information Technology, Qiongtai Teachers College, Haikou 571100, P.R. China; 2. College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, P.R. China
Abstract: To find the k nearest neighbors of a test sample, the KNN algorithm must calculate the similarity between the test sample and every training sample in the sample space; as the number of training samples increases, the computational overhead grows. To address this problem, this paper proposes an improved algorithm based on DBSCAN that reduces the number of training samples. Noisy data in the sample space are removed with the DBSCAN algorithm; furthermore, highly similar samples in the core set of the training data are pruned according to a similarity threshold and density. Experiments show that the improved method effectively reduces computational overhead.
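The two reduction steps can be sketched as follows. The abstract does not give the exact pruning rule, so everything here is an assumption for illustration: the function name `reduce_training_set`, the parameters `eps_sim`, `min_pts`, and `prune_sim`, the use of cosine similarity as the neighborhood measure, and the "visit densest samples first, drop near-duplicates" rule. The sketch uses DBSCAN's core/border/noise definitions but skips cluster labeling, since only noise removal is needed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reduce_training_set(vectors, eps_sim=0.9, min_pts=3, prune_sim=0.98):
    """Return sorted indices of the training samples kept after reduction.

    Assumed rule, not taken from the paper:
    1. A sample is a core sample if at least `min_pts` samples (itself
       included) have similarity >= `eps_sim` to it.
    2. A non-core sample near a core sample is a border sample and is kept;
       every other non-core sample is noise and is dropped.
    3. Core samples are visited from densest to sparsest; one is kept only
       if its similarity to every already-kept core sample is < `prune_sim`.
    """
    n = len(vectors)
    # Similarity-based neighborhoods (each sample is its own neighbor).
    neighbors = [
        [j for j in range(n) if cosine(vectors[i], vectors[j]) >= eps_sim]
        for i in range(n)
    ]
    core = [i for i in range(n) if len(neighbors[i]) >= min_pts]
    core_set = set(core)
    border = [
        i for i in range(n)
        if i not in core_set and any(j in core_set for j in neighbors[i])
    ]

    kept_core = []
    for i in sorted(core, key=lambda idx: len(neighbors[idx]), reverse=True):
        if all(cosine(vectors[i], vectors[j]) < prune_sim for j in kept_core):
            kept_core.append(i)
    return sorted(kept_core + border)


# Hypothetical toy data: two near-duplicate pairs, one outlier, one border sample.
vectors = [
    [1.0, 0.0], [1.0, 0.05],   # near-duplicates of each other
    [1.0, 0.3], [1.0, 0.35],   # near-duplicates of each other
    [0.0, 1.0],                # isolated sample, treated as noise
    [1.0, -0.45],              # close to only one core sample: border
]
kept = reduce_training_set(vectors)
```

In a text classification setting one would probably run such a reduction separately per category, so that a sample is never pruned in favor of a similar sample from a different class. The pairwise neighborhood computation is quadratic in the number of training samples, but it is a one-time preprocessing cost rather than a per-query cost.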
Keywords:KNN  text classification  sample reduction
This article is indexed in CNKI, Wanfang Data (万方数据), and other databases.