Chinese text classification based on improved K Nearest Neighbor algorithm
Citation: HUANG Chao, CHEN Junhua. Chinese text classification based on improved K Nearest Neighbor algorithm[J]. Journal of Shanghai Normal University (Natural Sciences), 2019, 48(1): 96-101.
Authors: HUANG Chao, CHEN Junhua
Institution: College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234, China
Abstract: To address the high dimensionality of text in text classification, this paper proposes a feature extraction method that combines document frequency (DF) with the chi-square statistic. The method prunes the feature terms, lowers the text dimension, and improves classification accuracy. The standard K Nearest Neighbor (KNN) algorithm must compute the similarity between each text to be classified and a large number of training samples. To reduce this cost, a KNN algorithm based on grouped center vectors is proposed: the samples within each category are divided into groups, the center vector of each group is computed, and these center vectors replace the original training set in the similarity computation. This lowers the computational complexity and improves the classification performance of the algorithm. Experiments show that, compared with the traditional KNN algorithm, the improved algorithm achieves higher precision, recall, and F-measure, and that it has certain advantages over other classification algorithms.
Keywords: text classification; K Nearest Neighbor (KNN) algorithm; feature extraction; similarity
Received: 2017-09-06
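
The two ideas summarized in the abstract, DF plus chi-square feature selection and KNN over grouped center vectors, can be illustrated with a small Python sketch. It is not the authors' implementation, and the abstract does not give the details, so the following are all assumptions made for illustration: documents arrive already segmented into tokens, terms below a DF threshold (`min_df`) are dropped before ranking by their maximum chi-square score over classes, groups are formed by taking consecutive samples within a class (`group_size`), similarity is cosine over raw term counts, and the final vote is weighted by similarity. All function and parameter names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def chi_square(doc_sets, labels, term, cls):
    """Chi-square statistic of `term` for class `cls` (2x2 contingency table)."""
    a = b = c = d = 0
    for tokens, label in zip(doc_sets, labels):
        if term in tokens:
            if label == cls: a += 1   # has term, in class
            else: b += 1              # has term, other class
        else:
            if label == cls: c += 1   # lacks term, in class
            else: d += 1              # lacks term, other class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(docs, labels, min_df=2, top_n=1000):
    """Assumed combination: DF filter first, then rank by max chi-square."""
    doc_sets = [set(d) for d in docs]
    df = Counter(t for s in doc_sets for t in s)
    candidates = [t for t, n in df.items() if n >= min_df]
    classes = set(labels)
    score = {t: max(chi_square(doc_sets, labels, t, c) for c in classes)
             for t in candidates}
    return sorted(candidates, key=lambda t: (-score[t], t))[:top_n]

def vectorize(tokens, features):
    """Term-count vector restricted to the selected features."""
    counts = Counter(tokens)
    return [float(counts[f]) for f in features]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def group_centroids(vectors, labels, group_size=2):
    """Split each class into consecutive groups and average each group."""
    by_class = defaultdict(list)
    for v, y in zip(vectors, labels):
        by_class[y].append(v)
    centroids = []
    for y, vs in by_class.items():
        for i in range(0, len(vs), group_size):
            group = vs[i:i + group_size]
            center = [sum(col) / len(group) for col in zip(*group)]
            centroids.append((center, y))
    return centroids

def classify(vector, centroids, k=3):
    """KNN over the group centers with a similarity-weighted vote."""
    nearest = sorted(centroids, key=lambda c: cosine(vector, c[0]),
                     reverse=True)[:k]
    votes = defaultdict(float)
    for center, y in nearest:
        votes[y] += cosine(vector, center)
    return max(votes, key=votes.get)

# Toy, made-up corpus (already tokenized); not data from the paper.
docs = [
    ["football", "match", "goal"], ["football", "team", "goal"],
    ["match", "team", "score"], ["goal", "score", "team"],
    ["stock", "market", "price"], ["stock", "bank", "price"],
    ["market", "bank", "rate"], ["price", "rate", "bank"],
]
labels = ["sports"] * 4 + ["finance"] * 4

features = select_features(docs, labels, min_df=2, top_n=6)
vectors = [vectorize(d, features) for d in docs]
centroids = group_centroids(vectors, labels, group_size=2)
print(len(vectors), "training vectors reduced to", len(centroids), "group centers")
print(classify(vectorize(["football", "goal", "team"], features), centroids))
```

In this sketch a query is compared against one center per group instead of every training sample, so the number of similarity computations falls roughly by a factor of `group_size`. How the groups should be formed and how large they should be are choices the abstract leaves open.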
This article is indexed in CNKI and other databases.