
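Research on Chinese Short-Text Classification Based on a Two-Step Strategy
Cite this article: FAN Xing-hua, WANG Peng. Research on Chinese short-text classification based on a two-step strategy [J]. Journal of Dalian Maritime University (Natural Science Edition), 2008, 34(3).
Authors: FAN Xing-hua  WANG Peng
Affiliation: Institute of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065 (both authors)
Funding: National Natural Science Foundation of China; Natural Science Foundation of Chongqing; Chongqing Municipal Education Commission funded project; Scientific Research Foundation for Returned Overseas Scholars, Ministry of Education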
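Abstract: To mine text information more effectively, this paper examines three key issues in applying a two-step strategy to Chinese short-text classification and proposes a two-step method that combines naive Bayes (NB) and k-nearest neighbor (KNN) classifiers. (1) The outputs of NB and KNN are used directly to construct a corresponding two-dimensional space for each classifier, and the test set is divided into three parts according to the distribution of misclassified texts in that space: set A, texts that KNN can classify reliably; set B, texts that KNN cannot classify reliably but NB can; and set C, all remaining texts. (2) KNN and NB classify sets A and B respectively, and texts in set C are assigned labels directly according to the class distribution of the training corpus. Comparative experiments against NB, KNN, and the support vector machine (SVM) show that the method achieves relatively high classification performance.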

Keywords: Chinese short text  text classification  two-step strategy  naive Bayes (NB)  k-nearest neighbor (KNN)

Chinese short-text classification in two-steps
FAN Xing-hua, WANG Peng. Chinese short-text classification in two-steps [J]. Journal of Dalian Maritime University, 2008, 34(3).
Authors: FAN Xing-hua  WANG Peng
Abstract: Three key issues in the two-step classification of Chinese short texts are examined in order to mine text information effectively, and a method that combines naive Bayesian (NB) and k-nearest neighbor (KNN) classifiers is developed for the task. First, the test collection is divided into three parts: part A, which KNN can classify reliably; part B, which KNN cannot classify reliably but NB can; and part C, the remainder. The division is made by using the outputs of the NB and KNN classifiers to construct a corresponding two-dimensional space for each classifier and then partitioning the texts according to the distribution of misclassified texts in that space. Second, parts A and B are classified by KNN and NB respectively, and the texts in part C are assigned labels directly according to the category distribution of the training data. Experimental results show that the proposed method achieves high performance compared with KNN, NB, and the support vector machine (SVM).
Keywords: Chinese short-text  text classification  two-step strategy  naive Bayesian (NB)  k-nearest neighbor (KNN)
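
The routing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `two_step_classify`, `knn_reliable`, and `nb_reliable` are hypothetical, and the reliability tests are passed in as plain callables because the abstract does not specify how the regions of the two-dimensional output space are drawn. Assigning the most frequent training label to part C is likewise only one assumed reading of "according to the distribution of categories in the training data".

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence

Label = Hashable


def two_step_classify(
    texts: Sequence[str],
    knn_predict: Callable[[str], Label],
    nb_predict: Callable[[str], Label],
    knn_reliable: Callable[[str], bool],
    nb_reliable: Callable[[str], bool],
    train_labels: Sequence[Label],
) -> List[Label]:
    """Route each text to KNN (part A), NB (part B), or a fallback (part C).

    All callables are hypothetical stand-ins supplied by the caller; the
    paper derives the reliability regions from the classifiers' outputs,
    which this sketch does not attempt to reproduce.
    """
    if not train_labels:
        raise ValueError("train_labels must not be empty")

    # Assumption: part C receives the most frequent training label.
    fallback = Counter(train_labels).most_common(1)[0][0]

    results: List[Label] = []
    for text in texts:
        if knn_reliable(text):
            # Part A: KNN is considered reliable for this text.
            results.append(knn_predict(text))
        elif nb_reliable(text):
            # Part B: KNN is not reliable, but NB is.
            results.append(nb_predict(text))
        else:
            # Part C: neither classifier is reliable.
            results.append(fallback)
    return results
```

In practice the two reliability callables would wrap thresholds or regions learned from held-out classifier outputs. Keeping them as parameters lets the routing order (KNN first, then NB, then the fallback) be tested separately from that choice.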
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.