Study of Chinese Text Categorization (中文文本分类研究)

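Cite this article:	HAO Xiao-yan, CHANG Xiao-ming. Study of Chinese Text Categorization [J]. Journal of Taiyuan University of Technology, 2006, 37(6): 710-713.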

Authors:	HAO Xiao-yan, CHANG Xiao-ming

Affiliation:	College of Computer and Software, Taiyuan University of Technology, Taiyuan, Shanxi 030024, China

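Abstract:	This study applies k-nearest neighbor (kNN), support vector machines (SVM), and maximum entropy models to Chinese text categorization. For these three widely used methods, Chinese text categorization experiments were carried out with two feature representations: Boolean values of feature terms and frequencies of feature terms. The results show that, under identical conditions, maximum entropy gives the best classification performance, followed by SVM, with kNN slightly behind. Introducing term-frequency information into classification changes classifier performance slightly: accuracy drops by 1% to 2% for maximum entropy, rises somewhat for kNN, and stays about the same for SVM. Setting aside effects specific to the texts used, this indicates that differing amounts of term information affect different machine learning algorithms differently.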
Keywords:	text categorization; k-nearest neighbor; support vector machines; maximum entropy
Article ID:	1007-9432(2006)06-0710-04
Received:	2006-01-20
Revised:	2006-01-20
Study of Chinese Text Categorization

HAO Xiao-yan, CHANG Xiao-ming. Study of Chinese Text Categorization [J]. Journal of Taiyuan University of Technology, 2006, 37(6): 710-713.

Authors:	HAO Xiao-yan, CHANG Xiao-ming

Abstract:	In this paper, we compare three models for text categorization: k-nearest neighbor, support vector machines, and maximum entropy. Using two training data sets, prepared by term selection and by removing irrelevant data, respectively, we carry out experiments with the three models. The results show that maximum entropy performs best of the three, outperforming the other two classifiers both with Boolean features and when word-frequency information is added. We also find that adding word-frequency information changes the classifiers' performance. Setting aside the influence of the particular documents used, this result suggests that different kinds of term sets affect the performance of different classifiers differently.

Keywords:	text categorization; k-nearest neighbor; support vector machines; maximum entropy
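
The abstract compares two feature representations: Boolean values of feature terms and frequencies of feature terms. A minimal Python sketch of the difference follows. It assumes each document has already been segmented into words and that a list of feature terms has been selected; the function names, the sample tokens, and the vocabulary are hypothetical illustrations and are not taken from the paper.

```python
from collections import Counter


def boolean_features(tokens, vocabulary):
    """Return 1 for each feature term that occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def frequency_features(tokens, vocabulary):
    """Return the raw occurrence count of each feature term in the document."""
    counts = Counter(tokens)  # a missing term counts as 0
    return [counts[term] for term in vocabulary]


# Hypothetical feature terms: "football", "match", "stock".
vocabulary = ["足球", "比赛", "股票"]
# Hypothetical word-segmented document: "football match football fans".
tokens = ["足球", "比赛", "足球", "球迷"]

print(boolean_features(tokens, vocabulary))    # [1, 1, 0]
print(frequency_features(tokens, vocabulary))  # [2, 1, 0]
```

Either vector could then be passed to a kNN, SVM, or maximum entropy classifier. The two vectors differ only where a feature term occurs more than once in the document.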
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
