Automatic Selection of Chinese Stoplist (中文停用词表的自动选取)


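Cite this article:	GU Yi-jun, FAN Xiao-zhong, WANG Jian-hua, WANG Tao, HUANG Wei-jin. Automatic Selection of Chinese Stoplist [J]. Journal of Beijing Institute of Technology, 2005, 25(4): 337-340.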

Authors:	GU Yi-jun, FAN Xiao-zhong, WANG Jian-hua, WANG Tao, HUANG Wei-jin

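Affiliations:	1. Department of Computer Science and Engineering, School of Information Science and Technology, Beijing Institute of Technology, Beijing 100081, China; 2. Department of Information Security Engineering, China Public Security University (中国公安大学), Beijing 100038, China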

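Abstract:	After examining existing statistics-based methods for selecting stop words, this paper proposes a new selection method. For each word, the method computes the probability that the word occurs within each sentence of the corpus and the probability, within the corpus, of the sentences that contain the word. It then computes the joint entropy of the two and selects stop words according to that joint entropy. The resulting stoplist is compared with stoplists selected by traditional methods, and the effect of each method on classification performance, when used as a preprocessing step for text categorization, is also compared. Experimental results show that the method better avoids the influence of the corpus's writing format on stop-word selection and is better suited than traditional methods to preprocessing for text categorization.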
Keywords:	stop word; Chinese stoplist; joint entropy
Article ID:	1001-0645(2005)04-0337-04
Received:	6/3/2004
Revised:	June 3, 2004
Automatic Selection of Chinese Stoplist

GU Yi-jun, FAN Xiao-zhong, WANG Jian-hua, WANG Tao and HUANG Wei-jin. Automatic Selection of Chinese Stoplist [J]. Journal of Beijing Institute of Technology (Natural Science Edition), 2005, 25(4): 337-340.

Authors:	GU Yi-jun, FAN Xiao-zhong, WANG Jian-hua, WANG Tao and HUANG Wei-jin

Institution:	Department of Computer Science and Engineering, School of Information Science and Technology, Beijing Institute of Technology, Beijing 100081, China; Department of Information Security Science and Technology, China Security University, Beijing 100038, China

Abstract:	After investigating existing statistical methods for automatically selecting stop words, a new method is proposed. For each word, the method calculates the probability that the word occurs in each sentence of the corpus and the probability, in the corpus, of the sentences that contain the word. It then calculates the joint entropy of these probabilities and selects stop words according to that entropy. The stoplist produced by this method is compared with those produced by traditional methods, and the effects of the various methods on text categorization, when used for preprocessing, are also compared. The experiments show that the method is better at avoiding the impact of the corpus's writing style and format on stoplist selection, and is more suitable than traditional methods for preprocessing in text categorization.

Keywords:	stop word; Chinese stoplist; joint entropy
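
The abstract names the ingredients of the method (per-sentence word probabilities, the probability of sentences containing the word, and a joint entropy) but does not give the formulas. The Python sketch below is therefore only one possible reading, not the authors' implementation. It assumes the corpus is already segmented into sentences of tokens, measures how evenly a word's occurrences are spread over sentences, measures the entropy of the "sentence contains the word" indicator, and, under an added independence assumption, sums the two as a joint entropy. The function names `stopword_scores` and `select_stoplist` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def stopword_scores(sentences):
    """Score words by a sum of two entropies (an assumed formulation,
    not the formula from the paper)."""
    n = len(sentences)
    per_sentence = [Counter(tokens) for tokens in sentences]
    total = Counter()        # total occurrences of each word in the corpus
    containing = Counter()   # number of sentences that contain each word
    for counts in per_sentence:
        total.update(counts)
        containing.update(counts.keys())

    scores = {}
    for word in total:
        # Entropy of the word's occurrences distributed across sentences.
        h_spread = -sum(
            (c[word] / total[word]) * math.log2(c[word] / total[word])
            for c in per_sentence
            if word in c
        )
        # Binary entropy of the event "a sentence contains the word".
        q = containing[word] / n
        if q in (0.0, 1.0):
            h_presence = 0.0
        else:
            h_presence = -(q * math.log2(q) + (1 - q) * math.log2(1 - q))
        # Assumption: the two variables are independent, so the joint
        # entropy is the sum of the individual entropies.
        scores[word] = h_spread + h_presence
    return scores


def select_stoplist(sentences, k):
    """Return the k highest-scoring words, ties broken alphabetically."""
    scores = stopword_scores(sentences)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy, pre-segmented corpus (hypothetical data for illustration only).
corpus = [
    ["我", "的", "猫"],
    ["你", "的", "狗"],
    ["他", "的", "书"],
    ["猫", "和", "狗"],
]
print(select_stoplist(corpus, 1))  # → ['的']
```

In this reading, a word that is spread evenly across many sentences scores high, which fits the abstract's claim of reducing sensitivity to document-level formatting. The binary-entropy term peaks when a word appears in half of the sentences, so a real implementation following the paper may weight or define the terms differently.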
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.