
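Automatic Selection of Chinese Stoplist (中文停用词表的自动选取)
Cite this article:GU Yi-jun, FAN Xiao-zhong, WANG Jian-hua, WANG Tao, HUANG Wei-jin. Automatic Selection of Chinese Stoplist[J]. Journal of Beijing Institute of Technology, 2005, 25(4):337-340.
Authors:GU Yi-jun (顾益军)  FAN Xiao-zhong (樊孝忠)  WANG Jian-hua (王建华)  WANG Tao (汪涛)  HUANG Wei-jin (黄维金)
Affiliations:1. Department of Computer Science and Engineering, School of Information Science and Technology, Beijing Institute of Technology, Beijing 100081
2. Department of Information Security Engineering, China Public Security University (中国公安大学), Beijing 100038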
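Abstract:After examining existing statistics-based methods for selecting stop words, a new stop word selection method is proposed. For each word, the method computes the probability of the word occurring within each sentence of the corpus and the probability, within the corpus, of sentences that contain the word; it then computes the joint entropy of these probabilities and selects stop words according to that joint entropy. The stop word list selected by this method is compared with lists selected by traditional methods, and the effect of each method on classification performance, when used as preprocessing for text categorization, is also compared. The experimental results show that the method better avoids the influence of the corpus's writing format on stop word selection and is more suitable than traditional methods for text categorization preprocessing.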

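Keywords:stop word  Chinese stop word list  joint entropy  Chinese  word list  automatic  method selection  Selection  format  writing style  corpus  results  experiment  influence  classification performance  preprocessing  text categorization  comparison  probability  occurrence
Article ID:1001-0645(2005)04-0337-04
Received:June 3, 2004
Revised:June 3, 2004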

Automatic Selection of Chinese Stoplist
GU Yi-jun, FAN Xiao-zhong, WANG Jian-hua, WANG Tao and HUANG Wei-jin. Automatic Selection of Chinese Stoplist[J]. Journal of Beijing Institute of Technology(Natural Science Edition), 2005, 25(4):337-340.
Authors:GU Yi-jun  FAN Xiao-zhong  WANG Jian-hua  WANG Tao and HUANG Wei-jin
Institution:1. Department of Computer Science and Engineering, School of Information Science and Technology, Beijing Institute of Technology, Beijing 100081, China;2. Department of Information Security Science and Technology, China Security University, Beijing 100038, China
Abstract:After reviewing existing statistical methods for automatically selecting stop words, a new selection method is proposed. For each word, the method calculates the probability that the word occurs in each sentence of the corpus and the probability that sentences containing the word occur in the corpus, then calculates the joint entropy of these probabilities and selects stop words according to that entropy. The stoplist produced by this method is compared with those produced by traditional methods, and the effects of the various methods on categorization performance, when used as preprocessing for text categorization, are also compared. The experiments show that the method better avoids the influence of the corpus's writing format and style on stoplist selection, and is more suitable than traditional methods as a preprocessing step for text categorization.
Keywords:stop word  Chinese stoplist  joint entropy (union entropy)
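
The abstract names the ingredients of the method (a per-sentence word probability, a sentence probability, and an entropy computed from them) but does not give the formulas. The Python sketch below is therefore only an illustration of entropy-based stop word ranking under stated assumptions, not the authors' algorithm: it assumes the per-sentence probability is the word's relative frequency within a sentence, the sentence probability is the fraction of sentences containing the word, and it combines them as sentence probability times the entropy of the word's distribution over sentences. The function names `stopword_scores` and `select_stopwords` are hypothetical, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter


def stopword_scores(sentences):
    """Score each word; a higher score means more stop-word-like.

    `sentences` is a list of token lists (already word-segmented).
    The scoring formula is an assumption made for illustration only.
    """
    n_sentences = len(sentences)
    # For each word, collect its relative frequency in every sentence containing it.
    per_word = {}
    for tokens in sentences:
        if not tokens:
            continue
        for word, count in Counter(tokens).items():
            per_word.setdefault(word, []).append(count / len(tokens))

    scores = {}
    for word, freqs in per_word.items():
        total = sum(freqs)
        # Normalize so the per-sentence frequencies form a distribution over sentences.
        probs = [f / total for f in freqs]
        # Entropy is high when the word is spread evenly over many sentences.
        entropy = -sum(p * math.log2(p) for p in probs)
        # Fraction of sentences in the corpus that contain the word.
        sentence_prob = len(freqs) / n_sentences
        scores[word] = sentence_prob * entropy
    return scores


def select_stopwords(sentences, k):
    """Return the k highest-scoring words, breaking ties alphabetically."""
    scores = stopword_scores(sentences)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy corpus of three pre-segmented sentences.
corpus = [["我", "的", "书"], ["你", "的", "笔"], ["他", "的", "书"]]
print(select_stopwords(corpus, 1))  # → ['的']
```

A word confined to a single sentence gets an entropy of zero and so is never selected, whereas a function word that appears evenly across most sentences scores highest. Whether this ranking matches the paper's joint-entropy criterion cannot be confirmed from the abstract alone.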
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.