
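A Study of Word Segmentation Methods for Chinese Text Feature Selection
Cite this article: Huang Wei (黄魏). A Study of Word Segmentation Methods for Chinese Text Feature Selection [J]. Science Technology and Engineering (科学技术与工程), 2010(1).
Author: Huang Wei (黄魏)
Affiliation: College of Information Systems and Management, National University of Defense Technology
Funding: "Eleventh Five-Year Plan" Weapons and Equipment Advance Research Project (513300102)
Abstract: To address the loss of feature information in the terms produced by automatic Chinese word segmentation, this paper decomposes the segmentation process into three sub-processes and segments text with the word string as the segmentation unit. First, the text is segmented with the backward maximum matching method. Second, stop words are removed from the segmentation result. Third, the mutual information and adjacent co-occurrence frequency of the terms obtained in the first segmentation pass are computed, and these values decide which terms are combined into word strings. Experimental results show that the combined word strings carry richer feature information, which improves text feature selection and raises text classification performance.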

Keywords: text; text feature; word segmentation; term
Received: 9/15/2009 2:12:45 PM
Revised: 9/15/2009 2:12:45 PM
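
The abstract names mutual information and adjacent co-occurrence frequency as the two quantities that decide whether neighbouring terms are combined, but it does not give the formulas. One common reading, offered here only as an assumption, is pointwise mutual information estimated from corpus counts, with two hypothetical thresholds $\theta_I$ and $\theta_f$ that are not taken from the paper:

```latex
% Assumed form: pointwise mutual information of adjacent terms w_1, w_2.
% f(w_1 w_2) is the number of times w_2 immediately follows w_1;
% N_1 and N_2 are the total numbers of terms and adjacent pairs.
\[
  I(w_1, w_2) = \log_2 \frac{P(w_1 w_2)}{P(w_1)\,P(w_2)},
  \qquad
  P(w_1 w_2) \approx \frac{f(w_1 w_2)}{N_2},
  \quad
  P(w_i) \approx \frac{f(w_i)}{N_1}
\]
% Hypothetical combination rule: merge w_1 and w_2 into one word string when
\[
  I(w_1, w_2) \ge \theta_I
  \quad\text{and}\quad
  f(w_1 w_2) \ge \theta_f .
\]
```

A large $I(w_1, w_2)$ means the two terms appear together more often than their individual frequencies would predict, which is the usual argument for treating them as a single unit.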

Study on Method of Word Segmentation in Feature Selection in Chinese Text Categorization
Huang Wei. Study on Method of Word Segmentation in Feature Selection in Chinese Text Categorization[J]. Science Technology and Engineering, 2010(1).
Authors: Huang Wei
Abstract: Automatic Chinese word segmentation loses feature information carried by the segmented terms, so we propose segmenting text with the lexical chunk as the unit. The segmentation process is divided into three sub-processes. First, the text is segmented by Backward Maximum Matching. Second, stop words are deleted from the segmentation result. Third, the mutual information and adjacent co-occurrence frequency of the words from the first segmentation are computed, and these values determine which words are combined into lexical chunks. Experiments show that the combined lexical chunks carry richer feature information, which improves feature selection in Chinese text categorization and enhances text classification performance.
Keywords: Text; Text Feature; Word Segmentation; Words
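
The three sub-processes in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function names, the toy lexicon and stop-word list, the use of pointwise mutual information, the greedy left-to-right merge, and the thresholds `min_pmi` and `min_count` are all hypothetical choices that the abstract does not specify.

```python
import math
from collections import Counter


def backward_max_match(text, lexicon, max_len=4):
    """Step 1: scan right to left, taking the longest lexicon match each time."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                end -= size
                break
    tokens.reverse()
    return tokens


def remove_stop_words(tokens, stop_words):
    """Step 2: drop stop words from the segmentation result."""
    return [t for t in tokens if t not in stop_words]


def merge_chunks(docs, min_pmi=1.0, min_count=2):
    """Step 3: join adjacent terms whose PMI and co-occurrence count are high.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    unigrams = Counter(t for doc in docs for t in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def is_chunk(a, b):
        count = bigrams[(a, b)]
        if count < max(min_count, 1):  # also guards against log2(0)
            return False
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        return math.log2(p_ab / (p_a * p_b)) >= min_pmi

    merged_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            # Greedy pairwise merge: combine this term with its right neighbour.
            if i + 1 < len(doc) and is_chunk(doc[i], doc[i + 1]):
                out.append(doc[i] + doc[i + 1])
                i += 2
            else:
                out.append(doc[i])
                i += 1
        merged_docs.append(out)
    return merged_docs


# Toy data, invented for illustration only.
LEXICON = {"文本", "特征", "选择", "分类", "方法"}
STOP_WORDS = {"的"}
TEXTS = ["文本特征的选择", "文本特征选择方法", "文本分类的方法"]

segmented = [
    remove_stop_words(backward_max_match(t, LEXICON), STOP_WORDS) for t in TEXTS
]
merged = merge_chunks(segmented)
for doc in merged:
    print(doc)
```

Stop words are removed before the co-occurrence counts are taken, following the order of steps given in the abstract, so two terms separated only by a stop word count as adjacent in step 3. A real system would need a full lexicon and thresholds tuned on a corpus.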