Chinese Text Classification Without Word Segmentation (一种不需分词的中文文本分类方法)

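Cite this article:	XU Yun, FAN Xiao-zhong, ZHANG Feng. 一种不需分词的中文文本分类方法 (Chinese Text Classification Without Word Segmentation) [J]. 北京理工大学学报, 2005, 25(9): 778-781.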

Authors:	XU Yun (许云), FAN Xiao-zhong (樊孝忠), ZHANG Feng (张锋)

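Affiliation:	Department of Computer Science and Engineering, School of Information Science and Technology, Beijing Institute of Technology, Beijing 100081, China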

Funding:	Yunnan Province Special Fund for Information Technology

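Abstract:	This paper proposes an n-gram text classification method that requires no word segmentation. Unlike traditional text classification models, the method applies the n-gram language model at the character level, so classification needs no word segmentation and avoids a feature selection step that may discard useful information. Because the number of distinct characters is far smaller than the number of distinct words, the method reduces the impact of data sparsity compared with word-level classification methods. The key factors in the model and their effect on classification results are studied systematically. In experiments on the data provided by the Chinese TREC, the combined evaluation measure F_{β=1} reaches 86.8%.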
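The idea in the abstract can be illustrated with a minimal sketch: one character-level n-gram language model per class, with a document assigned to the class whose model gives it the highest probability. This is not the authors' implementation. The class name `CharNgramClassifier`, the add-one smoothing, the start-of-text padding character, and the toy training sentences are all assumptions made for illustration; this record does not state which smoothing or n the paper uses.

```python
import math
from collections import Counter, defaultdict


class CharNgramClassifier:
    """Hypothetical sketch: per-class character n-gram models, add-one smoothing."""

    def __init__(self, n=2):
        self.n = n
        self.ngrams = defaultdict(Counter)    # class -> n-gram counts
        self.contexts = defaultdict(Counter)  # class -> (n-1)-character context counts
        self.docs = Counter()                 # class -> number of training documents
        self.vocab = set()                    # every character seen in training

    def _grams(self, text):
        # Pad the start so the first characters also get a full-length context.
        padded = "\x02" * (self.n - 1) + text
        for i in range(len(text)):
            yield padded[i:i + self.n]

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.docs[label] += 1
            self.vocab.update(text)  # characters, not words: no segmentation step
            for gram in self._grams(text):
                self.ngrams[label][gram] += 1
                self.contexts[label][gram[:-1]] += 1
        return self

    def log_score(self, text, label):
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        score = math.log(self.docs[label] / sum(self.docs.values()))  # class prior
        for gram in self._grams(text):
            num = self.ngrams[label][gram] + 1         # add-one smoothing (assumed)
            den = self.contexts[label][gram[:-1]] + v
            score += math.log(num / den)
        return score

    def predict(self, text):
        return max(self.docs, key=lambda c: self.log_score(text, c))


# Toy data invented for this example; it is not the Chinese TREC collection.
train_texts = ["股市今日大幅上涨", "银行下调贷款利率", "球队赢得联赛冠军", "球员在比赛中进球"]
train_labels = ["finance", "finance", "sports", "sports"]
clf = CharNgramClassifier(n=2).fit(train_texts, train_labels)
print(clf.predict("利率上涨"))
```

Every character n-gram contributes to the score, so no separate feature selection pass is needed, and the smoothed vocabulary is the character set rather than a much larger word list.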
Keywords:	text classification; word segmentation; n-gram language model
Article ID:	1001-0645(2005)09-0778-04
Received:	October 15, 2004
Revised:	October 15, 2004
Chinese Text Classification Without Word Segmentation

XU Yun, FAN Xiao-zhong, ZHANG Feng. Chinese Text Classification Without Word Segmentation [J]. Journal of Beijing Institute of Technology (Natural Science Edition), 2005, 25(9): 778-781.

Authors:	XU Yun, FAN Xiao-zhong, ZHANG Feng

Institution:	Department of Computer Science and Engineering, School of Information Science and Technology, Beijing Institute of Technology, Beijing 100081, China

Abstract:	An approach to Chinese text classification without word segmentation, based on n-gram language modeling, is proposed. Unlike traditional text classification models, the approach applies n-gram modeling at the character level, which avoids both word segmentation and explicit feature selection procedures that tend to lose a significant amount of useful information. It greatly reduces data sparsity, because the character vocabulary is much smaller than the word vocabulary. The key factors in language modeling and their influence on classification are studied systematically. Experiments on Chinese TREC data show that the combined evaluation measure F_{β=1} reaches 86.8%.
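For reference, the combined measure F_{β=1} quoted for the experiments is conventionally defined from precision P and recall R as shown below. This is the standard textbook definition; whether the paper averages it per class or over all classification decisions is not stated in this record, so that choice is left open here.

```latex
F_{\beta} = \frac{(1+\beta^{2})\,P\,R}{\beta^{2}P + R},
\qquad
F_{\beta=1} = \frac{2\,P\,R}{P + R}
```

With β = 1, precision and recall are weighted equally, so the measure is their harmonic mean.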

Keywords:	text classification; word segmentation; n-gram model
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.