
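Multilingual Text Classification Based on a Statistical Dictionary and Feature Enhancement
Cite this article: GONG Jing, LI Ying-jie, HUANG Xin-yang. Multilingual Text Classification Based on a Statistical Dictionary and Feature Enhancement[J]. Journal of Southwest China Normal University (Natural Science), 2018, 43(9): 45-50.
Authors: GONG Jing (龚静)  LI Ying-jie (李英杰)  HUANG Xin-yang (黄欣阳)
Affiliations: Department of Public Basic Courses, Hunan Polytechnic of Environment and Biology; Computer School, University of South China
Funding: National Natural Science Foundation of China (60572137); Hunan Provincial Department of Education projects (12C1056; 17C0599).
Abstract: Building on a statistical bilingual dictionary, this paper proposes a feature-enhanced method for multi-language text classification (MLTC). When classifying text, the method also takes training texts in other languages into account, so that training texts exist in the text collections of multiple languages, which relaxes the requirements of MLTC. Feature enhancement is a cross-checking process: after the chi-square statistics of all features in both languages are obtained, the discriminative power of each language's features is re-evaluated using the discriminative power of the related features, which improves the reliability of classification. Chinese or English was chosen as the target language in the experiments. The results show that the proposed method achieves higher classification accuracy and is less sensitive to the size of the training set.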

Keywords: multi-language text classification  bilingual dictionary  feature enhancement  cross-checking  sensitivity
Received: 2016/11/22

Multiple Language Text Classification Method Based on Statistical Dictionary and Feature Enhancing
GONG Jing,LI Ying-jie,HUANG Xin-yang.Multiple Language Text Classification Method Based on Statistical Dictionary and Feature Enhancing[J].Journal of Southwest China Normal University(Natural Science),2018,43(9):45-50.
Authors:GONG Jing  LI Ying-jie  HUANG Xin-yang
Institution:1. Department of Public Basic Course, Hunan Polytechnic of Environment and Biology, Hengyang Hunan 421005, China;2. Computer School, University of South China, Hengyang Hunan 421001, China
Abstract:Conventional multiple language text classification (MLTC) reduces to several independent single-language classification problems. To address this limitation, a feature-enhanced multiple language text classification method based on a statistical bilingual dictionary is proposed. When classifying text, the method also takes the training texts of other languages into account, so that training texts are available across the text collections of several languages, which relaxes the requirements of MLTC. Feature enhancing is a cross-examination process: after the chi-square statistics of all features in the two languages are obtained, the discriminative power of each language's features is reassessed using the discriminative power of the related features, which improves the reliability of classification. Chinese or English is chosen as the target language in the experiments. Experimental results show that the proposed method achieves higher classification accuracy and is less sensitive to the size of the training set.
Keywords:multiple language text classification  bilingual dictionary  feature enhancing  cross examination  sensitivity
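
The abstract names the two ingredients of feature enhancing (per-language chi-square statistics and a cross-examination through the statistical bilingual dictionary) but does not give the formulas. The following Python sketch is therefore only an illustration under stated assumptions, not the authors' implementation: the chi-square statistic is the standard 2x2 term/class form, while the function names, the dictionary layout (`{term: {translation: probability}}`), the `weight` parameter, and the linear blending rule in `reinforce` are all hypothetical choices made for this example.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Standard chi-square statistic for a 2x2 term/class table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def chi_square_scores(docs, labels, target):
    """Chi-square score of every term with respect to the class `target`.

    docs is a list of token lists; labels holds one class label per document.
    """
    n_docs = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    df_all, df_pos = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # document frequency, not term frequency
            df_all[term] += 1
            if y == target:
                df_pos[term] += 1
    scores = {}
    for term, df in df_all.items():
        n11 = df_pos[term]
        n10 = df - n11
        n01 = n_pos - n11
        n00 = n_docs - n_pos - n10
        scores[term] = chi_square(n11, n10, n01, n00)
    return scores


def reinforce(scores_a, scores_b, dictionary, weight=0.5):
    """Hypothetical cross-check: blend each term's own score in language A
    with the probability-weighted scores of its translations in language B.
    The blending rule is an assumption, not taken from the paper.
    """
    reinforced = {}
    for term, own in scores_a.items():
        related = dictionary.get(term, {})
        total = sum(related.values())
        if total > 0:
            cross = sum(p * scores_b.get(t, 0.0) for t, p in related.items()) / total
            reinforced[term] = (1 - weight) * own + weight * cross
        else:
            reinforced[term] = own  # no dictionary entry: keep the original score
    return reinforced


# Toy data, invented for illustration only.
en_docs = [["stock", "market"], ["stock", "price"], ["match", "goal"], ["team", "goal"]]
zh_docs = [["股票", "市场"], ["股票", "价格"], ["比赛", "进球"], ["球队", "进球"]]
labels = ["finance", "finance", "sport", "sport"]

scores_en = chi_square_scores(en_docs, labels, "finance")
scores_zh = chi_square_scores(zh_docs, labels, "finance")
toy_dictionary = {"market": {"市场": 0.8, "股票": 0.2}}
enhanced_en = reinforce(scores_en, scores_zh, toy_dictionary)
```

In this toy run, a term such as "market" that has dictionary entries gets its score pulled toward the scores of its Chinese translations, while terms without entries keep their original chi-square score. How the paper actually combines the two sources of evidence should be checked against the full text.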
This article is indexed in CNKI and other databases.