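A Text Feature Selection Method Based on TongYiCi CiLin
Cite this article: ZHENG Yan-hong, ZHANG Dong-zhan. A Text Feature Selection Method Based on TongYiCi CiLin[J]. Journal of Xiamen University (Natural Science), 2012, 51(2): 200-203
Authors: ZHENG Yan-hong  ZHANG Dong-zhan
Affiliation: School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005
Funding: National Natural Science Foundation of China (50604012)
Abstract: Feature selection is one of the important problems in text classification, machine learning, and pattern recognition. It reduces the number of features in high-dimensional data while preserving data integrity, and at the same time improves classification accuracy. Earlier feature selection methods based on TongYiCi CiLin (a Chinese synonym thesaurus) effectively avoid conceptual duplication among the extracted features, but they do not account for the fact that the subset formed from the highest-weighted features may not be the optimal subset. To address this, the paper combines synonyms with a genetic algorithm and proposes a new text feature selection method based on TongYiCi CiLin. The method first filters and merges synonymous feature words, which lowers the dimensionality of the feature vector and removes the influence of synonyms. It then applies an improved genetic algorithm to select feature vectors with good fitness values. Experimental results show that, compared with earlier methods, this approach markedly reduces the dimensionality of the feature vector while maintaining feature selection accuracy.

Keywords: feature selection  TongYiCi CiLin  genetic algorithm  text classification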

A Text Feature Selection Method Based on TongYiCi CiLin
ZHENG Yan-hong, ZHANG Dong-zhan. A Text Feature Selection Method Based on TongYiCi CiLin[J]. Journal of Xiamen University (Natural Science), 2012, 51(2): 200-203
Authors: ZHENG Yan-hong  ZHANG Dong-zhan
Affiliation: School of Information Science and Technology, Xiamen University, Xiamen 361005, China
Abstract: Feature selection is one of the important problems in text categorization, machine learning and pattern recognition. In particular, with the rapid development of networks and cloud computing, methods for analyzing massive data have become vitally important. Feature selection can reduce the dimensionality of high-dimensional data while preserving data integrity and classification accuracy. Previously proposed feature selection methods based on TongYiCi CiLin effectively avoid conceptual redundancy among the extracted features, but they do not consider that the subset composed of the highest-weighted features may not be the best one. To solve this problem, this article combines synonyms (TongYiCi) with a genetic algorithm and proposes a text feature selection method based on TongYiCi CiLin. The experimental results show that the method can reduce the dimension of the feature vector and improve the efficiency of feature selection.
Keywords: feature selection  TongYiCi CiLin  genetic algorithm  text categorization
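
The abstract outlines two stages: merging synonymous feature words, then searching for a good feature subset with a genetic algorithm. The following Python sketch illustrates that general idea only; it is not the authors' implementation. The `SYNONYM_GROUP` table, its group codes, the truncation-selection scheme, and `toy_fitness` are all hypothetical stand-ins (the paper's "improved" genetic algorithm and its fitness function are not described in the abstract, and a real system would look terms up in TongYiCi CiLin and score subsets with a classifier).

```python
import random
from collections import defaultdict

# Hypothetical toy thesaurus: word -> synonym-group code. These codes are
# invented for illustration and are not real TongYiCi CiLin codes.
SYNONYM_GROUP = {"car": "G1", "automobile": "G1", "fast": "G2", "quick": "G2"}


def merge_synonyms(term_weights):
    """Stage 1: collapse terms in the same synonym group, summing weights."""
    merged = defaultdict(float)
    for term, weight in term_weights.items():
        merged[SYNONYM_GROUP.get(term, term)] += weight
    return dict(merged)


def select_features(names, fitness_fn, pop_size=20, generations=40,
                    mutation_rate=0.1, seed=0):
    """Stage 2: plain genetic algorithm over 0/1 masks (assumed scheme)."""
    rng = random.Random(seed)
    names = sorted(names)
    n = len(names)

    def decode(mask):
        return [name for name, keep in zip(names, mask) if keep]

    def score(mask):
        return fitness_fn(decode(mask))

    # Random initial population of bit masks, one bit per feature.
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the better half survives unchanged.
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1) if n > 1 else 0  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return decode(max(population, key=score))


merged = merge_synonyms({"car": 1.0, "automobile": 2.0, "fast": 0.5, "engine": 1.5})


def toy_fitness(selected):
    # Hypothetical stand-in for classifier accuracy: reward total weight,
    # penalize subset size so that smaller feature sets are preferred.
    return sum(merged[name] for name in selected) - 0.8 * len(selected)


selected = select_features(merged, toy_fitness)
print(merged)
print(selected)
```

Passing the fitness as a function keeps the search independent of how subsets are scored, which matters here because the abstract's point is that a subset's quality need not be the sum of its features' individual weights.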
This article is indexed in CNKI, Wanfang Data, and other databases.