A New Approach of Feature Selection for Text Categorization
Article ID: 1007-1202(2006)05-1335-05
Received: 2006-03-10

Citation: CUI Zifeng, XU Baowen, ZHANG Weifeng, XU Junling. A New Approach of Feature Selection for Text Categorization[J]. Wuhan University Journal of Natural Sciences, 2006, 11(5): 1335-1339.
Authors: Cui Zifeng, Xu Baowen, Zhang Weifeng, Xu Junling
Institution: (1) School of Computer Science and Engineering, Southeast University, Nanjing 210096, Jiangsu, China; (2) Department of Computer Science and Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China
Abstract: This paper proposes a feature selection approach for text categorization (TC) based on a measure of independence between features. Probabilistic models for TC widely assume that terms occur in documents independently of one another, and the paper discusses this fundamental hypothesis. The hypothesis, however, is incomplete with respect to the independence of the feature set. From the viewpoint of feature selection, a new measure of independence between features is designed, and a feature selection algorithm built on it is given to obtain a feature subset. The selected subset is highly relevant to the category and strongly independent among its features, so it satisfies the basic hypothesis to the greatest possible degree. Compared with traditional feature selection methods in TC, which take only relevance into account, the feature subset selected by the proposed method performs better in experiments on the 20 Newsgroups benchmark dataset.
Keywords: feature selection; independence; chi-square (CHI) test; text categorization
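
The abstract does not give the independence measure or the selection algorithm itself, so the following is only a minimal sketch of the general idea it describes: prefer features that are relevant to the category while staying independent of the features already chosen. It assumes binary term-occurrence data, uses the 2x2 chi-square statistic for both relevance and pairwise dependence, and applies a greedy filter. The function names (`chi2_2x2`, `chi2_binary`, `select_features`) and the threshold `max_dependence` are hypothetical choices for illustration, not the authors' method.

```python
from typing import List, Sequence


def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A constant variable carries no information; treat as independent.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def chi2_binary(x: Sequence[int], y: Sequence[int]) -> float:
    """Chi-square statistic between two binary (0/1) sequences of equal length."""
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)
    d = len(x) - a - b - c
    return chi2_2x2(a, b, c, d)


def select_features(
    docs: List[List[int]],
    labels: Sequence[int],
    k: int,
    max_dependence: float = 3.84,  # assumed cutoff: chi-square critical value, 1 d.o.f., p = 0.05
) -> List[int]:
    """Greedily pick up to k term indices.

    docs[i][j] is 1 if term j occurs in document i, else 0.
    labels[i] is 1 if document i belongs to the category, else 0.
    """
    n_terms = len(docs[0])
    columns = [[row[j] for row in docs] for j in range(n_terms)]

    # Relevance: chi-square between each term and the category.
    relevance = [chi2_binary(col, labels) for col in columns]
    ranked = sorted(range(n_terms), key=lambda j: relevance[j], reverse=True)

    selected: List[int] = []
    for j in ranked:
        if len(selected) == k:
            break
        # Independence: skip a term that depends strongly on any chosen term.
        if all(chi2_binary(columns[j], columns[s]) <= max_dependence for s in selected):
            selected.append(j)
    return selected
```

On a toy collection in which two terms always co-occur, a relevance-only ranking would keep both, whereas this sketch keeps one of them and moves on to the next term that is not strongly dependent on it.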
This article is indexed in CNKI, VIP, Wanfang Data, SpringerLink, and other databases.