A New Approach of Feature Selection for Text Categorization

Authors:	CUI Zifeng, XU Baowen, ZHANG Weifeng, XU Junling

Affiliations:	[1] School of Computer Science and Engineering, Southeast University, Nanjing 210096, Jiangsu, China; [2] Department of Computer Science and Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China

Abstract:	This paper proposes a new feature selection approach for text categorization (TC) based on a measure of independence between features. Probabilistic models for TC widely rely on the fundamental hypothesis that terms occur in documents independently of each other. This hypothesis is discussed, and it is found to be incomplete with respect to the independence of the feature set. From the viewpoint of feature selection, a new measure of independence between features is designed, and a feature selection algorithm based on it is given to obtain a feature subset. The selected subset is highly relevant to the category and strongly independent among its features, so it satisfies the basic hypothesis to the greatest possible degree. Traditional feature selection methods in TC take only relevance into account; in experiments on the 20 Newsgroups benchmark dataset, the feature subset selected by the proposed method outperforms the subsets selected by those methods.
Keywords:	feature selection; independence; chi-square test; text categorization
Article ID:	1007-1202(2006)05-1335-05
Received:	2006-03-10
CUI Zifeng, XU Baowen, ZHANG Weifeng, XU Junling. A New Approach of Feature Selection for Text Categorization[J]. Wuhan University Journal of Natural Sciences, 2006, 11(5): 1335-1339.
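
The abstract describes selecting features that are both relevant to the category and independent of one another, but it does not give the paper's independence measure or algorithm. The following is only a minimal sketch of that general idea under stated assumptions, not the authors' method: it assumes relevance is scored with the chi-square statistic between term presence and category membership, that dependence between two terms is scored with the same statistic on their co-occurrence, and that selection is greedy. The names `chi_square`, `association`, `select_features`, and `max_dependence`, the default threshold 3.841 (the 5% critical value of chi-square with one degree of freedom), and the toy documents are all hypothetical.

```python
from itertools import chain


def chi_square(a, b, c, d):
    """Chi-square statistic of the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def association(xs, ys):
    """Chi-square association between two equal-length boolean sequences."""
    a = sum(1 for x, y in zip(xs, ys) if x and y)
    b = sum(1 for x, y in zip(xs, ys) if x and not y)
    c = sum(1 for x, y in zip(xs, ys) if not x and y)
    d = sum(1 for x, y in zip(xs, ys) if not x and not y)
    return chi_square(a, b, c, d)


def select_features(docs, labels, k, max_dependence=3.841):
    """Greedily pick up to k terms, most relevant first, skipping any term
    whose dependence on an already selected term exceeds max_dependence.
    Assumption: this greedy rule stands in for the paper's algorithm,
    which the abstract does not specify."""
    vocab = sorted(set(chain.from_iterable(docs)))
    presence = {t: [t in doc for doc in docs] for t in vocab}
    categories = sorted(set(labels))
    # Relevance of a term: its strongest association with any single category.
    relevance = {
        t: max(association(presence[t], [y == c for y in labels])
               for c in categories)
        for t in vocab
    }
    selected = []
    for t in sorted(vocab, key=lambda t: (-relevance[t], t)):
        if len(selected) == k:
            break
        if all(association(presence[t], presence[s]) <= max_dependence
               for s in selected):
            selected.append(t)
    return selected


# Toy corpus (hypothetical): each document is a set of terms.
docs = [
    {"goal", "match"}, {"goal", "match", "team"}, {"goal", "match"},
    {"cpu", "disk"}, {"cpu", "team"}, {"cpu", "disk"},
    {"vote"}, {"vote", "team"}, {"vote"},
]
labels = ["sport"] * 3 + ["tech"] * 3 + ["politics"] * 3

with_independence = select_features(docs, labels, k=3)
relevance_only = select_features(docs, labels, k=3, max_dependence=float("inf"))
print(with_independence)  # ['cpu', 'goal', 'vote']
print(relevance_only)     # ['cpu', 'goal', 'match']
```

In this toy corpus "goal" and "match" always co-occur, so a relevance-only ranking keeps both, while the independence check drops "match" and admits "vote" instead.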

This article is indexed in CNKI, VIP, Wanfang Data, SpringerLink, and other databases.