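Research on a text classification model based on random forests
Cite this article: ZHANG Hua-wei, WANG Ming-wen, GAN Li-xin. Research on a text classification model based on random forests[J]. Journal of Shandong University (Natural Science), 2006, 41(3): 5-9
Authors: ZHANG Hua-wei  WANG Ming-wen  GAN Li-xin
Affiliation: School of Computer Information Engineering, Jiangxi Normal University, Nanchang 330022, Jiangxi
Funding: Science and Technology Research Project of the Ministry of Education; Natural Science Foundation of Jiangxi Province
Abstract: With the rapid growth of the WWW, text classification has become a key technology for processing and organizing large volumes of document data. A random forest is an ensemble of decision trees in which a random vector determines how each tree is constructed. As the number of decision trees in the forest increases, the generalization error of the random forest tends toward an upper bound. The random forest model is applied to text classification, and experiments on the Reuters-21578 dataset show that it classifies well and performs stably. It is compared with four typical text classifiers, C4.5, KNN, SMO and SVM; the results show that its classification performance is better than that of C4.5 and comparable to that of KNN, SMO and SVM.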
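
For reference, the upper bound on the generalization error is usually stated as below in Breiman's analysis of random forests. This page does not give the formula, so the notation is an assumption: $h(X,\Theta)$ is a tree built from the random vector $\Theta$, $mr$ is the margin function, $s$ is the strength of the individual trees, and $\bar\rho$ is the mean correlation between trees.

```latex
\begin{align}
mr(X,Y) &= P_\Theta\big(h(X,\Theta)=Y\big) - \max_{j\neq Y} P_\Theta\big(h(X,\Theta)=j\big) \\
PE^{*}  &= P_{X,Y}\big(mr(X,Y) < 0\big) \\
PE^{*}  &\le \frac{\bar\rho\,(1-s^{2})}{s^{2}}, \qquad s = E_{X,Y}\, mr(X,Y) > 0
\end{align}
```

Read this way, the bound is small when the individual trees are accurate (large $s$) and weakly correlated with one another (small $\bar\rho$).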

Keywords: text classification  random forest  decision tree  generalization error
Article ID: 1671-9352(2006)03-0005-05
Received: 2006-03-29
Revised: 2006-03-29

Automatic text classification model based on random forest
ZHANG Hua-wei,WANG Ming-wen,GAN Li-xin. Automatic text classification model based on random forest[J]. Journal of Shandong University, 2006, 41(3): 5-9
Authors:ZHANG Hua-wei  WANG Ming-wen  GAN Li-xin
Affiliation:School of Computer Information Engineering, Jiangxi Normal Univ., Nanchang 330027, Jiangxi, China
Abstract:With the rapid development of the World Wide Web, text classification has become a key technology for organizing and processing large amounts of document data. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges to a limit as the number of trees in the forest becomes large. In experiments, the random forest is compared with C4.5, KNN, SMO and SVM, and the results show that its performance is higher than that of C4.5 and comparable with that of KNN, SMO and SVM. It is a promising technique for text categorization.
Keywords:text classification   random forest   generalization error
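
As a rough illustration of the idea in the abstract (an ensemble of trees, each grown from its own random draw and combined by voting), here is a minimal Python sketch. It is not the authors' implementation. The toy corpus, the category names, the function names, the word presence features, the Gini split criterion and the square-root rule for the number of candidate words are all assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(docs, labels, vocab, rng, n_try):
    """Grow an unpruned tree that splits on word presence.

    docs are sets of words. At each node only n_try randomly chosen
    words are considered as split candidates (assumed rule).
    """
    if len(set(labels)) == 1:
        return labels[0]                              # pure leaf
    best = None
    for word in rng.sample(vocab, min(n_try, len(vocab))):
        yes = [y for d, y in zip(docs, labels) if word in d]
        no = [y for d, y in zip(docs, labels) if word not in d]
        if not yes or not no:
            continue                                  # word separates nothing
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if best is None or score < best[0]:
            best = (score, word)
    if best is None:
        return Counter(labels).most_common(1)[0][0]   # majority leaf
    word = best[1]
    has = [(d, y) for d, y in zip(docs, labels) if word in d]
    lacks = [(d, y) for d, y in zip(docs, labels) if word not in d]
    return (word,
            build_tree([d for d, _ in has], [y for _, y in has], vocab, rng, n_try),
            build_tree([d for d, _ in lacks], [y for _, y in lacks], vocab, rng, n_try))

def predict_tree(tree, doc):
    """Follow word tests down to a leaf label."""
    while isinstance(tree, tuple):
        word, has, lacks = tree
        tree = has if word in doc else lacks
    return tree

def train_forest(docs, labels, n_trees=25, seed=0):
    """Train n_trees trees, each on its own bootstrap sample."""
    rng = random.Random(seed)
    vocab = sorted(set().union(*docs))
    n_try = max(1, int(len(vocab) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(docs)) for _ in docs]   # sample with replacement
        forest.append(build_tree([docs[i] for i in idx],
                                 [labels[i] for i in idx], vocab, rng, n_try))
    return forest

def predict_forest(forest, doc):
    """Majority vote over all trees."""
    votes = Counter(predict_tree(t, doc) for t in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus; the experiments in the paper use Reuters-21578.
train = [
    ("wheat corn harvest export tonnes", "grain"),
    ("corn crop wheat farmers", "grain"),
    ("grain shipment wheat tonnes", "grain"),
    ("oil barrel crude opec prices", "crude"),
    ("crude oil output opec", "crude"),
    ("refinery crude barrel prices", "crude"),
]
docs = [set(text.split()) for text, _ in train]
labels = [label for _, label in train]
forest = train_forest(docs, labels)
print(predict_forest(forest, set("wheat harvest tonnes".split())))
print(predict_forest(forest, set("opec crude oil".split())))
```

Two sources of randomness are in play here: the bootstrap sample that each tree is trained on, and the random subset of candidate words tried at each node. Individual trees can be wrong on a given document, which is why the forest decides by majority vote.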
This article is indexed in CNKI, VIP, Wanfang Data and other databases.