Research on a text classification model based on random forest

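Cite this article:	ZHANG Hua-wei, WANG Ming-wen, GAN Li-xin. Research on a text classification model based on random forest[J]. Journal of Shandong University (Natural Science), 2006, 41(3): 5-9.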

Authors:	ZHANG Hua-wei, WANG Ming-wen, GAN Li-xin

Affiliation:	School of Computer Information Engineering, Jiangxi Normal University, Nanchang 330022, Jiangxi, China

Funding:	Science and Technology Research Project of the Ministry of Education; Natural Science Foundation of Jiangxi Province

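Abstract:	With the rapid development of the WWW, text classification has become a key technology for processing and organizing large volumes of document data. A random forest is an ensemble of decision trees in which a random vector determines how each tree is constructed. As the number of trees in the forest grows, the generalization error of the random forest tends toward an upper bound. The random forest model is applied to text classification, and experiments on the Reuters-21578 dataset show good classification results and stable performance. The model is also compared with four typical text classifiers, C4.5, KNN, SMO, and SVM; the results show that its classification performance is better than that of C4.5 and comparable to that of KNN, SMO, and SVM.
Keywords:	text classification; random forest; decision tree; generalization error
Article ID:	1671-9352(2006)03-0005-05
Received:	2006-03-29
Revised:	2006-03-29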
Automatic text classification model based on random forest

ZHANG Hua-wei, WANG Ming-wen, GAN Li-xin. Automatic text classification model based on random forest[J]. Journal of Shandong University, 2006, 41(3): 5-9.

Authors:	ZHANG Hua-wei, WANG Ming-wen, GAN Li-xin

Institution:	School of Computer Information Engineering, Jiangxi Normal Univ., Nanchang 330027, Jiangxi, China

Abstract:	With the rapid development of the World Wide Web, text classification has become a key technology for organizing and processing large amounts of document data. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges to a limit as the number of trees in the forest becomes large. In experiments, the random forest model is compared with C4.5, KNN, SMO, and SVM, and the results show that its performance is better than that of C4.5 and comparable with that of KNN, SMO, and SVM. It is a promising technique for text categorization.

Keywords:	text classification; random forest; generalization error
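
The abstract describes a random forest as an ensemble of tree predictors, each built from an independently sampled random vector, with the final class decided by the trees together. The following is a minimal sketch of that idea applied to bag-of-words text, not the authors' implementation. It makes several simplifying assumptions: each "tree" is a one-word decision stump rather than a full decision tree, the random vector is a bootstrap sample plus a random subset of vocabulary words, and the documents in `DOCS` are hypothetical toy data rather than Reuters-21578. The names `train_stump`, `train_forest`, and `predict` are illustrative only.

```python
import random
from collections import Counter

# Hypothetical toy corpus (not Reuters-21578): (text, label) pairs.
DOCS = [
    ("wheat corn harvest grain export", "grain"),
    ("grain wheat price farm crop", "grain"),
    ("corn crop farm harvest tonnes", "grain"),
    ("wheat export tonnes grain farm", "grain"),
    ("bank rate interest loan credit", "money"),
    ("interest rate bank dollar market", "money"),
    ("loan credit bank currency dollar", "money"),
    ("currency market rate interest loan", "money"),
]


def tokenize(text):
    """Represent a document as the set of words it contains."""
    return set(text.split())


def train_stump(sample, words_to_try):
    """Fit a one-word stump: (word, label if present, label if absent).

    Simplification: a real forest would grow a full tree at this step.
    """
    best = None
    for word in words_to_try:
        present = [label for words, label in sample if word in words]
        absent = [label for words, label in sample if word not in words]
        if not present or not absent:
            continue  # this word does not split the sample
        p_label, p_count = Counter(present).most_common(1)[0]
        a_label, a_count = Counter(absent).most_common(1)[0]
        correct = p_count + a_count  # training docs the stump gets right
        if best is None or correct > best[0]:
            best = (correct, word, p_label, a_label)
    if best is None:
        # No usable split: always predict the sample's majority label.
        majority = Counter(label for _, label in sample).most_common(1)[0][0]
        return (None, majority, majority)
    return best[1:]


def train_forest(docs, n_trees=25, n_features=4, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample and word subset."""
    rng = random.Random(seed)
    data = [(tokenize(text), label) for text, label in docs]
    vocab = sorted(set().union(*(words for words, _ in data)))
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # bootstrap sample
        subset = rng.sample(vocab, n_features)     # random feature subset
        forest.append(train_stump(sample, subset))
    return forest


def predict(forest, text):
    """Classify a document by majority vote over all stumps."""
    words = tokenize(text)
    votes = Counter()
    for word, if_present, if_absent in forest:
        votes[if_present if word in words else if_absent] += 1
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    forest = train_forest(DOCS)
    print(predict(forest, "wheat harvest farm grain"))
    print(predict(forest, "bank loan interest rate"))
```

Individual stumps are weak and can disagree, so the sketch relies on the vote to stabilize the result. Raising `n_trees` is the knob that corresponds to the abstract's remark about the number of trees becoming large.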
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
