
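A Text Classification Algorithm Based on LDA
Cite this article: HE Jin-qun, LIU Peng-jie. A text classification algorithm based on LDA[J]. Journal of Tianjin University of Technology, 2014, 0(4): 28-31
Authors: HE Jin-qun, LIU Peng-jie
Affiliation: Key Laboratory of Mobile Computing and Data Mining; Key Laboratory of Computer Vision and System, Ministry of Education; School of Computer and Communications Engineering, Tianjin University of Technology, Tianjin 300384
Funding: National Natural Science Foundation of China (61202169; 61170027)
Abstract: LDA can mine latent topics from large data collections and classify text. The model assumes that if a document is related to a topic, then every word in the document is related to that topic. On large-scale real-world data, however, this assumption broadens the scope of each topic, prevents the latent semantics of topic words from being located accurately, and limits the robustness and effectiveness of the model. To address this shortcoming of LDA, this paper proposes a new document topic classification algorithm, gLDA, which adds a topic-category distribution parameter to determine the range from which topics are generated and thereby improves classification accuracy. Results on the Reuters-21578 dataset and the Fudan University text corpus show that the model classifies somewhat better than traditional topic classification models.

Keywords: topic model, LDA, text classification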

The documents classification algorithm based on LDA
HE Jin-qun,LIU Peng-jie. The documents classification algorithm based on LDA[J]. Journal of Tianjin University of Technology, 2014, 0(4): 28-31
Authors: HE Jin-qun, LIU Peng-jie
Affiliation: School of Computer and Communications Engineering, Tianjin Key Laboratory of Intelligence Computing and Data Mining Technology, Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology, Tianjin 300384, China
Abstract: Latent Dirichlet Allocation (LDA) is a classic topic model that can extract latent topics from large corpora. The model assumes that if a document is relevant to a topic, then all tokens in the document are relevant to that topic. In this paper, we present an improved text classification algorithm that adds a topic-category distribution parameter to Latent Dirichlet Allocation, narrowing the scope from which each document is generated. Documents in this model are generated from the category to which they are most relevant. Gibbs sampling is employed for approximate inference, and a preliminary experiment is presented at the end of the paper.
Keywords: topic model, LDA, text classification
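
The abstract mentions Gibbs sampling and a narrowed generation scope but gives no equations, so the following is only a minimal sketch of collapsed Gibbs sampling for standard LDA, not the paper's gLDA. The optional `allowed` argument is a hypothetical stand-in for the idea of restricting each document to the topics of one category; the function name, the hyperparameters `alpha` and `beta`, and the toy data are all assumptions made for illustration.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, allowed=None,
              alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids.

    allowed[d] (hypothetical, optional) lists the topic ids that
    document d may use. None means every topic, i.e. plain LDA.
    """
    rng = random.Random(seed)
    if allowed is None:
        allowed = [list(range(n_topics)) for _ in docs]

    n_dk = [[0] * n_topics for _ in docs]               # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                                # tokens per topic
    z = []                                              # topic of each token

    # Random initialisation inside each document's allowed topic set.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.choice(allowed[d])
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised conditional p(z = t | all other assignments).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in allowed[d]
                ]
                k = rng.choices(allowed[d], weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Posterior mean of each document's topic mixture over its allowed topics.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + len(allowed[d]) * alpha
        row = [0.0] * n_topics
        for t in allowed[d]:
            row[t] = (n_dk[d][t] + alpha) / denom
        theta.append(row)
    return theta, n_kw


# Toy corpus: three documents over a five-word vocabulary.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 2], [0, 0, 1, 1]]
# Assumed category-to-topic mapping: topics 0-1 and topics 2-3.
theta, n_kw = gibbs_lda(docs, n_topics=4, vocab_size=5,
                        allowed=[[0, 1], [2, 3], [0, 1]])
```

Each row of `theta` is a document's estimated topic mixture; topics outside the document's allowed set keep probability zero, which is one simple way to picture a narrowed generation scope. A document could then be assigned to the category whose topics carry the most mass, though whether the paper classifies this way is not stated in the abstract.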
This article is indexed in CNKI, VIP, and other databases.