
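Topic Discovery and Clustering of Online Journals Based on the LDA Model
Cite this article:YANG Chuanchun, ZHANG Bingxue, LI Rende, GUO Qiang. Topic discovery and clustering of online journals based on the LDA model[J]. Journal of University of Shanghai for Science and Technology, 2019, 41(3): 273-280.
Authors:YANG Chuanchun  ZHANG Bingxue  LI Rende  GUO Qiang
Affiliations:Research Center of Complex Systems Science, University of Shanghai for Science and Technology, Shanghai 200093; MPA Education Center, University of Shanghai for Science and Technology, Shanghai 200093; Research Center of Complex Systems Science, University of Shanghai for Science and Technology, Shanghai 200093; Research Center of Complex Systems Science, University of Shanghai for Science and Technology, Shanghai 200093
Abstract:As smart terminals become widespread, demand for mining topics from text keeps growing. Topic modeling is the core of text topic mining. The LDA generative model is a probabilistic model built on a Bayesian framework; grounded in semantic association, it handles the extraction of latent topics from text well. This paper describes and analyzes in some depth the core techniques of the text clustering process: the LDA generative model, data sampling, and model evaluation. Topic discovery and clustering experiments were carried out on 2 794 learning journals from an online education platform, and a lexicon of 3 800 terms was built. Topic clustering was solved in two steps, using the kmeans algorithm and a union vector method (UVM). A general procedure for text mining experiments is proposed, together with an improvement to the text distance computation used in hierarchical clustering. The experimental results show that the journals on this platform are, overall, highly similar in topic, but the topics are so concentrated that many journals have no distinguishing content, which makes it harder for users to locate a topic.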

Keywords:LDA model  generative model  topic discovery  hierarchical clustering  text mining
Received:2018/5/30
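
The abstract names data sampling as one of the core techniques of the pipeline but does not say which sampler was used. As a minimal sketch, assuming collapsed Gibbs sampling (a common way to fit LDA, not confirmed by this record), the per-token update could look roughly like the following. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy corpus are all illustrative assumptions, not the authors' code or data.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # tokens of word w in topic k
    topic_total = [0] * n_topics                              # all tokens in topic k
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised full conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed estimate of each document's topic mixture.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]

# Hypothetical toy corpus: word ids 0-2 and 3-5 act as two separate vocabularies.
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2, 0], [3, 4, 5, 3, 4], [4, 5, 3, 5, 3]]
theta = lda_gibbs(docs, n_topics=2, vocab_size=6)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` is one document's estimated topic distribution and sums to 1. On a real corpus the number of topics would normally be chosen with a model evaluation measure; the record does not state which measure the paper uses.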

Topic Discovery and Clustering for Online Journals Based on LDA Algorithm
YANG Chuanchun, ZHANG Bingxue, LI Rende and GUO Qiang. Topic Discovery and Clustering for Online Journals Based on LDA Algorithm[J]. Journal of University of Shanghai for Science and Technology, 2019, 41(3): 273-280.
Authors:YANG Chuanchun  ZHANG Bingxue  LI Rende and GUO Qiang
Institution:Research Center of Complex Systems Science, University of Shanghai for Science and Technology, Shanghai 200093, China; MPA Education Center, University of Shanghai for Science and Technology, Shanghai 200093, China; Research Center of Complex Systems Science, University of Shanghai for Science and Technology, Shanghai 200093, China; Research Center of Complex Systems Science, University of Shanghai for Science and Technology, Shanghai 200093, China
Abstract:With the spread of smart terminals, demand for text topic mining is growing across many domains. Topic modeling is the core of text topic mining. The LDA (latent Dirichlet allocation) generative model is a probabilistic model based on a Bayesian framework, and it addresses the extraction of latent topics from text on the basis of semantic association. The key techniques of the text clustering process, including the LDA generative model, data sampling, and model evaluation, are described and analyzed in depth. Topic discovery and clustering experiments were carried out on 2 794 learning journals from an online education platform, and a lexicon of 3 800 terms was built. Topic clustering was solved in two steps, using the kmeans algorithm and the UVM (union vector method) algorithm. A general method for text mining experiments is also proposed, and the text distance algorithm used in hierarchical clustering is improved. The experimental results show that the overall topic similarity of the journals on the platform is high, but the topics are so concentrated that the content of many journals is hard to tell apart, which hampers users in locating topics.
Keywords:LDA model  generative model  topic discovery  hierarchical clustering  text mining
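
The abstract describes a two-step clustering built on kmeans and UVM but gives no details of UVM, so only the kmeans step is sketched here. This is a minimal illustration under the assumption that each journal is represented by its document-topic vector and compared with squared Euclidean distance. The function name `kmeans`, the made-up vectors, and the choice of `k=2` are hypothetical and do not come from the paper.

```python
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means with squared Euclidean distance."""
    rng = random.Random(seed)
    # Start from k distinct points picked at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centers

# Hypothetical document-topic vectors (three topics) for four journals.
topic_vectors = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
]
labels, centers = kmeans(topic_vectors, k=2)
print(labels)
```

Journals dominated by the same topic end up with the same label. When most documents share a similar topic mixture, as the abstract reports for this platform, the clusters lie close together and are harder to separate.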
This article is indexed in CNKI and other databases.