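Weibo topic detection based on improved TF-IDF algorithm
Cite this article: CHEN Shuoying, JIN Zhensheng. Weibo topic detection based on improved TF-IDF algorithm[J]. Science & Technology Review (Beijing), 2016, 34(2): 282-286.
Authors: CHEN Shuoying  JIN Zhensheng
Affiliations: 1. Network Information Center, Beijing Institute of Technology, Beijing 100081;
2. School of Computer Science, Beijing Institute of Technology, Beijing 100081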
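Abstract: Chinese microblogs (Weibo) are updated rapidly and are highly time-sensitive, so the hot topics they generate tend to be bursty, and the representative feature words in the text surge at the same time. Exploiting this property, an improved feature-weighting algorithm built on the traditional TF-IDF (term frequency-inverse document frequency) is proposed, called TF-IDF-KE (term frequency-inverse document frequency-kinetic energy), to address the problem that bursty hot topics have indistinct features during clustering. Drawing on the principle of the kinetic energy of a physical object, the algorithm describes the burst value of a feature term in terms of kinetic energy and incorporates it into the weight calculation, raising the weight of bursty feature terms. Finally, the CURE (clustering using representatives) algorithm is used to perform Weibo topic detection. The method captures the dynamic properties of texts and feature terms, and experimental results show that it effectively improves topic detection.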
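The abstract does not give the TF-IDF-KE formula, so the following is only a minimal sketch of the general idea, under loudly stated assumptions. The kinetic energy formula E = 1/2 · m · v² is standard physics; treating the share of posts containing a term as its "mass", the growth of that share between two time windows as its "velocity", and multiplying TF-IDF by (1 + alpha · E) are all hypothetical choices made here for illustration, as are the function names and the `alpha` parameter. The paper's actual definitions may differ, and the CURE clustering step is not shown.

```python
import math
from collections import Counter


def doc_freq(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def burst_energy(cur_share, prev_share):
    """Hypothetical burst score E = 1/2 * m * v^2 (mapping assumed, not from the paper).

    m: share of current-window posts containing the term
    v: growth of that share since the previous window (a decline counts as 0)
    """
    velocity = max(cur_share - prev_share, 0.0)
    return 0.5 * cur_share * velocity ** 2


def tf_idf_ke(doc, prev_docs, cur_docs, alpha=1.0):
    """Weight each term of `doc` by TF-IDF scaled with an assumed burst factor."""
    n_cur, n_prev = len(cur_docs), len(prev_docs)
    df_cur, df_prev = doc_freq(cur_docs), doc_freq(prev_docs)
    weights = {}
    for term, count in Counter(doc).items():
        tf = count / len(doc)
        # Smoothed IDF so that unseen terms do not cause a division by zero.
        idf = math.log((1 + n_cur) / (1 + df_cur[term])) + 1
        cur_share = df_cur[term] / n_cur if n_cur else 0.0
        prev_share = df_prev[term] / n_prev if n_prev else 0.0
        # Assumed combination: alpha = 0 reduces to plain TF-IDF.
        energy = burst_energy(cur_share, prev_share)
        weights[term] = tf * idf * (1 + alpha * energy)
    return weights


# Toy tokenised posts from two consecutive time windows (made-up data).
previous = [["weather", "sunny"], ["lunch", "noodles"]]
current = [["earthquake", "rescue"], ["earthquake", "news"], ["lunch", "noodles"]]
weights = tf_idf_ke(current[0], previous, current)
```

In this toy data, "earthquake" is absent from the previous window and frequent in the current one, so its weight is raised above plain TF-IDF, while a term whose share does not grow, such as "lunch", keeps its plain TF-IDF weight.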

Keywords: Weibo  TF-IDF  topic detection  TDT  text clustering
Received: 2015-04-23

Weibo topic detection based on improved TF-IDF algorithm
CHEN Shuoying,JIN Zhensheng.Weibo topic detection based on improved TF-IDF algorithm[J].Science & Technology Review,2016,34(2):282-286.
Authors:CHEN Shuoying  JIN Zhensheng
Institution:1. Department of Network Information Center, Beijing Institute of Technology, Beijing 100081, China;
2. School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China
Abstract: Topic detection and tracking (TDT) is a natural language processing task that addresses the problem of information explosion, and TDT for Weibo has become a central issue in recent years. Topic detection on long texts is widely used in industry with good results, but performance on Weibo's short texts is usually poor: posts are short and their meaning is often unclear, so clustering algorithms perform less well in topic detection. This paper therefore seeks a way to improve clustering for Weibo. Weibo is updated rapidly and is highly time-sensitive. Its hot topics are bursty, and the frequency of their representative words rises sharply. Raising the weight of these representative words is thus a good way to make the features of short texts more prominent. The burstiness of a word is treated by analogy with the kinetic energy of a moving object, and the kinetic energy formula is used to quantify it. On this basis, an improved feature extraction algorithm named TF-IDF-KE (term frequency-inverse document frequency-kinetic energy) is proposed, which combines kinetic energy with TF-IDF (term frequency-inverse document frequency): the kinetic energy formula evaluates the burstiness of each word, and the resulting value is added to the weight formula, so that the weights of important words are raised during feature extraction. Finally, the CURE (clustering using representatives) algorithm performs the Weibo topic detection task. The method describes the burstiness of texts and features and, to a certain extent, solves the problem that the features of bursty hot topics are indistinct during clustering.
The experimental results show that the method improves topic detection to some degree, achieving better precision (P), recall (R), and F values. TF-IDF-KE is therefore an effective optimization that is well suited to the TDT task.
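The abstract reports P, R, and F values without saying how they are computed. As an illustration only, the sketch below uses one common convention for scoring a detected cluster against a labelled topic: precision is the share of the cluster that belongs to the topic, recall is the share of the topic that the cluster recovers, and F is their harmonic mean. The function name and the representation of clusters as sets of post IDs are assumptions, not taken from the paper.

```python
def precision_recall_f(cluster, topic):
    """Return (P, R, F) for one detected cluster against one labelled topic.

    Both arguments are collections of post IDs (an assumed representation).
    """
    cluster, topic = set(cluster), set(topic)
    hits = len(cluster & topic)  # posts correctly assigned to the topic
    p = hits / len(cluster) if cluster else 0.0
    r = hits / len(topic) if topic else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Made-up example: 3 of the 4 clustered posts belong to a 5-post topic.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

Here P is 3/4 and R is 3/5, so a cluster can score well on one measure and poorly on the other, which is why F is usually reported alongside both.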
Keywords:Weibo  TF-IDF  topic detection  TDT  text clustering  
This article is indexed in CNKI and other databases.