
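A Method to Improve the Quality of Text Clustering Algorithms
Cite this article: FENG Shaorong. A Method to Improve the Quality of Text Clustering Algorithms[J]. Journal of Tongji University (Natural Science), 2008, 36(12)
Author: FENG Shaorong
Affiliation: School of Information Science and Technology, Xiamen University, Xiamen 361005, Fujian, China
Funding: Supported by the National Natural Science Foundation of China
Abstract: Text clustering algorithms based on the vector space model (VSM) ignore the semantic information between words and the relationships among dimensions, so their text similarity calculations are not precise enough. To address this, a scheme is proposed that computes inter-document similarity from semantic distance and clusters in two phases to improve text clustering quality. First, documents are analyzed semantically and clustered a first time with the nearest neighbor algorithm. Second, the feature words of each class are filtered by similarity weight, keeping only the fittest. Next, classes are merged. Finally, a second clustering pass resolves the nearest neighbor algorithm's sensitivity to input order. Experimental results show that the proposed method significantly improves both clustering precision and recall, and largely resolves the problems of VSM-based text clustering algorithms.

Keywords: text clustering  semantic distance  nearest neighbor clustering  similarity  clustering algorithm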

A Method to Improve Text Clustering Algorithm Quality
FENG Shaorong. A Method to Improve Text Clustering Algorithm Quality[J]. Journal of Tongji University (Natural Science), 2008, 36(12)
Authors: FENG Shaorong
Abstract: The main problem with text clustering algorithms based on the vector space model (VSM) is that they overlook the semantic information between words and the links between dimensions, which makes the text similarity calculation inaccurate. A method that computes text similarity from semantic distance and clusters in two phases is proposed to improve text clustering quality. First, each text is analyzed semantically and the nearest neighbor algorithm performs the first clustering. Next, feature words are selected by similarity weight so that each cluster is represented by the words closest to its main themes, and then clusters are merged. Finally, a second clustering pass corrects the nearest neighbor algorithm's sensitivity to the input order of the documents. Simulation experiments indicate that the proposed algorithm addresses these problems and outperforms VSM-based text clustering in clustering precision and recall.
Keywords: text clustering  semantic distance  nearest neighbor clustering  similarity  clustering algorithm
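
The two-phase idea in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not define the semantic distance, so a plain Jaccard overlap of word sets stands in for it, and the feature-word selection and cluster-merging steps are omitted. The names `jaccard`, `cluster_similarity`, `first_pass`, `second_pass` and the `threshold` parameter are all hypothetical.

```python
from typing import Callable, List, Sequence, Set

Doc = Set[str]
Similarity = Callable[[Doc, Doc], float]


def jaccard(a: Doc, b: Doc) -> float:
    """Toy stand-in (an assumption) for the paper's semantic similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_similarity(doc: Doc, members: Sequence[Doc], sim: Similarity) -> float:
    """Average similarity between a document and a non-empty list of members."""
    return sum(sim(doc, m) for m in members) / len(members)


def first_pass(docs: Sequence[Doc], sim: Similarity, threshold: float) -> List[List[int]]:
    """Single-pass nearest neighbor clustering; the result depends on input order."""
    clusters: List[List[int]] = []
    for i, doc in enumerate(docs):
        best, best_score = -1, threshold
        for c, members in enumerate(clusters):
            score = cluster_similarity(doc, [docs[j] for j in members], sim)
            if score >= best_score:
                best, best_score = c, score
        if best == -1:
            clusters.append([i])  # nothing is similar enough: open a new cluster
        else:
            clusters[best].append(i)
    return clusters


def second_pass(docs: Sequence[Doc], clusters: List[List[int]],
                sim: Similarity, threshold: float) -> List[List[int]]:
    """Reassign every document against the fixed first-pass clusters."""
    owner = {i: c for c, members in enumerate(clusters) for i in members}
    result: List[List[int]] = [[] for _ in clusters]
    for i, doc in enumerate(docs):
        def score(c: int) -> float:
            # Compare with the other members only, never with the document itself.
            others = [docs[j] for j in clusters[c] if j != i]
            return cluster_similarity(doc, others, sim) if others else 0.0

        target = owner[i]
        best = score(target)
        for c in range(len(clusters)):
            s = score(c)
            # Move only to a strictly better cluster that also meets the threshold.
            if s > best and s >= threshold:
                target, best = c, s
        result[target].append(i)
    return [members for members in result if members]


docs = [{"a", "b"}, {"a", "b", "c"}, {"a", "x", "y"}, {"x", "y", "z"}, {"x", "y", "w"}]
rough = first_pass(docs, jaccard, threshold=0.2)
refined = second_pass(docs, rough, jaccard, threshold=0.2)
print(rough, refined)
```

In this toy input, document 2 arrives before any "x, y" cluster exists, so the first pass files it under cluster 0 (`[[0, 1, 2], [3, 4]]`). The second pass sees all clusters at once and moves it, giving `[[0, 1], [2, 3, 4]]`. Because every document is scored against the same fixed first-pass clusters, the second pass itself does not depend on the order in which documents are visited.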
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.