
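Incremental Algorithm of Text Soft Clustering
Cite this article: Feng Zhonghui, Bao Junpeng, Shen Junyi. Incremental Algorithm of Text Soft Clustering [J]. Journal of Xi'an Jiaotong University, 2007, 41(4): 398-401, 411.
Authors: Feng Zhonghui, Bao Junpeng, Shen Junyi
Affiliation: School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049
Abstract: Traditional text clustering algorithms have high time complexity, while algorithms that do not depend on distance are unsuitable for dynamic, changing text collections. To address these problems, an incremental text soft clustering algorithm based on semantic sequences is proposed. The algorithm takes the multi-topic nature of long texts into account and uses the similarity relation between semantic sequences to compute the coverage of sets of similar semantic sequences. At each step, the candidate cluster with the minimum entropy overlap value is selected as a result cluster, which greatly reduces the dimensionality of the text vector space throughout clustering and shortens the computation time. Because the semantic sequences used by the algorithm depend only on the text itself, the algorithm is suitable for incremental clustering. Experimental results show that its clustering precision is higher than that of other clustering algorithms under the same conditions, and that it is especially well suited to soft clustering of long-text collections.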

Keywords: semantic sequence; incremental clustering; soft clustering; text clustering
Article ID: 0253-987X(2007)04-0398-04
Revised: 2006-07-05

Incremental Algorithm of Text Soft Clustering
Feng Zhonghui,Bao Junpeng,Shen Junyi.Incremental Algorithm of Text Soft Clustering[J].Journal of Xi'an Jiaotong University,2007,41(4):398-401,411.
Authors:Feng Zhonghui  Bao Junpeng  Shen Junyi
Abstract: Traditional text clustering algorithms have high time complexity, distance-independent algorithms are unsuitable for dynamic, changing corpora, and traditional algorithms cannot account for the multi-topic nature of a single text. To address these problems, an incremental text soft clustering algorithm based on semantic sequences is proposed. The algorithm uses the similarity relation between semantic sequences to calculate the coverage of sets of similar semantic sequences, and at each step selects the candidate cluster with the minimum entropy overlap value as a result cluster. This sharply reduces the dimensionality of the text vector space during clustering and therefore the computing time. Because a semantic sequence depends only on its own text, the algorithm is suitable for incremental clustering. Experimental comparisons show that it achieves higher precision than other algorithms under the same conditions, especially for soft clustering of long-text collections.
Keywords:semantic sequence  incremental clustering  soft clustering  text clustering
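
The abstract does not define "entropy overlap" or say how candidate clusters are built, so the following Python fragment is only a rough sketch of the selection step under stated assumptions. It assumes the definition commonly used in frequent-term-based clustering, EO(C) = sum over documents d in C of -(1/f_d) ln(1/f_d), where f_d is the number of remaining candidate clusters that contain d. The function names, the input format (a label mapped to a set of document ids), and the stopping rule (stop once every document is covered) are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy_overlap(docs, doc_frequency):
    """Entropy overlap of one candidate cluster (assumed definition).

    docs: iterable of document ids in the candidate.
    doc_frequency: maps a document id to the number of remaining
    candidates that contain it.
    """
    # fsum gives an exactly rounded total, so the result does not depend
    # on the order in which the documents are visited.
    return math.fsum(
        -(1.0 / doc_frequency[d]) * math.log(1.0 / doc_frequency[d])
        for d in docs
    )


def greedy_min_entropy_overlap(candidates):
    """Greedily pick candidates with minimum entropy overlap.

    candidates: dict mapping a (hypothetical) label to a set of doc ids.
    Returns the labels in selection order. Documents keep all their
    memberships, so the result is a soft (overlapping) clustering.
    """
    remaining = {label: set(docs) for label, docs in candidates.items() if docs}
    uncovered = set().union(*remaining.values()) if remaining else set()
    selected = []
    while uncovered:
        # Only candidates that still cover a new document are eligible.
        eligible = [l for l in remaining if remaining[l] & uncovered]
        freq = Counter(d for docs in remaining.values() for d in docs)
        # Ties are broken by label so that the result is deterministic.
        best = min(eligible, key=lambda l: (entropy_overlap(remaining[l], freq), l))
        selected.append(best)
        uncovered -= remaining.pop(best)
    return selected


if __name__ == "__main__":
    toy = {"A": {1, 2, 3}, "B": {3, 4, 5}, "C": {2, 3, 4}}
    print(greedy_min_entropy_overlap(toy))
```

In the toy input, document 3 belongs to both selected candidates, which is the sense in which the result is a soft clustering. How the paper derives candidates from similar semantic sequence sets and their coverage is not stated in the abstract and is not modeled here.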
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.