Novel Weighted Suffix Tree Clustering for Web Documents
Cite this article: YANG Rui-long, ZHU Qing-sheng, XIE Hong-tao, QU Hong-chun. Novel Weighted Suffix Tree Clustering for Web Documents[J]. Journal of System Simulation, 2011, 23(3): 474-479
Authors: YANG Rui-long  ZHU Qing-sheng  XIE Hong-tao  QU Hong-chun
Affiliation: College of Computer Science, Chongqing University, Chongqing 400044, China
Funding: National Key Technology R&D Program of China (2007BAH08B04); Chongqing Science and Technology Support Program (2008AC20084)
Abstract: A novel Weighted Suffix Tree Clustering (WSTC) method is proposed that exploits the structure and characteristics of Web documents. First, according to its HTML tags, each Web document is divided into segments with different levels of importance; segments are split into sentences, and sentences into words. Second, the suffix tree is built from sentences rather than from whole documents, and each sentence's importance level is stored in the suffix tree nodes as a structure weight, yielding a weighted suffix tree model of the document set. Finally, base clusters are selected and merged using a combination of the number of documents and the number of sentences a node contains, the phrase length, and the structure weight. Simulation experiments show that WSTC achieves better clustering results than the traditional STC algorithm.

Keywords: suffix tree  suffix tree clustering  Web document clustering  Web document structure  weight computing
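
The first step described in the abstract (splitting a document into weighted segments by HTML tag, then into sentences and words) can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the tag-to-weight table `TAG_WEIGHTS`, the default weight, the sentence-splitting rule, and the names `SegmentParser` and `weighted_sentences` are all assumptions, since the abstract does not give the actual importance levels or tokenization.

```python
import re
from html.parser import HTMLParser

# Hypothetical importance levels per HTML tag; the paper's actual
# values are not stated in the abstract.
TAG_WEIGHTS = {"title": 3, "h1": 3, "h2": 2, "h3": 2, "b": 2, "strong": 2}
DEFAULT_WEIGHT = 1


class SegmentParser(HTMLParser):
    """Collects (weight, text) segments, weighting text by its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags that carry a weight
        self.segments = []  # list of (weight, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Use the highest weight among the enclosing weighted tags.
            weight = max((TAG_WEIGHTS[t] for t in self.stack),
                         default=DEFAULT_WEIGHT)
            self.segments.append((weight, text))


def weighted_sentences(html):
    """Return a list of (structure_weight, words) pairs, one per sentence."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    result = []
    for weight, text in parser.segments:
        # Naive sentence split on terminal punctuation (an assumption).
        for sentence in re.split(r"[.!?]+", text):
            words = re.findall(r"\w+", sentence.lower())
            if words:
                result.append((weight, words))
    return result
```

Each returned pair is a sentence's word list together with its structure weight. In a WSTC-style pipeline these sentences, rather than whole documents, would be the strings inserted into the suffix tree, with the weight recorded on the nodes they pass through.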
This article is indexed in CNKI, Wanfang Data, and other databases.