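A Theme-Based Web Text Clustering Algorithm
Cite this article: YUAN Xiaofeng. A Theme-Based Web Text Clustering Algorithm [J]. Journal of Chengdu University (Natural Science Edition), 2010, 29(3): 249-252.
Author: YUAN Xiaofeng
Affiliation: School of Information Science and Technology, Yancheng Normal University, Yancheng, Jiangsu 224002
Abstract: A theme-based Web text clustering method (HTBC) is designed. It first extracts a theme-word vector for each document from the document's title and body. It then generates word clusters from a training document set and assigns each theme-word vector to the word class it belongs to. Documents whose theme-word vectors fall into the same word class are merged into a cluster represented by that word class's name, which achieves the clustering. The algorithm has four steps: preprocessing, building theme vectors, generating word clusters, and theme clustering. HTBC is also compared with the STC, AHC, and KMC algorithms in terms of clustering accuracy and recall; the experimental results show that HTBC achieves better accuracy than STC, AHC, and KMC.

Keywords: HTBC algorithm; Web text clustering; theme; search engine; mutual information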

A Clustering Algorithm for Web Document Based on Theme
YUAN Xiaofeng. A Clustering Algorithm for Web Document Based on Theme[J]. Journal of Chengdu University (Natural Science), 2010, 29(3): 249-252.
Authors: YUAN Xiaofeng
Institution: School of Information Science and Technology, Yancheng Normal University, Yancheng 224002, China
Abstract: A theme-based clustering method, HTBC, is devised. It extracts keywords from the title and main body of each document, trains on a text set to generate word clusters, assigns each keyword vector to its corresponding word cluster, and groups the documents assigned to the same word cluster, thereby realizing the clustering. HTBC has four steps: preprocessing, constructing the theme vector, generating the word clusters, and theme clustering. HTBC is compared with K-Means, AHC, and STC in terms of accuracy and recall; the experimental data indicate that HTBC achieves better accuracy than the other three algorithms.
Keywords: HTBC; Web document clustering; theme; search engine; mutual information
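
The four steps named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives no formulas, so every detail here is an assumption made for illustration. This includes the stopword list, the `title_weight` and `top_k` parameters, the use of document-level pointwise mutual information with a threshold to link words, the union-find grouping, and the rule that names a word cluster after its most frequent member. All function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations

# Assumed toy stopword list; the paper's preprocessing is not specified.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def preprocess(text):
    """Step 1 (assumed form): lowercase, tokenize, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def theme_vector(title, body, title_weight=3.0, top_k=3):
    """Step 2 (assumed form): score words, weighting title words more, keep top_k."""
    scores = Counter()
    for w in preprocess(title):
        scores[w] += title_weight
    for w in preprocess(body):
        scores[w] += 1.0
    return dict(scores.most_common(top_k))


def word_clusters(training_docs, pmi_threshold=0.5, min_cooccur=2):
    """Step 3 (assumed form): link word pairs with high document-level PMI.

    Returns a mapping word -> name of the word cluster it belongs to.
    """
    n = len(training_docs)
    doc_sets = [set(preprocess(d)) for d in training_docs]
    df = Counter(w for s in doc_sets for w in s)
    pair_df = Counter(p for s in doc_sets for p in combinations(sorted(s), 2))

    parent = {w: w for w in df}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for (a, b), c in pair_df.items():
        # PMI = log( P(a,b) / (P(a) P(b)) ) with document-frequency estimates.
        pmi = math.log(c * n / (df[a] * df[b]))
        if c >= min_cooccur and pmi >= pmi_threshold:
            parent[find(a)] = find(b)

    members = defaultdict(list)
    for w in df:
        members[find(w)].append(w)

    names = {}
    for group in members.values():
        # Assumed naming rule: most frequent member, ties broken alphabetically.
        name = min(group, key=lambda w: (-df[w], w))
        for w in group:
            names[w] = name
    return names


def theme_clustering(docs, word_to_cluster):
    """Step 4 (assumed form): group documents by the word cluster that
    receives the most weight from their theme vector."""
    clusters = defaultdict(list)
    for doc_id, (title, body) in docs.items():
        votes = Counter()
        for w, weight in theme_vector(title, body).items():
            votes[word_to_cluster.get(w, "unclustered")] += weight
        label = votes.most_common(1)[0][0] if votes else "unclustered"
        clusters[label].append(doc_id)
    return dict(clusters)


training = ["python code", "python code", "soccer goal", "soccer goal"]
docs = {
    "d1": ("Python code tips", "python code examples"),
    "d2": ("Soccer goal", "a late goal in soccer"),
    "d3": ("More python", "code code python"),
}
print(theme_clustering(docs, word_clusters(training)))
# prints {'code': ['d1', 'd3'], 'goal': ['d2']}
```

In this toy run the training set yields two word clusters, named `code` and `goal` under the assumed naming rule, and each document is filed under the cluster that holds most of its theme-vector weight. A real system would need a proper tokenizer (for Chinese text, a word segmenter) and thresholds tuned on the training set.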
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).