
基于语义相似聚合的主题爬虫算法研究
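Cite this article: 吴林, 王永滨. 基于语义相似聚合的主题爬虫算法研究[J]. 中国传媒大学学报, 2018, 25(1): 28-31.
Authors: 吴林 (WU Lin), 王永滨 (WANG Yong-bin)
Affiliations: Computer and Network Center, Communication University of China, Beijing 100024; Science and Technology Office, Communication University of China, Beijing 100024
Abstract (translated from the Chinese): The rapid growth of the Internet and the continual increase in data make personalized data increasingly hard to obtain. The topic crawler, a form of vertical search, has become an active research area. A traditional topic crawler typically downloads pages by following the links between them and only afterwards computes how relevant each downloaded page is to the given topic. This has two drawbacks. First, it separates the link structure of web pages from their topical content, so the two are computed independently. Second, the pages fetched during crawling are only weakly related to the topic, so a large number of topic-irrelevant pages are downloaded. This paper proposes a new PageRank-based topic crawler algorithm that combines page-topic similarity computation with the traditional PageRank algorithm, uniting link structure and topical relevance. It also introduces semantic similarity into the topic crawler. Experimental results show that the proposed topic crawler algorithm based on semantic similarity aggregation greatly improves the recall of the topic crawler.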

Keywords: topic crawler; PageRank algorithm; semantic similarity; similarity aggregation

The Research of the Topic Crawler Algorithm Based on Semantic Similarity Aggregation
WU Lin, WANG Yong-bin. The Research of the Topic Crawler Algorithm Based on Semantic Similarity Aggregation[J]. Journal of Communication University of China Science and Technology, 2018, 25(1): 28-31.
Authors:WU Lin  WANG Yong-bin
Abstract: With the rapid development of the Internet and the steady growth of data, it is increasingly difficult to fetch personalized data. As a vertical search method, the topic crawler has become a hot research area. Traditional topic crawlers often download web pages by following the link structure between pages and then calculate the correlation between the given topic and the downloaded pages. On the one hand, the traditional topic crawler separates the link structure of web pages from their topical content, so that the two parts are calculated separately; on the other hand, the pages fetched during the download process are only weakly correlated with the given topic, so a large number of topic-unrelated web pages are downloaded. In this paper, a new topic crawler algorithm based on the PageRank algorithm is proposed that combines the link structure of web pages with the correlation between the given topic and the web pages. In addition, this paper introduces semantic similarity into the topic crawler, and the experimental results show that the topic crawler algorithm based on semantic similarity aggregation greatly improves the recall rate.
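
The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea it describes: letting page-topic similarity bias a PageRank-style computation so that link structure and topical relevance are scored together. Everything here is an assumption made for illustration. The function names (`cosine_similarity`, `topic_weighted_pagerank`), the damping factor of 0.85, and the toy pages are hypothetical, and a plain bag-of-words cosine stands in for the semantic similarity measure, which the abstract does not specify.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of bag-of-words term-frequency vectors.

    Assumption: a lexical stand-in for the paper's semantic similarity.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def topic_weighted_pagerank(links, relevance, damping=0.85, iterations=50):
    """PageRank in which rank flows preferentially to topic-relevant pages.

    links:     dict mapping each page to the list of pages it links to
    relevance: dict mapping each page to its topic similarity in [0, 1]
    Assumption: this weighting scheme is illustrative, not the paper's.
    """
    pages = list(links)
    eps = 1e-9  # small floor so pages with zero relevance stay reachable
    weight = {p: relevance.get(p, 0.0) + eps for p in pages}
    total = sum(weight.values())
    teleport = {p: weight[p] / total for p in pages}

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {p: (1 - damping) * teleport[p] for p in pages}
        for src in pages:
            targets = [t for t in links[src] if t in weight]
            if not targets:
                # Dangling page: spread its rank by the teleport distribution.
                for p in pages:
                    new_rank[p] += damping * rank[src] * teleport[p]
                continue
            out_weight = sum(weight[t] for t in targets)
            for t in targets:
                # Each out-link's share is proportional to its target's relevance.
                new_rank[t] += damping * rank[src] * weight[t] / out_weight
        rank = new_rank
    return rank


# Hypothetical toy data: three pages and a topic description.
topic = "web crawler search"
docs = {
    "a": "topic crawler for vertical search on the web",
    "b": "cooking recipes and kitchen tips",
    "c": "web crawler scheduling and search ranking",
}
links = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}

relevance = {page: cosine_similarity(topic, text) for page, text in docs.items()}
rank = topic_weighted_pagerank(links, relevance)
```

In this sketch a crawler would fetch pages in descending order of `rank`, so an off-topic page such as `b` is deprioritized even though `a` links to it, which is the effect the abstract attributes to combining the two signals.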
This article is indexed in Wanfang Data (万方数据) and other databases.