Algorithm of Parallelized Elimination of Duplicated Web Pages Based on Map/Reduce
Cite this article: ZHANG Yuan-feng, DONG Shou-bin, ZHANG Ling, CHEN Xiao-zhi. Algorithm of Parallelized Elimination of Duplicated Web Pages Based on Map/Reduce[J]. Journal of Guangxi Normal University (Natural Science Edition), 2007, 25(2): 153-156.
Authors: ZHANG Yuan-feng  DONG Shou-bin  ZHANG Ling  CHEN Xiao-zhi
Institution: Communication and Computer Network Key Lab of Guangdong, South China University of Technology, Guangzhou 510640, Guangdong, China
Foundation item: Supported by the National Natural Science Foundation of China (90412015)
Abstract: The duplicated-page elimination module is an important component of a search engine. It filters the web pages downloaded by the crawler and removes pages with duplicated content, thereby improving the performance of the crawler and the quality of retrieval. A parallel algorithm for eliminating duplicated web pages and an implementation mechanism based on Map/Reduce are proposed. Experiments on a real web site verify the stability of the algorithm and its parallel performance when processing large numbers of web pages.

Keywords: search engine  elimination of duplicated web pages  Map/Reduce
Article ID: 1001-6600(2007)02-0153-04
Received: 2006-12-15
Revised: 2006-12-15

Algorithm of Parallelized Elimination of Duplicated Web Pages Based on Map/Reduce
ZHANG Yuan-feng, DONG Shou-bin, ZHANG Ling, CHEN Xiao-zhi. Algorithm of Parallelized Elimination of Duplicated Web Pages Based on Map/Reduce[J]. Journal of Guangxi Normal University (Natural Science Edition), 2007, 25(2): 153-156.
Authors: ZHANG Yuan-feng  DONG Shou-bin  ZHANG Ling  CHEN Xiao-zhi
Institution: Communication and Computer Network Key Lab of Guangdong, South China University of Technology, Guangzhou 510640, China
Abstract: The module of elimination of duplicated web pages, which filters the web pages downloaded by the crawler module and gets rid of the duplicated pages, is an important part of a search engine. This module can improve the performance of the crawler module and the quality of the search results of a search engine. An algorithm of elimination of duplicated web pages and an implementation strategy based on Map/Reduce are proposed. Their stability and parallel performance in large-scale web page processing are demonstrated when applied to a real web site in our experiment.
Keywords: search engine  elimination of duplicated web pages  Map/Reduce
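
The abstract describes the approach only at a high level. As a rough illustration of how a Map/Reduce formulation of duplicate elimination can look, the sketch below is not the authors' implementation: it assumes Hadoop's MapReduce API, a tab-separated input of "URL <TAB> page content" records, and a whole-page MD5 fingerprint as the duplicate key; the class names DuplicateElimination, FingerprintMapper, and KeepFirstReducer are hypothetical, and the paper's actual fingerprinting and filtering strategy may differ.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical sketch: eliminate duplicated pages by grouping them on a content fingerprint.
public class DuplicateElimination {

    // Map phase: emit (fingerprint of page content, URL), so pages with identical
    // content share a key and are routed to the same reducer.
    public static class FingerprintMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text url, Text pageContent, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(md5Hex(pageContent.toString())), url);
        }

        private static String md5Hex(String s) throws IOException {
            try {
                MessageDigest md = MessageDigest.getInstance("MD5");
                StringBuilder hex = new StringBuilder();
                for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                    hex.append(String.format("%02x", b & 0xff));
                }
                return hex.toString();
            } catch (NoSuchAlgorithmException e) {
                throw new IOException(e);
            }
        }
    }

    // Reduce phase: all URLs sharing a fingerprint arrive together; keep one
    // representative page per fingerprint and drop the rest as duplicates.
    public static class KeepFirstReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text fingerprint, Iterable<Text> urls, Context context)
                throws IOException, InterruptedException {
            for (Text url : urls) {
                context.write(url, fingerprint);
                break;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "web-page-dedup");
        job.setJarByClass(DuplicateElimination.class);
        // Assumes input lines of the form "URL<TAB>page content".
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(FingerprintMapper.class);
        job.setReducerClass(KeepFirstReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Grouping on the fingerprint key is what makes the elimination parallel: each reduce task independently decides which of its duplicates to keep, so the work scales with the number of reducers. A near-duplicate detection scheme (for example, shingling or SimHash) would replace the exact MD5 key but keep the same Map/Reduce skeleton.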
This document has been indexed by CNKI, VIP (Weipu), Wanfang Data, and other databases.