MapReduce based web crawler design and implementation

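Cite this article:	LI Chen, ZHU Shi wei, ZHAO Yan qing, YU Jun feng. MapReduce based web crawler design and implementation[J]. Shandong Science, 2015, 28(2): 101-107.

Authors:	LI Chen, ZHU Shi wei, ZHAO Yan qing, YU Jun feng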

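Affiliation:	Information Institute, Shandong Academy of Sciences, Jinan, Shandong 250014, China

Funding:	Youth Fund of the Shandong Academy of Sciences (2013QN036); Shandong Province Science and Technology Development Program (2013GGX10127; 2014GGX101013)

Abstract:	To address the low efficiency and poor scalability of single-machine crawlers, this paper designs and implements a web crawler system based on MapReduce. The system first uses HDFS and HBase to store and manage web page information, and extracts page content with a method based on a line-block distribution function. It then applies a deduplication strategy that combines URL analysis with content similarity analysis, using the Simhash algorithm to measure the similarity of the crawled pages. Experimental results show that the system has good performance and scalability, and that its average crawl speed is 4.8 times higher than that of a single-machine crawler.
Keywords:	web crawler; information extraction; text deduplication; Hadoop; MapReduce
Received:	2015-01-21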
MapReduce based web crawler design and implementation

LI Chen, ZHU Shi wei, ZHAO Yan qing, YU Jun feng. MapReduce based web crawler design and implementation[J]. Shandong Science, 2015, 28(2): 101-107.

Authors:	LI Chen, ZHU Shi wei, ZHAO Yan qing, YU Jun feng

Institution:	Information Institute, Shandong Academy of Sciences, Jinan 250014, China

Abstract:	We design and implement a MapReduce-based web crawler system to address the low efficiency and poor scalability of single-machine crawlers. The system uses HDFS and HBase to store web page information and extracts page content with a method based on a line-block distribution function. It then deduplicates pages by combining URL analysis with content similarity analysis, using the Simhash algorithm to measure the similarity of the crawled pages. Experimental results show that the system performs and scales well, and that it increases average crawling speed by 4.8 times compared with a single-machine crawler.

Keywords:	Hadoop; information extraction; text deduplication; MapReduce; web crawler
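
The deduplication strategy named in the abstract (URL matching combined with Simhash content similarity) can be illustrated with a minimal single-process sketch. This is not the authors' implementation: the tokenizer, the 64-bit fingerprint width, the Hamming-distance threshold of 3, the URL normalization, and the names `simhash`, `hamming_distance`, and `Deduplicator` are all assumptions made for illustration, and the MapReduce distribution of the work is omitted.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumed; the abstract does not state it)


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def simhash(text):
    """Return a 64-bit Simhash fingerprint of the text."""
    weights = [0] * BITS
    for token in tokenize(text):
        # Hash each token to a stable 64-bit integer; MD5 is used only for bit mixing.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class Deduplicator:
    """Hypothetical helper: reject a page if its URL or its content was seen."""

    def __init__(self, max_distance=3):
        self.max_distance = max_distance  # assumed near-duplicate threshold
        self.seen_urls = set()
        self.fingerprints = []

    def is_duplicate(self, url, text):
        # Stage 1: exact match on a lightly normalized URL.
        normalized = url.strip().rstrip("/")
        if normalized in self.seen_urls:
            return True
        self.seen_urls.add(normalized)
        # Stage 2: near-duplicate content check via Simhash.
        fp = simhash(text)
        for old in self.fingerprints:
            if hamming_distance(fp, old) <= self.max_distance:
                return True
        self.fingerprints.append(fp)
        return False
```

The linear scan over stored fingerprints keeps the sketch short. A crawler at scale would more plausibly index fingerprints by bit blocks or run the comparison as a distributed job, but the abstract does not say which approach the system takes.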
This article is indexed in CNKI, Wanfang Data, and other databases.