
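Research and Optimization of Web Page Update Prediction in Nutch
Cite this article: HU Wei, WU Haitao. Research and optimization of web page update prediction in Nutch[J]. Journal of Shanghai Normal University (Natural Sciences), 2016, 45(4): 448-457.
Authors: HU Wei, WU Haitao
Affiliation: Shanghai Normal University (both authors)
Abstract: Nutch predicts web page updates with the adjacent-comparison method, whose update parameters must be set manually and cannot adapt on their own, so it cannot cope with the differing update behavior of a massive number of web pages. To solve this problem, a dynamic selection strategy is proposed to improve Nutch's web page update prediction. When a page's update history is insufficient, the strategy uses a MapReduce-based DBSCAN clustering algorithm to reduce the number of pages the crawler has to fetch, and takes the update cycle of a sample page as the update cycle of the other pages in its cluster. When ample update history is available, it models that history as a Poisson process to predict each page's update cycle more accurately. Finally, the improved strategy is tested on a Hadoop distributed platform. The experimental results show that the optimized web page update prediction method performs better.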
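To make the baseline concrete, the following is a minimal sketch of an adjacent-comparison interval rule: the re-fetch interval shrinks when the page differs from the previous fetch and grows when it does not. The function name, parameter names, and default values are illustrative assumptions, not taken from the paper or from Nutch's configuration; they only show why fixed, hand-set rates apply the same adjustment to every page.

```python
def next_fetch_interval(interval, changed, inc_rate=0.4, dec_rate=0.2,
                        min_interval=60.0, max_interval=365 * 24 * 3600.0):
    """Return the next re-fetch interval in seconds.

    All rates and bounds are hypothetical, hand-set values: the same
    numbers are applied to every page regardless of how it behaves.
    """
    if changed:
        # Page differs from the previous fetch: revisit sooner.
        interval *= (1.0 - dec_rate)
    else:
        # Page is unchanged: back off and revisit later.
        interval *= (1.0 + inc_rate)
    # Keep the interval inside the configured bounds.
    return max(min_interval, min(max_interval, interval))


if __name__ == "__main__":
    interval = 24 * 3600.0  # start with one day
    for changed in [True, False, False]:
        interval = next_fetch_interval(interval, changed)
        print(round(interval, 1))
```

Because `inc_rate` and `dec_rate` are global constants in this sketch, a news front page and a static archive page are adjusted at the same pace, which is the limitation the abstract describes.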

Keywords: Nutch; web page update prediction; density-based clustering algorithm; Poisson process; distributed programming
Received: 2015/3/27

Research and optimization of page updated forecast on Nutch
HU Wei and WU Haitao. Research and optimization of page updated forecast on Nutch[J]. Journal of Shanghai Normal University (Natural Sciences), 2016, 45(4): 448-457.
Authors: HU Wei and WU Haitao
Institution: College of Information, Mechanical and Electrical Engineering, Shanghai Normal University (both authors)
Abstract: Nutch predicts web page updates with an adjacent-comparison method whose update parameters must be set manually. The method cannot adjust itself and therefore cannot cope with the differing update behavior of a massive number of web pages. To address this problem, this paper proposes a dynamic selection strategy that improves Nutch's web page update prediction. When a page's update history is insufficient, the strategy uses a MapReduce-based DBSCAN clustering algorithm to reduce the number of pages the crawler must fetch, and the update cycle of a sample page is used as the update cycle of the other pages in the same cluster. When enough update history is available, the history is modeled as a Poisson process, which predicts each page's update cycle more accurately. Finally, the improved strategy is tested on a Hadoop distributed platform. The experimental results show that the optimized web page update prediction method performs better.
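The dynamic selection step can be illustrated with a short sketch. It assumes a page is polled at a fixed interval and that each poll records only whether a change was detected. Under a Poisson model with rate λ, the probability that a poll sees no change is exp(-λT), so λ can be estimated from the fraction of unchanged polls. The threshold `min_history`, the 0.5 smoothing term, and all function names are assumptions made for illustration; the abstract does not specify the estimator the authors used. The cluster-level period is taken as an input that the clustering step is assumed to supply.

```python
import math


def poisson_update_period(changes_detected, visits, visit_interval):
    """Estimate the mean time between updates under a Poisson model.

    Each visit observes "no change" with probability exp(-rate * T).
    The 0.5 terms are an assumed smoothing that keeps the estimate
    finite when every visit detected a change.
    """
    if visits <= 0 or not 0 <= changes_detected <= visits:
        raise ValueError("need 0 <= changes_detected <= visits and visits > 0")
    if changes_detected == 0:
        return math.inf  # no change ever observed
    unchanged_fraction = (visits - changes_detected + 0.5) / (visits + 0.5)
    rate = -math.log(unchanged_fraction) / visit_interval
    return 1.0 / rate


def choose_update_period(history, cluster_period, visit_interval=1.0,
                         min_history=10):
    """Pick the prediction source based on how much history exists.

    history        -- list of booleans, True if a change was detected
    cluster_period -- update period of the cluster's sample page
                      (assumed to come from the clustering step)
    min_history    -- hypothetical threshold for "enough" history
    """
    if len(history) < min_history:
        return cluster_period
    return poisson_update_period(sum(history), len(history), visit_interval)


if __name__ == "__main__":
    sparse = [True, False, True]
    rich = [True, False] * 5
    print(choose_update_period(sparse, cluster_period=7.0))
    print(choose_update_period(rich, cluster_period=7.0))
```

With sparse history the function falls back to the cluster's period. With richer history it returns a per-page estimate, mirroring the two branches of the strategy described in the abstract.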
Keywords: Nutch; web page updated prediction; DBSCAN; Poisson process; MapReduce
This article is indexed in CNKI and other databases.