
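Design and implementation of a focused crawler in a mobile Internet user behavior analysis system
Cite this article: DENG Bingguang, GUO Huilan, ZHANG Zhizhong. Design and implementation of a focused crawler in a mobile Internet user behavior analysis system[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2015, 27(3): 359-365.
Authors: DENG Bingguang  GUO Huilan  ZHANG Zhizhong
Affiliation: Communication Networks Testing Engineering Research Center, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
Funding: National High Technology Research and Development Program of China ("863" Program) (2014AA01A706); National Science and Technology Major Project (2014ZX03001027, 2012ZX03001021); Chongqing University Innovation Team (KJTD201312); Major Achievement Transformation Project of the Chongqing Municipal Education Commission (KJZH14103)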
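Abstract: In a mobile Internet user behavior analysis system, deep packet inspection (DPI) needs effective data matching in order to analyze user behavior in greater depth, identifying not only the type of service website visited but also the specific content accessed. To this end, a crawler module that crawls and refines features at the level of specific content was designed. Targeting specific service websites, and considering that general-purpose crawling places high demands on technology and storage while crawler systems built for a single industry obtain only limited data, a focused crawler module based on specific-page analysis was designed and implemented. The module follows a modular design and uses multithreading and multitasking to crawl information from specific service websites accurately and efficiently, providing data support for DPI matching. Testing shows that the module meets the expected requirements, offers strong maintainability, extensibility, and real-time performance, and satisfies the system's need for feature data extraction.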

Keywords: user behavior analysis  focused crawler  multithreading  regular expressions
Received: 2014/12/25
Revised: 2015/4/12

Design and implementation of focused crawler in mobile internet user behavior analysis system
DENG Bingguang, GUO Huilan and ZHANG Zhizhong. Design and implementation of focused crawler in mobile internet user behavior analysis system[J]. Journal of Chongqing University of Posts and Telecommunications, 2015, 27(3): 359-365.
Authors: DENG Bingguang, GUO Huilan and ZHANG Zhizhong
Institution: Communication Networks Testing Engineering Research Center, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
Abstract: In a mobile Internet user behavior analysis system, a crawler module that extracts and refines content-level features was designed so that deep packet inspection (DPI) can match data effectively and analyze user behavior in greater depth, identifying not only the type of business site but also the specific content visited. For specific business websites, considering that general-purpose crawling places high demands on technology and storage and that crawler systems built for a particular industry obtain only limited data, a focused crawler module based on specific-page analysis was designed and implemented. The module adopts a modular design, uses multithreading and multitasking to collect information from specific business websites accurately and efficiently, and provides data support for DPI matching. Testing shows that the crawler module meets the expected requirements, has strong maintainability, extensibility, and timeliness, and satisfies the feature data extraction needs of the mobile Internet user behavior analysis system.
Keywords:user behavior analysis  focused crawler  multithreading  regular expressions
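
The abstract names three ingredients: a crawl restricted to specific pages, regular-expression extraction, and multithreading. The sketch below is one way those pieces might fit together. It is not the authors' implementation. The URLs, the in-memory `PAGES` table standing in for HTTP fetches, the two patterns, and the function names are all hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory pages standing in for HTTP responses from one target site.
PAGES = {
    "http://video.example.com/v/1001": "<html><title>Episode 1 - Example Show</title></html>",
    "http://video.example.com/v/1002": "<html><title>Episode 2 - Example Show</title></html>",
    "http://video.example.com/about": "<html><title>About us</title></html>",
}

# Assumed page-specific rules: which URLs are content pages, and where the title sits.
CONTENT_URL = re.compile(r"/v/(\d+)$")
TITLE = re.compile(r"<title>(.*?)</title>", re.S)


def fetch(url):
    """Stand-in for a real HTTP fetch; returns the stored HTML."""
    return PAGES[url]


def extract_feature(url):
    """Return (content_id, title) for a content page, or None for any other page."""
    match = CONTENT_URL.search(url)
    if match is None:
        return None  # focused crawl: skip pages outside the target pattern
    title = TITLE.search(fetch(url))
    return (match.group(1), title.group(1).strip() if title else "")


def crawl(urls, workers=4):
    """Process URLs in a thread pool and map content id -> title."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(extract_feature, urls))
    return dict(r for r in results if r is not None)


features = crawl(sorted(PAGES))
```

Here `features` maps each content id found in a URL to that page's title. A table of this kind is the sort of content-level data a DPI matcher could look up when it sees the same URL in traffic. How the paper's module actually stores and delivers its features is not stated in the abstract.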
This article is indexed in CNKI, Wanfang Data, and other databases.