New focused crawling algorithm
Authors: Su Guiyang, Li Jianhua, Ma Yinghua, Li Shenghong, Song Juping
Institution: Department of Electronic Engineering, Shanghai Jiaotong University, Shanghai 200030, P. R. China
Funding: This project was supported by the National High Technology Research and Development Program of China ("863" Program) (2001AA142160, 2002AA145090).
Excerpt: 1. INTRODUCTION. The search engine is one of the most widely used information retrieval tools. According to CNNIC's "Survey Report on the Development of China's Internet", 84.6% of users find new websites through search engines. The Internet is growing at a remarkable pace. Google, the world's leading search engine, claims to have indexed about three billion (3,083,324,652) web pages, and the Internet is still expanding, with about 600 GB of content changed or added every month [1]. Such growth and flux impose basic limits of scale on today's generic crawlers and search en…


Su Guiyang, Li Jianhua, Ma Yinghua, Li Shenghong, Song Juping. New focused crawling algorithm[J]. Journal of Systems Engineering and Electronics, 2005, 16(1).
Abstract: Focused crawling is a new research direction for search engines. It restricts information retrieval to a specific topic area and provides search services within that area. The crawling search algorithm is a key technique of a focused crawler and directly affects search quality. This paper first introduces several traditional topic-specific crawling algorithms, then proposes a topic-specific crawling algorithm based on inverse links. Comparison experiments show that the algorithm achieves good recall, clearly better than the traditional Breadth-First and Shark-Search algorithms, and that it also achieves good precision.
Keywords: focused crawling; search engine; precision; recall
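
The abstract compares crawlers by recall and precision against a Breadth-First baseline. As a rough illustration of what those two measures capture, the sketch below runs a budget-limited breadth-first crawl over a toy link graph and scores the result. The graph, the set of relevant pages, the page budget, and the function names are all hypothetical, and the definitions used (precision = relevant pages crawled / pages crawled; recall = relevant pages crawled / all relevant pages) are the standard ones, which may differ in detail from those in the paper. The paper's inverse-link algorithm is not reproduced here.

```python
from collections import deque


def breadth_first_crawl(graph, seed, budget):
    """Visit pages in breadth-first order, stopping after `budget` pages."""
    visited = []
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < budget:
        page = queue.popleft()
        visited.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


def precision_recall(crawled, relevant):
    """Return (precision, recall) of a crawl against a set of relevant pages."""
    crawled = set(crawled)
    hits = crawled & relevant
    precision = len(hits) / len(crawled) if crawled else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy web graph: page -> outgoing links.
TOY_GRAPH = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": [],
    "d": ["f"],
    "e": [],
    "f": [],
}
# Hypothetical ground truth: pages that are on-topic.
RELEVANT = {"a", "c", "d", "f"}

crawled = breadth_first_crawl(TOY_GRAPH, "seed", budget=4)
precision, recall = precision_recall(crawled, RELEVANT)
print(crawled, precision, recall)
```

With a fixed page budget, a crawler that orders its frontier by topical relevance would aim to raise both figures relative to this baseline, since breadth-first order spends part of the budget on off-topic pages.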
This article is indexed in CNKI, Wanfang Data, and other databases.