
Using Hyperlink Information to Improve Crawler's Searching Strategy
Cite this article: HE Feng-ling (赫枫龄), ZUO Wan-li (左万利). Using Hyperlink Information to Improve Crawler's Searching Strategy[J]. Journal of Jilin University (Information Science Edition) (吉林大学学报(信息科学版)), 2005, 23(1): 59-63.
Authors: HE Feng-ling, ZUO Wan-li
Affiliation: College of Computer Science and Technology, Jilin University, Changchun 130012, China
Abstract: A Web crawler faces two problems when it crawls the Web: (1) because the amount of information on the Internet is enormous, a search engine cannot cover every Web page; (2) because hardware resources are limited, the number of pages the engine can store is also limited. A crawler that follows the traditional breadth-first search strategy treats all pages as equally important, so the pages it brings back are of low quality. To address this, the paper presents an algorithm that uses hyperlink information to improve the crawler's search strategy. The algorithm builds on traditional breadth-first search, takes full account of the hyperlink information between Web pages, and overcomes the blindness of breadth-first crawling. Experiments show that more than 50% of the pages crawled with this algorithm are relevant to a specific topic.
Keywords: crawler; internet search engine; breadth-first search; hyperlink
Article ID: 1671-5896(2005)01-0059-05
Revision received: December 17, 2003
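
The abstract contrasts plain breadth-first crawling with a crawl order guided by hyperlink information, but it does not state the scoring rule the authors use. The following is only a minimal sketch of that contrast, not the paper's algorithm. It assumes that the hyperlink signal is the number of in-links discovered so far, and it runs on a small in-memory graph instead of the live Web. The names `bfs_crawl`, `link_guided_crawl`, and `toy_graph` are hypothetical.

```python
import heapq
from collections import defaultdict, deque


def bfs_crawl(graph, seed, limit):
    """Baseline: visit pages in plain breadth-first order."""
    seen, order, queue = {seed}, [], deque([seed])
    while queue and len(order) < limit:
        page = queue.popleft()
        order.append(page)
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


def link_guided_crawl(graph, seed, limit):
    """Visit the discovered page with the most known in-links first.

    Assumption: in-link count stands in for the paper's hyperlink signal.
    """
    inlinks = defaultdict(int)   # in-links seen so far, per page
    visited, order = set(), []
    heap = [(0, 0, seed)]        # (-inlinks, discovery tick, page)
    tick = 1
    while heap and len(order) < limit:
        neg_score, _, page = heapq.heappop(heap)
        if page in visited or -neg_score != inlinks[page]:
            continue             # already crawled, or a stale heap entry
        visited.add(page)
        order.append(page)
        for target in graph.get(page, []):
            if target in visited:
                continue
            inlinks[target] += 1
            heapq.heappush(heap, (-inlinks[target], tick, target))
            tick += 1
    return order


# Toy link graph: page "d" is linked from three pages, "c" from one.
toy_graph = {
    "seed": ["a", "b", "c"],
    "a": ["d"],
    "b": ["d", "e"],
    "c": ["d"],
    "d": [],
    "e": [],
}

print(bfs_crawl(toy_graph, "seed", 4))          # ['seed', 'a', 'b', 'c']
print(link_guided_crawl(toy_graph, "seed", 4))  # ['seed', 'a', 'b', 'd']
```

With a budget of four pages, the breadth-first crawl spends its last slot on `c`, while the link-guided crawl reaches the more heavily linked page `d` first. This illustrates why a limited storage budget favors an ordering that is not blind to link structure.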
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.