
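An algorithm of duplicated web pages detection based on meta-search engine
Cite this article:ZHANG Yu-lian,WANG Sha-sha,SONG Gui-jiang.An algorithm of duplicated web pages detection based on meta-search engine[J].Journal of Yanshan University,2011,35(2):121-123,161.
Authors:ZHANG Yu-lian  WANG Sha-sha  SONG Gui-jiang
Affiliations:1. College of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004
2. Huanghua Ports Corporation, Shenhua Group, Cangzhou, Hebei 061110
Abstract:To address the problem of duplicate web pages in meta-search results, a web page deduplication algorithm based on meta-search is proposed, and its effectiveness is verified through experiments. The algorithm first compares the URLs of the result pages returned by each member search engine. It then processes the title of each result page to extract the page's topic information. Finally, it segments the result summaries into words and computes the similarity between summaries. Combining these three checks detects duplicate pages well and so removes them. The experiments show that the algorithm is effective, has a clear advantage over earlier algorithms, and comes closer to the results of manual counting.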

Keywords:meta-search  web pages  deduplication  word segmentation
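
The three checks described in the abstract (URL comparison, title processing, summary similarity) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the normalization rules, the segmenter, the similarity measure, or how the three signals are combined. Here character bigrams stand in for word segmentation, Jaccard overlap stands in for the similarity measure, and the function names, the `url`/`title`/`snippet` record fields, and the 0.6 threshold are all hypothetical.

```python
import re
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to host + path + query (assumed normalization rules)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def normalize_title(title):
    """Keep the text before a site-name separator, then drop punctuation.

    The separators (" - ", " | ", "_") are a hypothetical heuristic.
    """
    head = re.split(r"\s[-|]\s|_", title.strip(), maxsplit=1)[0]
    return re.sub(r"[\W_]+", "", head).lower()


def tokens(text):
    """Stand-in for word segmentation: overlapping character bigrams."""
    s = re.sub(r"[\W_]+", "", text).lower()
    bigrams = {s[i:i + 2] for i in range(len(s) - 1)}
    return bigrams or ({s} if s else set())


def snippet_similarity(a, b):
    """Jaccard overlap of the two token sets (assumed similarity measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_duplicate(r1, r2, threshold=0.6):
    """Assumed combination: same URL, or same title and similar summary."""
    if normalize_url(r1["url"]) == normalize_url(r2["url"]):
        return True
    same_title = normalize_title(r1["title"]) == normalize_title(r2["title"])
    similar = snippet_similarity(r1["snippet"], r2["snippet"]) >= threshold
    return same_title and similar


def deduplicate(results, threshold=0.6):
    """Keep the first occurrence of each page across merged result lists."""
    kept = []
    for r in results:
        if not any(is_duplicate(r, k, threshold) for k in kept):
            kept.append(r)
    return kept
```

Calling `deduplicate` on the merged result lists of the member engines, with each result given as a dict holding `url`, `title`, and `snippet`, returns the list with later duplicates dropped. A real system for Chinese text would likely replace `tokens` with a proper word segmenter, as the keywords suggest.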

An algorithm of duplicated web pages detection based on meta-search engine
ZHANG Yu-lian,WANG Sha-sha,SONG Gui-jiang.An algorithm of duplicated web pages detection based on meta-search engine[J].Journal of Yanshan University,2011,35(2):121-123,161.
Authors:ZHANG Yu-lian  WANG Sha-sha  SONG Gui-jiang
Institution:ZHANG Yu-lian (1), WANG Sha-sha (1), SONG Gui-jiang (2) (1. College of Information Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China; 2. Huanghua Ports Corporation, Shenhua Group, Cangzhou, Hebei 061110, China)
Abstract:To address the duplicated web pages returned by a meta-search engine, an algorithm for removing duplicated web pages based on the meta-search engine is proposed. The effectiveness of the algorithm is verified through experiments. Firstly, the URLs of the result web pages returned by the individual search engines are compared. Secondly, the titles of the result web pages are processed, and the thematic information of the pages is extracted. Finally, word segmentation is carried out on the summary, and the similarity of the summary i...
Keywords:meta-search engine  web pages  duplicate detection  Chinese word segmentation  
This article is indexed in CNKI, Wanfang Data, and other databases.