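Web Page Deduplication Strategy
Cite this article: GAO Kai, WANG Yong-cheng, XIAO Jun. Web Page Deduplication Strategy[J]. Journal of Shanghai Jiaotong University, 2006, 40(5): 775-777, 782
Authors: GAO Kai, WANG Yong-cheng, XIAO Jun
Affiliations: Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai 200030; Shanghai Distance Education Group, Shanghai 200086
Funding: National High Technology Research and Development Program of China (863 Program), Grant No. 2002AA119050
Abstract: This paper proposes a strategy that combines same-source page deduplication with content deduplication. Pages from the same source are deduplicated by hashing their URLs, and pages with identical or near-identical content are judged using subject concepts. Experiments show that the method is effective and deduplicates well. The algorithm was used to build an education-news search engine for an educational resource repository.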
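
The URL-hashing step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's hash function or its URL normalization rules, so everything here is an assumption made for illustration: SHA-1 stands in for the hash, and the names `normalize_url`, `url_fingerprint`, and `SeenUrls` are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Assumed normalization: lowercase scheme and host, drop the fragment,
    and treat an empty path as '/'."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def url_fingerprint(url: str) -> str:
    """Hash the normalized URL. SHA-1 is a stand-in, not the paper's function."""
    return hashlib.sha1(normalize_url(url).encode("utf-8")).hexdigest()


class SeenUrls:
    """Set of URL fingerprints used to skip pages that were already fetched."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add_if_new(self, url: str) -> bool:
        """Return True if the URL is new (and record it), False if seen before."""
        fp = url_fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True
```

A crawler would call `add_if_new` before queuing each link. Storing fixed-length fingerprints rather than full URLs keeps the lookup table compact, which is one common reason to hash URLs in the first place.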

Keywords: information retrieval; search engine; hash function; subject concept
Article ID: 1006-2467(2006)05-0775-03
Received: 2005-06-25
Revised: 2005-06-25

The Strategy on Processing Replicated Web Collections
GAO Kai,WANG Yong-cheng,XIAO Jun. The Strategy on Processing Replicated Web Collections[J]. Journal of Shanghai Jiaotong University, 2006, 40(5): 775-777,782
Authors: GAO Kai, WANG Yong-cheng, XIAO Jun
Affiliation: 1. Dept. of Computer Science and Eng., Shanghai Jiaotong Univ., Shanghai 200030, China; 2. Shanghai Distance Education Group, Shanghai 200086
Abstract: This paper presents techniques for building an effective crawler that collects non-replicated Web pages. A novel hash function is proposed, together with a content-oriented approach, to filter pages by URL and by content. On the one hand, the technique parallelizes the crawling process while effectively minimizing overlap. On the other hand, it identifies near-duplicate collections. Experimental results show the feasibility of the approach. Finally, an educational search engine implemented on the basis of this work is presented.
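
The two claims in the abstract, low-overlap parallel crawling and near-duplicate detection, can be sketched as follows. This is only an illustration under stated assumptions: the abstract gives neither the hash function nor the subject-concept method, so SHA-1 host partitioning and a Jaccard overlap of the most frequent terms are swapped in as simple stand-ins. The names `assign_worker`, `top_terms`, and `is_near_duplicate`, and the parameters `k` and `threshold`, are hypothetical.

```python
import hashlib
import re
from collections import Counter
from urllib.parse import urlsplit


def assign_worker(url: str, num_workers: int) -> int:
    """Map a URL's host to one worker index, so that each host is crawled
    by exactly one worker and workers do not overlap."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def top_terms(text: str, k: int = 10) -> frozenset:
    """Crude stand-in for subject concepts: the k most frequent word tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return frozenset(word for word, _ in Counter(words).most_common(k))


def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two pages as near-duplicates when the Jaccard overlap of their
    top terms reaches the (assumed) threshold."""
    ta, tb = top_terms(a), top_terms(b)
    if not ta and not tb:
        return True  # two empty pages are trivially identical
    return len(ta & tb) / len(ta | tb) >= threshold
```

Partitioning by host rather than by full URL keeps all pages of one site on the same worker, so a per-worker seen-URL table is enough to avoid refetching. The term-overlap check ignores stop words and word order, so a real system would need a more careful concept extraction step.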
Keywords: information retrieval; search engine; hash function; subject concept
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.