Web Page Deduplication Strategy (The Strategy on Processing Replicated Web Collections)

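Cite this article:	GAO Kai, WANG Yong-cheng, XIAO Jun. Web page deduplication strategy[J]. Journal of Shanghai Jiaotong University, 2006, 40(5): 775-777, 782

Authors:	GAO Kai, WANG Yong-cheng, XIAO Jun

Affiliations:	Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai 200030; Shanghai Distance Education Group, Shanghai 200086

Funding:	National High Technology Research and Development Program of China (863 Program), project 2002AA119050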

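Abstract:	A strategy is proposed that combines deduplication of same-source pages with deduplication by content. Same-source pages are deduplicated by hashing their URLs, and pages whose content is identical or similar are judged duplicates on the basis of their subject concepts. Experiments show that the method is effective and deduplicates well. An education-news search engine for an educational resource repository was implemented on the basis of these algorithms.
Keywords:	information retrieval; search engine; hash function; subject concept
Article ID:	1006-2467(2006)05-0775-03
Received:	2005-06-25
Revised:	2005-06-25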
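
The URL-hashing step named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's hash function or any normalization rules, so everything here is an assumption: MD5 stands in for the hash, and `normalize_url`, `url_fingerprint`, and `SeenUrls` are hypothetical names introduced only for this example.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce trivially different spellings of a URL to one canonical form.

    Assumed rules (not from the paper): lowercase scheme and host,
    drop default ports, drop the fragment, use "/" for an empty path.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment is never sent to the server, so it is discarded.
    return urlunsplit((scheme, host, path, parts.query, ""))


def url_fingerprint(url: str) -> str:
    """Hash the normalized URL; MD5 is an assumed stand-in hash."""
    return hashlib.md5(normalize_url(url).encode("utf-8")).hexdigest()


class SeenUrls:
    """Set of URL fingerprints; add() reports whether the URL is new."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add(self, url: str) -> bool:
        fp = url_fingerprint(url)
        if fp in self._seen:
            return False  # same-source page already recorded
        self._seen.add(fp)
        return True


seen = SeenUrls()
print(seen.add("HTTP://Example.com:80/a#top"))  # first sighting
print(seen.add("http://example.com/a"))         # same page, different spelling
```

Storing fixed-length fingerprints rather than full URLs keeps memory use per entry constant, which is one common reason crawlers hash URLs before the membership test.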
The Strategy on Processing Replicated Web Collections

GAO Kai, WANG Yong-cheng, XIAO Jun. The Strategy on Processing Replicated Web Collections[J]. Journal of Shanghai Jiaotong University, 2006, 40(5): 775-777, 782

Authors:	GAO Kai, WANG Yong-cheng, XIAO Jun

Affiliation:	1. Dept. of Computer Science and Eng., Shanghai Jiaotong Univ., Shanghai 200030, China; 2. Shanghai Distance Education Group, Shanghai 200086

Abstract:	This paper presents techniques for building an effective crawler that collects non-replicated Web pages. A novel hash function is proposed, together with a content-oriented approach, to filter pages by URL and by content. On the one hand, the technique allows the crawling process to be parallelized while effectively minimizing overlap. On the other hand, it can identify near-duplicate collections. Experimental results show the feasibility of the approach. On the basis of this work, the implementation of an educational search engine is presented.

Keywords:	information retrieval; search engine; hash function; subject concept
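
The two ideas in the English abstract, parallel crawling with minimal overlap and near-duplicate detection, can be sketched roughly as follows. Neither the paper's hash function nor its subject-concept method is described in the abstract, so this is only an illustration under stated assumptions: URLs are assigned to workers by an MD5 hash of the host, and "subject concepts" are approximated by the most frequent words of a page compared with Jaccard similarity. The names `assign_worker`, `top_terms`, and `is_near_duplicate`, the threshold of 0.8, and the restriction to ASCII-letter words are all hypothetical choices.

```python
import hashlib
import re
from collections import Counter


def assign_worker(url: str, n_workers: int) -> int:
    """Map a URL to one crawler worker by hashing its host.

    Every URL of a host goes to the same worker, so no two workers
    fetch the same page (assumed scheme, not the paper's).
    """
    host = re.sub(r"^[a-z]+://", "", url.lower()).split("/", 1)[0]
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_workers


def top_terms(text: str, k: int = 10) -> frozenset[str]:
    """Crude topic signature: the k most frequent words of 3+ letters."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    return frozenset(w for w, _ in Counter(words).most_common(k))


def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Compare two pages by Jaccard similarity of their topic signatures."""
    sa, sb = top_terms(a), top_terms(b)
    if not sa and not sb:
        return True  # two pages with no usable words are treated as equal
    return len(sa & sb) / len(sa | sb) >= threshold


page = "web crawler removes duplicate pages from the index"
print(is_near_duplicate(page, page + " today"))
print(is_near_duplicate(page, "cooking recipes for pasta and tomato sauce"))
print(assign_worker("http://example.com/a", 4) == assign_worker("http://EXAMPLE.com/b", 4))
```

A real system for Chinese pages would need word segmentation and a better concept model than raw word frequency; the sketch only shows where such a comparison sits in the pipeline.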
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
