
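Deduplicating cloud disk resource search results based on meta-information
Cite this article: LIU Chi, YAN Hong-fei. Deduplicating cloud disk resource search results based on meta-information [J]. Journal of Shandong University (Natural Science) [山东大学学报(理学版)], 2016, 51(7): 11-17.
Authors: LIU Chi (刘驰), YAN Hong-fei (闫宏飞)
Affiliation: Institute of Network Computing and Information Systems, Peking University, Beijing 100871
Funding: National Basic Research Program of China (973 Program) (2014CB340400); National Natural Science Foundation of China (61272340, 61472013)
Abstract: Unlike traditional deduplication methods, which compute the text similarity of web pages, cloud disk resources consist mainly of multimedia data files and offer only quite limited meta-information for deduplicating search results. To address this problem, and building on a search engine system constructed for cloud disk resource data, the authors analyze the characteristics of cloud disk resource meta-information and find that, besides the name, the file extension, the storage size, and the user ownership of a resource are effective features for identifying duplicate records. On this basis, they give normalization methods for these features and then apply an unsupervised method for deduplication. Experimental results show that the method can effectively deduplicate cloud disk resource search results.

Keywords: search engine; cloud disk resources; meta-information; deduplication
Received: 2015-11-14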

Deduplicating search results of cloud disk resources using meta-information
LIU Chi,YAN Hong-fei.Deduplicating search results of cloud disk resources using meta-information[J].Journal of Shandong University,2016,51(7):11-17.
Authors:LIU Chi  YAN Hong-fei
Institution:Institute of Network Computing and Information Systems, Peking University, Beijing 100871, China
Abstract:Unlike classical duplicate detection methods, which calculate the text similarity of web pages, cloud disk resources are mostly multimedia files and offer only limited meta-information for deduplicating search results. This work builds on a newly established search engine for cloud disk resources. An analysis of the characteristics of cloud disk resource meta-information shows that, besides resource names, the file extension, size, and ownership are significant features for detecting duplicate records. Based on this finding, the paper proposes a feature normalization method and applies an unsupervised method to the task. Experiments show that the method deduplicates cloud disk resource search results effectively.
Keywords:search engine  deduplicate  meta-information  cloud disk resources  
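
The abstract names four meta-information features (name, file extension, size, ownership), a normalization step, and an unsupervised deduplication step, but it does not specify how any of them is computed. The Python sketch below is only one possible reading, not the authors' method: the `Record` fields, the normalization rules, the feature weights, the `0.85` threshold, and the greedy filtering strategy are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Tuple


@dataclass(frozen=True)
class Record:
    """One search result (hypothetical schema, not taken from the paper)."""
    name: str   # resource name as shown in the result list
    size: int   # occupied space in bytes
    owner: str  # account the resource belongs to


def normalize_name(name: str) -> Tuple[str, str]:
    """Split a file name into a normalized (stem, extension) pair."""
    name = name.strip().lower()
    stem, dot, ext = name.rpartition(".")
    if not dot or not stem:
        # No extension at all, or a leading-dot name such as ".bashrc".
        stem, ext = name, ""
    # Collapse punctuation, underscores, and whitespace into single spaces.
    stem = re.sub(r"[\W_]+", " ", stem).strip()
    return stem, ext


def similarity(a: Record, b: Record) -> float:
    """Weighted similarity in [0, 1]; the weights are illustrative guesses."""
    stem_a, ext_a = normalize_name(a.name)
    stem_b, ext_b = normalize_name(b.name)
    name_sim = SequenceMatcher(None, stem_a, stem_b).ratio()
    ext_sim = 1.0 if ext_a == ext_b else 0.0
    largest = max(a.size, b.size)
    size_sim = min(a.size, b.size) / largest if largest else 1.0
    owner_sim = 1.0 if a.owner == b.owner else 0.0
    return 0.4 * name_sim + 0.2 * ext_sim + 0.3 * size_sim + 0.1 * owner_sim


def deduplicate(results: List[Record], threshold: float = 0.85) -> List[Record]:
    """Keep a result only if it is not too similar to any result kept so far.

    No labeled duplicate pairs are needed, so the procedure is unsupervised.
    Results are assumed to arrive in ranking order, so the highest-ranked
    member of each duplicate group is the one that survives.
    """
    kept: List[Record] = []
    for rec in results:
        if all(similarity(rec, k) < threshold for k in kept):
            kept.append(rec)
    return kept


if __name__ == "__main__":
    results = [
        Record("Holiday_Video.MP4", 1_000_000, "alice"),
        Record("holiday video.mp4", 1_000_000, "bob"),
        Record("lecture notes.pdf", 20_480, "alice"),
    ]
    for rec in deduplicate(results):
        print(rec)
```

In this sketch the second record is dropped because its normalized name, extension, and size match the first one, even though the owner differs. The greedy pass compares each result with every kept result, which is quadratic in the worst case; that is usually acceptable for a single page of search results but would need blocking or hashing for larger sets.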