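关系型数据库数据的高效判重 (Efficient Duplicate Detection for Data in Relational Databases)
Cite this article: 李恒新, 韩坚华*. 关系型数据库数据的高效判重[J]. 华南师范大学学报(自然科学版), 2015, 47(1): 121-126. DOI: 10.6054/j.jscnun.2014.11.004
Authors: 李恒新 (Li Hengxin)  韩坚华 (Han Jianhua)
Affiliation: 1. School of Computer Science, Guangdong University of Technology (广东工业大学计算机学院), Guangzhou 510006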
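Abstract (translated from the Chinese): The Simhash algorithm is improved by using the CityHash function to generate fingerprint feature values for the data, which are then used to detect duplicates. Experiments on real petition-handling data from a district government in Guangzhou show higher recall and precision than other algorithms. An index classification method is also proposed to speed up one-pass similarity detection over the full data set; with the fingerprint values stored in MongoDB, it supports efficient duplicate detection for incremental data. By improving the whole duplicate-detection process, the approach is efficient and practical, and it provides a reference for handling cases systematically and for identifying repeated cases.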

Keywords: Simhash  CityHash  MongoDB  fingerprint feature value  similarity detection
Received: 2014-09-19

Efficient Duplicate Detection for Data in Relational Databases
Li Hengxin, Han Jianhua*. Efficient Duplicate Detection for Data in Relational Databases[J]. Journal of South China Normal University (Natural Science Edition), 2015, 47(1): 121-126. DOI: 10.6054/j.jscnun.2014.11.004
Authors: Li Hengxin  Han Jianhua
Affiliation: 1. School of Computer Science, Guangdong University of Technology, Guangzhou 510006, China
Abstract: As the volume of data in traditional relational databases grows, similar records become increasingly likely. The Simhash algorithm is improved by using the CityHash function to generate fingerprint feature values, which are then used to detect duplicate data. Tests on real petition data from a district government in Guangzhou show higher recall and precision than other algorithms. In addition, an index classification method is presented to speed up similarity detection over the entire data set. With the fingerprint values stored in MongoDB, the method also supports efficient duplicate detection for incremental data. Together, these changes improve the whole similarity-detection process and provide a reference for handling cases systematically and for identifying repeated cases.
Keywords: Simhash  CityHash  MongoDB  fingerprint characteristic value  similarity detection
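
The abstract names the building blocks (Simhash fingerprints, a hash function for features, an index that narrows the comparison set) but gives no implementation details. The following is a minimal sketch of the general technique, not the authors' method. Several pieces are assumptions made for illustration: BLAKE2b from the Python standard library stands in for CityHash so the block is self-contained, character bigrams stand in for the paper's unstated feature extraction, an in-memory dictionary stands in for MongoDB, and the block-partition index (a well-known pigeonhole scheme for Simhash) stands in for the paper's index classification method, which the abstract does not describe. The names `hash64`, `tokens`, and `FingerprintIndex` are hypothetical.

```python
import hashlib
from collections import defaultdict

BITS = 64  # fingerprint width


def hash64(token: str) -> int:
    """64-bit hash of a token. Assumption: BLAKE2b stands in for CityHash64."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def tokens(text: str, n: int = 2) -> list:
    """Character n-grams with whitespace removed (hypothetical feature extraction)."""
    text = "".join(text.split())
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text: str) -> int:
    """Classic Simhash: each token votes +1/-1 per bit; keep the sign of each sum."""
    weights = [0] * BITS
    for tok in tokens(text):
        h = hash64(tok)
        for bit in range(BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class FingerprintIndex:
    """Block-partition index: fingerprints within max_distance bits of each
    other must agree exactly on at least one of max_distance + 1 blocks."""

    def __init__(self, max_distance: int = 3):
        self.max_distance = max_distance
        self.blocks = max_distance + 1
        self.width = BITS // self.blocks
        self.buckets = defaultdict(set)   # (block number, block value) -> record ids
        self.fingerprints = {}            # record id -> fingerprint

    def _keys(self, fp: int):
        mask = (1 << self.width) - 1
        return [(i, (fp >> (i * self.width)) & mask) for i in range(self.blocks)]

    def add(self, record_id, fp: int) -> None:
        self.fingerprints[record_id] = fp
        for key in self._keys(fp):
            self.buckets[key].add(record_id)

    def query(self, fp: int) -> list:
        """Ids of stored records whose fingerprint is within max_distance bits."""
        candidates = set()
        for key in self._keys(fp):
            candidates |= self.buckets[key]
        return sorted(
            rid for rid in candidates
            if hamming(fp, self.fingerprints[rid]) <= self.max_distance
        )
```

Under these assumptions, an incremental record would be handled by computing `simhash(text)`, calling `query` to find near-duplicates among stored fingerprints, and then calling `add`. Only records sharing a block with the new fingerprint are compared, so the full table is not rescanned.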
This article is indexed in CNKI, Wanfang Data (万方数据), and other databases.