
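Research on incremental grouping based on similarity transfer in entity resolution
Cite this article: GAO Guangshang. Research on incremental grouping based on similarity transfer in entity resolution[J]. Systems Engineering - Theory & Practice (系统工程理论与实践), 2019, 39(5): 1287-1297.
Author: GAO Guangshang (高广尚)
Affiliation: 1. Research Center for Modern Enterprise Management, Guilin University of Technology, Guilin 541004; 2. School of Management, Guilin University of Technology, Guilin 541004
Funding: National Natural Science Foundation of China (71761008); Fund of the Key Research Bases of Humanities and Social Sciences in Guangxi Universities (16YB010)
Abstract: This paper examines an incremental record grouping method, based on similarity transfer, that is suited to large data sets. It first analyzes how the similarity between records can be derived step by step, then describes how to build reference groups from a sorting key, how to update the reference groups incrementally through similarity transfer, and how to implement those incremental updates with a union-find structure. Experiments then verify the feasibility and efficiency of the method, and the results show that it improves both grouping quality and grouping efficiency relative to traditional methods. The paper does not study in detail the data quality problems present in the attribute values themselves, and it does not design an algorithm for generating sorting keys. The method can help address missed record matches in data cleaning, information integration, and information management, and because its algorithms are designed purely from a data-processing perspective, it also offers good scalability, reusability, and independence from any particular domain.

Keywords: sorting key; similarity transfer; union-find; entity resolution; data quality
Received: 2018-01-09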

Research on incremental grouping based on transferred similarity in entity resolution
GAO Guangshang. Research on incremental grouping based on transferred similarity in entity resolution[J]. Systems Engineering - Theory & Practice, 2019, 39(5): 1287-1297.
Author: GAO Guangshang
Institution: 1. Research Center for Modern Enterprise Management, Guilin University of Technology, Guilin 541004, China; 2. School of Management, Guilin University of Technology, Guilin 541004, China
Abstract: This paper investigates an incremental record grouping approach, based on transferred similarity, that is suited to large data sets. It first analyzes how the similarity between records can be derived step by step, then proposes how to construct reference groups from a sorting key, how to update those groups incrementally through transferred similarity, and how to implement the incremental updates with a union-find structure. Experiments verify the feasibility and efficiency of the method, and the results show that it improves both grouping quality and grouping efficiency compared with traditional methods. The paper does not analyze in detail the data quality problems in the attribute values themselves, and it does not design an algorithm for generating sorting keys. The proposed method can help solve the problem of missed record matches in data cleaning, information integration, and information management. Because its algorithms are designed purely from a data-processing perspective, it also offers good scalability, reusability, and domain independence.
Keywords: sorting key; transferred similarity; union-find; entity resolution; data quality
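
The abstract names three ingredients: a sorting key, transferred similarity, and union-find. The following Python sketch shows one way they could fit together. It is an illustration under stated assumptions, not the paper's algorithm: the `sorting_key` function, the `difflib`-based `similar` test, the 0.85 threshold, and the neighbor window of 2 are all hypothetical choices made for this example.

```python
import bisect
from difflib import SequenceMatcher


class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def add(self, x):
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1

    def find(self, x):
        # Locate the root, then point every node on the path at it.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def sorting_key(record):
    # Hypothetical key: lowercase with whitespace removed.
    return "".join(record.lower().split())


def similar(a, b, threshold=0.85):
    # Assumed similarity test; the paper's own measure is not given in the abstract.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def add_records(uf, ordered, new_records, window=2):
    """Insert records one by one, comparing each only with its sort-order neighbors."""
    for rec in new_records:
        uf.add(rec)
        entry = (sorting_key(rec), rec)
        pos = bisect.bisect_left(ordered, entry)
        lo, hi = max(0, pos - window), min(len(ordered), pos + window)
        for _, other in ordered[lo:hi]:
            # Skip the comparison when the two records are already grouped:
            # membership in the same set stands in for a direct match.
            if uf.find(rec) != uf.find(other) and similar(rec, other):
                uf.union(rec, other)
        ordered.insert(pos, entry)


def groups(uf):
    out = {}
    for rec in uf.parent:
        out.setdefault(uf.find(rec), set()).add(rec)
    return list(out.values())


uf, ordered = UnionFind(), []
add_records(uf, ordered, ["John Smith", "Mary Jones"])  # initial batch
add_records(uf, ordered, ["Jon Smith"])                 # first increment
add_records(uf, ordered, ["Jon Smyth"])                 # second increment
print(groups(uf))
```

In this toy run, "John Smith" and "Jon Smyth" fall just below the assumed threshold when compared directly, yet they end up in the same group because each matches "Jon Smith". That is the transferred-similarity effect in miniature, and the union-find structure lets each increment merge groups without regrouping the earlier records.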
This article is indexed in CNKI and other databases.