

Redundancy Elimination in Multi-signature Based Parallel Entity Resolution
Abstract: The multi-signature method can improve the accuracy of entity resolution. However, it introduces redundant computation when used in a parallel processing framework. In this paper, a multi-signature based parallel entity resolution method called multi-sig-er is proposed. The method is implemented in a MapReduce-based framework: it first tags each input object with multiple signatures and uses these signatures to generate key-value pairs, then shuffles the pairs to the reduce tasks, which are responsible for similarity computation. To improve performance, two strategies are adopted. One prunes the candidate pairs produced by the blocking technique, and the other eliminates redundant comparisons by exploiting the transitive property of matches. Both strategies reduce the number of similarity computations without affecting resolution accuracy. Experimental results on real-world datasets show that the method is better suited to large datasets than to small ones, and to complex similarity computation than to simple similarity matching.
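The pipeline described in the abstract can be illustrated with a small single-process sketch. It is not the multi-sig-er implementation: the signature function (three-character token prefixes), the Jaccard similarity, the threshold, and all names such as `resolve` are assumptions chosen for illustration, and a shared in-memory set stands in for whatever pruning rule the paper applies across reduce tasks.

```python
from collections import defaultdict
from itertools import combinations


def signatures(record):
    """Hypothetical signature function: 3-character prefix of each token."""
    return {tok[:3].lower() for tok in record.split()}


def jaccard(a, b):
    """Token-level Jaccard similarity (an assumed similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


class UnionFind:
    """Tracks which records are already known to be the same entity."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def resolve(records, threshold=0.5):
    # "Map" step: emit one (signature, record id) pair per signature.
    blocks = defaultdict(list)
    for rid, rec in records.items():
        for sig in signatures(rec):
            blocks[sig].append(rid)

    uf = UnionFind()
    seen = set()       # pairs already generated by another signature
    comparisons = 0    # number of similarity computations performed

    # "Reduce" step: each block compares the records that share its signature.
    for sig in sorted(blocks):
        for a, b in combinations(sorted(blocks[sig]), 2):
            if (a, b) in seen:
                continue   # strategy 1: prune duplicate candidate pairs
            seen.add((a, b))
            if uf.find(a) == uf.find(b):
                continue   # strategy 2: already matched via transitivity
            comparisons += 1
            if jaccard(records[a], records[b]) >= threshold:
                uf.union(a, b)

    clusters = defaultdict(set)
    for rid in records:
        clusters[uf.find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values()), comparisons
```

Calling `resolve` with a dictionary that maps record ids to strings returns the resolved clusters together with the number of similarity computations actually carried out. Because a skipped pair is either a duplicate or already in the same cluster, neither shortcut changes the final clustering under transitive closure. In a real MapReduce job the two shared structures would need a distributed equivalent, which this sketch does not attempt.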
Keywords:
This article is indexed in CNKI and other databases.