首页 | 本学科首页   官方微博 | 高级检索  
     检索      

一种新的基于对称性的字符串相似性处理算法
引用本文:王燕,周军锋,汤显,陈子阳,郭景峰.一种新的基于对称性的字符串相似性处理算法[J].燕山大学学报,2014(1):49-56.
作者姓名:王燕  周军锋  汤显  陈子阳  郭景峰
作者单位:[1]燕山大学信息科学与工程学院,河北秦皇岛066004 [2]燕山大学经济管理学院,河北秦皇岛066004
基金项目:基金项目:国家自然科学基金资助项目(61073060,61040023);河北省重点基础研究项目(10963527D);河北省科学技术研究与发展计划科技支撑计划项目(11213578)
摘    要:对于给定的两个字符串集合,基于相似度的连接操作可用于从中找出相似的字符串对,该操作是数据清洗、数据集成以及协同过滤等应用中的核心操作之一,其执行效率直接影响系统的整体性能。本文提出一种高效计算字符串集合间连接操作的算法Trie-TSS,该方法基于trie树进行处理,利用对称性来减少冗余计算。提出一种旨在减少冗余编辑距离计算操作的优化技术来进一步提升系统性能。最后通过实验验证了Trie-TSS算法的高效性。

关 键 词:字符串相似性  trie树  编辑距离  Trie-TSS  优化技术

A new algorithm towards efficient processing of string similarity based on symmetry
WANG Yan,ZHOU Jun-feng,TANG Xian,CHEN Zi-yang,GUO Jing-feng.A new algorithm towards efficient processing of string similarity based on symmetry[J].Journal of Yanshan University,2014(1):49-56.
Authors:WANG Yan  ZHOU Jun-feng  TANG Xian  CHEN Zi-yang  GUO Jing-feng
Institution:College of Information Science and Engineering, Yanshan University, Qinhuangdao, Hebei 066004, China; 2. College of Economics and Management, Yanshan University, Qinhuangdao, Hebei 066004, China)
Abstract:For two given sets of strings, join operation is used to find similar string pairs based on string similarity. It is one of the essential operations in many applications, such as data integration, data cleaning, and collaborative filtering. A new trie-based al- gorithm, namely Trie-TSS, which uses the symmetry of edit distance to reduce redundant computation, is proposed. Then a new pruning technique is suggested to further reduce the unnecessary computation so as to improve the overall performance. The ex-perimental results show the efficiency of our method according to various metrics.
Keywords:string similarity  trie  edit distance  Trie-TSS  pruning technique
本文献已被 CNKI 维普 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号