首页 | 本学科首页   官方微博 | 高级检索  
     检索      

Dtrie-allpair:高效的集合T-覆盖连接算法
引用本文:贾连印,奚建清,李孟娟,游进国,刘勇,苗德成.Dtrie-allpair:高效的集合T-覆盖连接算法[J].华南理工大学学报(自然科学版),2012,40(6):109-117.
作者姓名:贾连印  奚建清  李孟娟  游进国  刘勇  苗德成
作者单位:1. 华南理工大学计算机科学与工程学院,广东广州,510006
2. 云南师范大学图书馆,云南昆明,650092
3. 昆明理工大学信息工程与自动化学院,云南昆明,650500
基金项目:广东省科技计划项目,云南省应用基础研究项目
摘    要:传统的T-覆盖连接算法会因生成的候选集庞大而导致系统性能降低,为此,文中提出了一种基于trie的动态索引结构——DTI结构,并构建了基于该结构的相似度连接算法——Dtrie-allpair算法.通过该算法可以直接得到allpair连接的结果,不产生任何候选集,有效解决了高候选集产生的问题,克服了传统算法因生成并验证候选集而带来的开销.文中还研究了数据库中记录的顺序及记录中元素顺序对Dtrie-allpair算法性能的影响,并在msweb、msnbc两个数据集下对Dtrie-allpair算法与All-pair、PPJoin算法进行对比.结果表明:Dtrie-allpair算法具有明显的优势,覆盖阈值较小时优势更明显;对msweb数据集,阈值为2时,Dtrie-allpair算法的效率相对于All-pair、PPJoin算法提高近两个数量级;通过对数据集进行频率降序和长度升序组合预处理可大幅降低Dtrie-allpair算法访问的trie结点数量,从而显著提升性能.

关 键 词:集合相似度  T-覆盖连接  覆盖阈值  基于trie的动态索引  All-pair算法  PP-Join算法  频率降序  长度升序

Dtrie-allpair:an Efficient Set T-Overlap Join Algorithm
Jin Lian-Yin , Xi Jian-Qing , Li Meng-Juan , You Jin-Guo , Liu Yong , Miao De-Cheng.Dtrie-allpair:an Efficient Set T-Overlap Join Algorithm[J].Journal of South China University of Technology(Natural Science Edition),2012,40(6):109-117.
Authors:Jin Lian-Yin  Xi Jian-Qing  Li Meng-Juan  You Jin-Guo  Liu Yong  Miao De-Cheng
Institution:1(1.School of Computer Science and Engineering,South China University of Technology,Guangzhou 510006,Guangdong,China; 2.Library of Yunnan Normal University,Kunming 650092,Yunnan,China;3.Faculty of Information Engineering and Automation, Kunming University of Science and Technology,Kunming 650500,Yunnan,China)
Abstract:As the traditional T-overlap join algorithms generate a huge number of candidates and thus degrade the system performance inevitably,a dynamic trie-based index,DTI,is designed.Based on DTI,a novel similarity join algorithm named Dtrie-allpair is proposed.Dtrie-allpair helps to directly obtain join results without generating candidates and avoids additional overhead.Then,the effects of the order of records in the database and the order of elements in the records on the performance of Dtrie-allpair are investigated.Some experiments are carried out on msweb and msnbc to compare Dtrie-allpair with such algorithms as All-pair and PPJoin.The results show that(1) Dtrie-allpair obviously outperforms All-pair and PPJoin,especially at low overlap thresholds;(2) at an overlap threshold of 2,the efficiency of Dtrie-allpair is about two orders of magnitude higher than that of All-pair and PPJoin;(3) the preprocess of the dataset with the combination of frequency-descending order and length-ascending order greatly reduces the number of accessed trie nodes and significantly improves the efficiency of Dtrie-allpair.
Keywords:set similarity  T-overlap join  overlap threshold  trie-based dynamic index  All-pair algorithm  PPJoin algorithm  frequency-descending order  length-ascending order
本文献已被 CNKI 万方数据 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号