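An incremental algorithm for data cleansing based on a clustering tree
Cite this article: Liu Fang, He Fei. An incremental algorithm for data cleansing based on a clustering tree [J]. Journal of Huazhong University of Science and Technology (Natural Science Edition), 2005, 33(3): 46-48.
Authors: Liu Fang, He Fei
Affiliation: College of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China
Funding: Supported by a National "Tenth Five-Year Plan" Major Science and Technology Fund project (2001BA102A0611).
Abstract: This paper studies the identification of approximately duplicate records when a data set grows dynamically while the data schema and the matching rules remain unchanged, and proposes IACT, an incremental data cleansing algorithm based on a clustering tree. The algorithm first partitions the records by building a clustering tree, then computes similarities only within each partition to identify approximately duplicate records, which completes the incremental detection of similar duplicate records. Experimental results show that IACT is more efficient than the multi-pass sorted neighborhood (MPN) algorithm with no loss of precision.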

Keywords: data cleansing; approximately duplicate records; clustering tree
Article ID: 1671-4512(2005)03-0046-03
Revised manuscript received: 16 July 2004

An incremental algorithm for data cleansing based on a clustering tree
Liu Fang, He Fei. An incremental algorithm for data cleansing based on a clustering tree [J]. Journal of Huazhong University of Science and Technology (Natural Science Edition), 2005, 33(3): 46-48.
Authors: Liu Fang, He Fei
Institution: College of Computer Sci. & Tech., Huazhong Univ. of Sci. & Tech., Wuhan 430074, China (Liu Fang, Dr.)
Abstract: This paper studies the problem of detecting approximately duplicate records when data arrives in increments while the data schema and the matching rule set remain unchanged, and presents an incremental algorithm, IACT (Incremental Algorithm based on Clustering Trees for data cleansing). IACT builds a clustering tree to partition the records into a few areas, then computes similarities only within each area to identify approximately duplicate records. Experimental results show that IACT is more efficient than the MPN (multi-pass sorted neighborhood) algorithm while achieving the same precision.
Keywords: data cleansing; approximately duplicate records; clustering tree
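
The partition-then-compare idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's IACT algorithm: the abstract does not specify how the clustering tree is built, so the sketch assumes a one-level tree (a root whose children are leaf clusters), a generic string similarity from the standard library in place of the paper's matching rules, and arbitrary thresholds. The names `ClusterIndex`, `similarity`, `partition_threshold`, and `duplicate_threshold` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; an assumed stand-in for the paper's matching rules."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class ClusterIndex:
    """Hypothetical one-level cluster tree: a root whose children are leaf clusters.

    Each leaf is a (representative, members) pair, where the representative
    is the first record that was placed in that cluster.
    """

    def __init__(self, partition_threshold: float = 0.5,
                 duplicate_threshold: float = 0.85) -> None:
        # Both thresholds are illustrative assumptions, not values from the paper.
        self.partition_threshold = partition_threshold
        self.duplicate_threshold = duplicate_threshold
        self.clusters: list[tuple[str, list[str]]] = []

    def insert(self, record: str) -> list[str]:
        """Add one new record; return stored records it approximately duplicates."""
        # Step 1 (partition): find the cluster whose representative is closest.
        best, best_sim = None, 0.0
        for cluster in self.clusters:
            sim = similarity(record, cluster[0])
            if sim > best_sim:
                best, best_sim = cluster, sim

        # No sufficiently close cluster: the record starts a new partition.
        if best is None or best_sim < self.partition_threshold:
            self.clusters.append((record, [record]))
            return []

        # Step 2 (compare): detailed similarity only inside the chosen partition.
        members = best[1]
        duplicates = [m for m in members
                      if similarity(record, m) >= self.duplicate_threshold]
        members.append(record)
        return duplicates
```

Because each arriving record is compared in detail only with the members of one cluster, previously cleaned data does not have to be rescanned when an increment arrives, which is the efficiency argument the abstract makes. Calling `insert` once per incoming record processes an increment.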
This article is indexed in CNKI, Wanfang Data, and other databases.