
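An Online Data Cleaning Method (一种在线数据清洗方法)
Cite this article: HAN Jing-yu, HU Kong-fa, XU Li-zhen, DONG Yi-sheng. An Online Data Cleaning Method[J]. Journal of Applied Sciences (应用科学学报), 2005, 23(3): 292-296.
Authors: HAN Jing-yu (韩京宇)  HU Kong-fa (胡孔法)  XU Li-zhen (徐立臻)  DONG Yi-sheng (董逸生)
Institution: Department of Computer Science and Engineering, Southeast University, Nanjing 210096, Jiangsu, China
Funding: Jiangsu Province Tenth Five-Year Plan High-Tech Project (BG2001013)
Abstract: A new online data cleaning method is proposed. The record strings in a reference table that is known to be clean are mapped to points in a high-dimensional space and partitioned by clustering. The points in each partition are then indexed with a B+ tree, which converts queries in the high-dimensional space into range queries in one-dimensional space. For each tuple of the input table, the index is searched with a branch-and-bound strategy for its KNN (K nearest neighbors) records, which identifies the records that best match the tuple. Theoretical analysis and experiments show that this is an effective approach to online data cleaning.

Keywords: data cleaning  branch and bound  B+ tree
Article ID: 0255-8297(2005)03-0292-05
Received: 2004-03-05
Revised: 2004-10-13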

An Online Data Cleaning Method
HAN Jing-yu,HU Kong-fa,XU Li-zhen,DONG Yi-Sheng.An Online Data Cleaning Method[J].Journal of Applied Sciences,2005,23(3):292-296.
Authors:HAN Jing-yu  HU Kong-fa  XU Li-zhen  DONG Yi-Sheng
Institution:Department of Computer Science and Engineering, Southeast University, Nanjing 210096, China
Abstract: A new method for online data cleaning is presented. First, each clean record in the reference table is mapped to a point in a high-dimensional metric space measured by Manhattan distance. Next, the points in the space are partitioned by clustering and indexed with a B+ tree. In this way, a search in the high-dimensional space is translated into a range search in one-dimensional space. To find the KNN (K nearest neighbors) in the reference table for each incoming record, a branch-and-bound search is employed. The top K records that best match the incoming record are then identified. Theoretical analysis and experiments show that this is an effective approach to online data cleaning.
Keywords: data cleaning  branch and bound  B+ tree
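
The pipeline described in the abstract (map records to points, partition them, index each partition by a one-dimensional key, then run a branch-and-bound KNN search) can be illustrated with a minimal Python sketch. Everything specific here is an assumption, since the abstract does not give the details: `embed` is a hypothetical character-count mapping, the first `n_clusters` records serve as crude cluster centers in place of a real clustering step, the one-dimensional key is taken to be the Manhattan distance from a point to its cluster center, and a sorted list searched with `bisect` stands in for the B+ tree.

```python
import bisect
import heapq
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "

def embed(record):
    """Hypothetical mapping: character-count vector over ALPHABET."""
    counts = Counter(record.lower())
    return tuple(counts.get(ch, 0) for ch in ALPHABET)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

class OneDimIndex:
    """Reference-table index; sorted lists stand in for a B+ tree."""

    def __init__(self, records, n_clusters=2):
        self.records = list(records)
        self.points = [embed(r) for r in self.records]
        # Assumption: the first n_clusters points act as cluster centers.
        self.centers = self.points[:n_clusters]
        # Per cluster: sorted (distance_to_center, record_id) pairs.
        self.clusters = [[] for _ in self.centers]
        for rid, p in enumerate(self.points):
            dists = [manhattan(p, c) for c in self.centers]
            cid = dists.index(min(dists))
            self.clusters[cid].append((dists[cid], rid))
        for entries in self.clusters:
            entries.sort()

    def knn(self, query, k=1):
        q = embed(query)
        best = []  # max-heap of the k best so far, stored as (-distance, rid)
        order = []
        for cid, center in enumerate(self.centers):
            entries = self.clusters[cid]
            if not entries:
                continue
            dqc = manhattan(q, center)
            radius = entries[-1][0]
            # Lower bound on the distance from q to any point in the cluster.
            order.append((max(0, dqc - radius), dqc, cid))
        order.sort()
        for lower, dqc, cid in order:
            if len(best) == k and lower >= -best[0][0]:
                break  # bound: no remaining cluster can improve the result
            r = -best[0][0] if len(best) == k else float("inf")
            entries = self.clusters[cid]
            # Triangle inequality: only keys in [dqc - r, dqc + r] can qualify,
            # so the search becomes a one-dimensional range query.
            lo = bisect.bisect_left(entries, (dqc - r, -1))
            hi = bisect.bisect_right(entries, (dqc + r, len(self.records)))
            for _, rid in entries[lo:hi]:
                d = manhattan(q, self.points[rid])
                if len(best) < k:
                    heapq.heappush(best, (-d, rid))
                elif d < -best[0][0]:
                    heapq.heapreplace(best, (-d, rid))
        ranked = sorted(best, key=lambda t: (-t[0], t[1]))
        return [(self.records[rid], -neg_d) for neg_d, rid in ranked]

reference = ["john smith", "jane smith", "acme corp",
             "acme corporation", "jon smyth"]
index = OneDimIndex(reference, n_clusters=2)
print(index.knn("jon smith", k=1))  # [('john smith', 1)]
```

The pruning relies only on the triangle inequality, which Manhattan distance satisfies, so the sketch returns the same distances as a brute-force scan while skipping keys outside the current search radius.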
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.