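An Improved Data Quality Processing Algorithm for VLDB (一种改进的面向VLDB数据质量处理算法)
Cite this article: 王咏梅, 嵇晓, 汪恒杰, 冯安平. 一种改进的面向VLDB数据质量处理算法[J]. 科技咨询导报, 2009(2): 43-45.
Authors: Wang Yongmei (王咏梅), Ji Xiao (嵇晓), Wang Hengjie (汪恒杰), Feng Anping (冯安平)
Affiliations: [1] Advanced Vocational Technical College, Shanghai University of Engineering Science, Shanghai 200437; [2] Shanghai Baosight Software, Shanghai 201203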
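Abstract: Data quality is one of the most important problems enterprises face when building business intelligence systems. When handling data quality in a very large database (VLDB), identifying and consolidating fuzzy duplicate records is very difficult. This article proposes an improved data quality processing algorithm for VLDBs. An improved clustering-based N-gram algorithm first detects approximately duplicate records, pair-wise comparison computes their degree of similarity, and a fixed-size priority-queue window clusters the approximately duplicate records. A transitive-closure criterion is also introduced to produce a multi-pass clustering method that improves clustering accuracy. The algorithm achieves accuracy above 90% in both language identification and keyword detection.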

Keywords: data quality; clustering; multi-pass method
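
The abstract names N-gram detection and pair-wise similarity scoring but does not give the exact measure. The following is only a minimal sketch under stated assumptions: character trigrams and the Dice coefficient are illustrative choices, and the names `ngrams`, `similarity`, and `similar_pairs` are hypothetical, not taken from the article.

```python
from itertools import combinations


def ngrams(text, n=3):
    """Return the set of character n-grams of a case- and space-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) < n:
        # Strings shorter than n are treated as a single gram (assumption).
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def similarity(a, b, n=3):
    """Dice coefficient over n-gram sets, in [0, 1] (an assumed measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def similar_pairs(records, threshold=0.6):
    """Pair-wise comparison: index pairs whose similarity reaches the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if similarity(records[i], records[j]) >= threshold
    ]
```

Comparing every pair is quadratic in the number of records, which is the cost that a fixed-size comparison window is meant to avoid on a very large database.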

An Improved Arithmetic for Data Quality in Very Large Database
Wang Yongmei, Ji Xiao, Wang Hengjie, Feng Anping. An Improved Arithmetic for Data Quality in Very Large Database[J]. Science and Technology Consulting Herald, 2009(2): 43-45.
Authors: Wang Yongmei, Ji Xiao, Wang Hengjie, Feng Anping
Institution: 1. Shanghai University of Engineering Science, Advanced Vocational Technical College, Shanghai 200437; 2. Shanghai Baosight Software Company, Shanghai 201203
Abstract: Data quality is one of the most important problems in building a business intelligence system. Detecting and consolidating approximately duplicate records is difficult when handling data quality in a very large database. This article proposes an improved algorithm for very large databases. First, an efficient N-gram based clustering algorithm detects approximately duplicate records. A pair-wise comparison algorithm then computes the degree of similarity between candidate records. To cluster approximately duplicate records, an improved algorithm that employs a fixed-size priority queue is presented. In addition, a multi-pass clustering method based on a transitive-closure phase is proposed to improve clustering accuracy. The algorithm achieves more than 90% accuracy in both language identification and keyword detection.
Keywords: data quality; cluster; multiple-pass method
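
The priority-queue clustering and transitive-closure multi-pass steps mentioned in the abstract could look roughly like the sketch below. It is an illustration under assumptions, not the authors' implementation: the queue holds one representative record per recent cluster, a union-find structure takes the transitive closure across passes, and `difflib.SequenceMatcher` from the Python standard library stands in for the article's N-gram similarity. All function names and parameters here are hypothetical.

```python
from difflib import SequenceMatcher


class UnionFind:
    """Disjoint sets over record indices; merging gives the transitive closure."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


def default_sim(a, b):
    # Stand-in similarity (assumption); the article uses an N-gram based measure.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def one_pass(records, key, uf, queue_size=4, threshold=0.8, sim=default_sim):
    """Scan records in key order, comparing each only with queued representatives."""
    queue = []  # representative indices, most recently matched first
    for i in sorted(range(len(records)), key=lambda i: key(records[i])):
        for pos, rep in enumerate(queue):
            if sim(records[i], records[rep]) >= threshold:
                uf.union(i, rep)
                queue.insert(0, queue.pop(pos))  # matched cluster moves to the front
                break
        else:
            queue.insert(0, i)       # no match: record starts a new cluster
            del queue[queue_size:]   # keep the queue at a fixed size


def multi_pass_clusters(records, keys, **kwargs):
    """Run one pass per sort key; clusters are merged across passes."""
    uf = UnionFind(len(records))
    for key in keys:
        one_pass(records, key, uf, **kwargs)
    groups = {}
    for i in range(len(records)):
        groups.setdefault(uf.find(i), []).append(i)
    return sorted(groups.values())
```

For example, `multi_pass_clusters(records, keys=[str.lower, lambda r: r.lower()[::-1]])` sorts once by the lowercased record and once by its reverse. Because each record is compared with at most `queue_size` representatives per pass, the work per pass after sorting grows linearly with the number of records, and later passes can link records that an earlier sort order placed far apart.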
This article is indexed in VIP (维普) and other databases.