
民用建筑“四节一环保”数据的清洗与修复方法研究
引用本文:申鸿怡,徐芳芳,王新民. 民用建筑“四节一环保”数据的清洗与修复方法研究[J]. 北京大学学报(自然科学版), 2020, 56(5): 785-795. DOI: 10.13209/j.0479-8023.2020.019
作者姓名:申鸿怡  徐芳芳  王新民
作者单位:1. 北京大学前沿交叉学科研究院大数据科学研究中心, 北京 100871; 2. 山东科技大学数学与系统科学学院, 青岛 266590; 3. 北京大学数学科学学院, 北京 100871
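Funding:Supported by the National Key Research and Development Program of China (2018YFC0704300) and the National Natural Science Foundation of China (11901359)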
摘    要:针对民用建筑"四节一环保"原始数据中存在的数据质量问题,使用多种方法实现数据清洗与数据修复。数据清洗方面,重点关注单栋建筑能耗数据中存在的相似重复记录及异常记录。其中,识别异常记录采用3σ准则、DBSCAN聚类算法及箱线图内限3种方法。数据修复方面,重点关注缺失值的填补及基于模型的数据修正。其中,缺失值的填充使用简单填充、线性回归模型和基于用户的协同过滤推荐算法,并以平均绝对误差为评估指标进行对比。基于多元线性回归、主成分回归、偏最小二乘回归、岭回归及Lasso回归5种模型,拟合建筑运行能耗与各解释变量间的关系,对上海市建筑运行能耗相关数据进行数据修复。结果显示,单栋建筑能耗数据适合采用箱线图内限来识别异常记录,并使用中位数填补缺失数据;上海市建筑运行能耗相关数据中,岭回归模型的拟合情况最好。

关 键 词:四节一环保  数据清洗  数据修复  DBSCAN聚类算法  基于用户的协同过滤推荐算法  岭回归
Received:2019-08-13

Research on Cleaning and Repairing Methods of Civil Building Data on Resources Saving and Environment Protection
SHEN Hongyi,XU Fangfang,WANG Xinmin. Research on Cleaning and Repairing Methods of Civil Building Data on Resources Saving and Environment Protection[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2020, 56(5): 785-795. DOI: 10.13209/j.0479-8023.2020.019
Authors:SHEN Hongyi  XU Fangfang  WANG Xinmin
Affiliation:1. Center for Data Science, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871; 2. College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao 266590; 3. School of Mathematical Sciences, Peking University, Beijing 100871
Abstract:To address the data quality problems in the raw civil building data on resources saving and environment protection, several methods are applied for data cleaning and data repairing. For data cleaning, the authors focus on approximately duplicated records and abnormal records in the energy consumption data of individual buildings. Abnormal records are identified with three methods: the 3σ rule, the DBSCAN clustering algorithm, and the inner fences of the boxplot. For data repairing, the authors focus on filling in missing values and on model-based data correction. Missing values are filled in three ways: simple imputation, a linear regression model, and a user-based collaborative filtering recommendation algorithm. The mean absolute error is used as the evaluation metric to compare the filling results. To repair the building operational energy consumption data from Shanghai, five models (multiple linear regression, principal component regression, partial least squares regression, ridge regression, and Lasso regression) are used to fit the relationship between building operational energy consumption and the explanatory variables. The results show that, for the energy consumption data of individual buildings, the inner fences of the boxplot are suitable for identifying abnormal records and the median is suitable for filling in missing values. For the building operational energy consumption data from Shanghai, the ridge regression model fits best.
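As a rough illustration of two of the steps named in the abstract (flagging abnormal records with the boxplot inner fences and filling missing values with the median), here is a minimal Python sketch. It is not the authors' implementation: the function names, the conventional multiplier k = 1.5, the quartile interpolation method, the choice to compute the median from non-flagged values only, and the sample readings are all assumptions made for illustration. A 3σ check and a mean absolute error helper are included for comparison.

```python
import statistics


def inner_fences(values, k=1.5):
    """Boxplot inner fences: (Q1 - k*IQR, Q3 + k*IQR). k = 1.5 is assumed."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def three_sigma_limits(values):
    """3-sigma limits: mean +/- 3 population standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return mean - 3 * sd, mean + 3 * sd


def flag_abnormal(series, lower, upper):
    """True for each non-missing value that lies outside [lower, upper]."""
    return [v is not None and not (lower <= v <= upper) for v in series]


def fill_missing_with_median(series, flags):
    """Replace None with the median of observed, non-flagged values.

    Flagged values are left in place; only missing entries are filled.
    """
    kept = [v for v, bad in zip(series, flags) if v is not None and not bad]
    fill = statistics.median(kept)
    return [fill if v is None else v for v in series]


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired values."""
    return statistics.fmean(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical monthly energy readings for one building; None marks a gap.
readings = [120.0, 118.0, None, 125.0, 119.0, 980.0, 122.0, 121.0]
observed = [v for v in readings if v is not None]

fence_lo, fence_hi = inner_fences(observed)
sigma_lo, sigma_hi = three_sigma_limits(observed)
flags = flag_abnormal(readings, fence_lo, fence_hi)
repaired = fill_missing_with_median(readings, flags)

print("inner fences:", fence_lo, fence_hi)
print("3-sigma limits:", sigma_lo, sigma_hi)
print("flagged:", flags)
print("repaired:", repaired)
```

In this toy series the reading 980.0 falls outside the inner fences but inside the 3σ limits, because a single extreme value inflates the standard deviation of such a short series. This says nothing about how the methods compared on the paper's data.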
Keywords:resources saving and environment protection  data cleaning  data repairing  DBSCAN clustering algorithm  user-based collaborative filtering  ridge regression  