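An ensemble learning algorithm based on Gaussian oversampling
Cite this article: ZHANG Zhongliang, CHEN Yuyu, TANG Jiayi, LUO Xinggang. An ensemble learning algorithm based on Gaussian oversampling[J]. Systems Engineering - Theory & Practice, 2021, 0(2): 513-523
Authors: ZHANG Zhongliang, CHEN Yuyu, TANG Jiayi, LUO Xinggang
Affiliations: College of Management, Hangzhou Dianzi University; Antai College of Economics and Management, Shanghai Jiao Tong University
Funding: Key Program of the National Natural Science Foundation of China (71831006); Young Scientists Fund of the National Natural Science Foundation of China (71801065); Key Program of the Zhejiang Provincial Natural Science Foundation (LZ20G010001); Fundamental Research Funds for the Provincial Universities of Zhejiang (GK209907299001-2).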
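Abstract: In data mining research, classification tasks are widely affected by imbalanced class distributions, for example in manufacturing condition monitoring, medical diagnosis, and financial services. SMOTE is a common technique for imbalanced classification, and combining it with Boosting can further improve classifier performance, but this kind of ensemble tends to lose diversity among its base classifiers. To address this, the paper proposes a Boosting ensemble learning algorithm based on Gaussian-process SMOTE oversampling (Gaussian-based smote in boosting, GSMOTEBoost). The algorithm builds an imbalanced learning model within the Boosting framework. To improve the robustness of the classification system, it uses Gaussian-process-based SMOTE oversampling to increase the diversity of the training samples given to each base classifier, which in turn increases the differences among the base classifiers. To validate the algorithm, commonly used methods for imbalanced classification serve as baselines, 20 benchmark datasets from the KEEL repository are used for testing, G-mean, F-measure, and AUC serve as evaluation metrics, and statistical tests are used to analyze the experimental results. The results show that the proposed GSMOTEBoost has a significant advantage over the other algorithms.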
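The abstract does not spell out the sampling rule, so the following is only a rough sketch of the kind of oversampling step it describes. It implements standard SMOTE interpolation between a minority sample and one of its nearest minority neighbours, then adds a small Gaussian jitter. The jitter, the function name `gaussian_smote`, and the parameters `k` and `sigma` are assumptions made for illustration and may differ from the paper's Gaussian-process-based procedure.

```python
import math
import random


def gaussian_smote(minority, n_new, k=5, sigma=0.05, rng=None):
    """Create n_new synthetic minority points (illustrative sketch only).

    Standard SMOTE step: pick a minority point x and one of its k nearest
    minority neighbours z, then interpolate x + u * (z - x) with u ~ U(0, 1).
    Assumed Gaussian step (NOT taken from the paper): add N(0, sigma^2) noise
    to each coordinate so that repeated calls give more varied training sets.
    """
    rng = rng or random.Random()
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority points by Euclidean distance to x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        z = minority[rng.choice(others[:k])]
        u = rng.random()
        synthetic.append(
            [a + u * (b - a) + rng.gauss(0.0, sigma) for a, b in zip(x, z)]
        )
    return synthetic
```

In a boosting loop, a function like this would be called once per round, before fitting that round's base classifier, so that each base classifier sees a different augmented training set. Setting `sigma=0.0` reduces the sketch to plain SMOTE interpolation.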

Keywords: imbalanced data; classification algorithm; SMOTE; ensemble learning; data mining

An ensemble learning algorithm with Gaussian-based oversampling
ZHANG Zhongliang, CHEN Yuyu, TANG Jiayi, LUO Xinggang. An ensemble learning algorithm with Gaussian-based oversampling[J]. Systems Engineering - Theory & Practice, 2021, 0(2): 513-523
Authors: ZHANG Zhongliang, CHEN Yuyu, TANG Jiayi, LUO Xinggang
Affiliation: College of Management, Hangzhou Dianzi University, Hangzhou 310018, China; Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai 200030, China
Abstract: Class imbalance is common in data mining classification tasks such as manufacturing condition monitoring, medical diagnosis, and financial services. The synthetic minority over-sampling technique (SMOTE) is a common way to handle imbalanced datasets, and it can be strengthened by embedding it in a boosting framework. However, this strategy can easily reduce the diversity of the base classifiers in the ensemble. To address this, a boosting algorithm that integrates Gaussian-process-based SMOTE oversampling is proposed for imbalanced learning, namely Gaussian-based smote in boosting (GSMOTEBoost). To improve the robustness of the classification system, GSMOTEBoost is built on the AdaBoost framework, and in each iteration a SMOTE oversampling technique based on a Gaussian process is used to increase the diversity of the base classifiers. To verify the effectiveness of the algorithm, experiments are conducted on twenty datasets from the KEEL repository, comparing against well-known imbalanced learning algorithms. G-mean, F-measure, and AUC are used as assessment metrics, and hypothesis testing is used to analyze the experimental results. The results, supported by statistical analysis, indicate that the proposed GSMOTEBoost significantly outperforms the comparison methods.
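For readers unfamiliar with the three metrics named in the abstract, the sketch below computes them with their usual textbook definitions: G-mean as the geometric mean of sensitivity and specificity, F-measure as the harmonic mean of precision and recall (F1), and AUC as the probability that a randomly chosen positive scores higher than a randomly chosen negative. The function names and the convention that the minority class is labelled `1` are assumptions for illustration; the paper may use variants such as a weighted F-measure.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class (F1)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

These metrics are preferred over plain accuracy on imbalanced data because a classifier that always predicts the majority class can reach high accuracy while its G-mean and F-measure are both zero.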
Keywords: imbalanced data; classification algorithm; SMOTE; ensemble learning; data mining
This article is indexed in CNKI, VIP (维普), and other databases.