     

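Research on a MapReduce-based distributed improved random forest model for classifying student employment data
Cite this article:QIAO Fei, GE Yanhao, KONG Weichang. Research on a MapReduce-based distributed improved random forest model for classifying student employment data[J]. Systems Engineering - Theory & Practice, 2017, 37(5): 1383-1392. DOI: 10.12011/1000-6788(2017)05-1383-10
Authors:QIAO Fei  GE Yanhao  KONG Weichang
Affiliation:CIMS Research Center, College of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
Funding:National Natural Science Foundation of China (71690234)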
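Abstract:Educational data mining is a frontier research area in the development of educational informatization, and it is attracting growing attention from educators and data scientists. In the "big data" era, as the scale of data processing keeps surging, existing data mining models hit a computational bottleneck on a single processing node, and a variety of distributed computing frameworks for big data processing have emerged in response. With these frameworks, machine learning models aimed at mining university graduate employment data can meet future large-scale processing demands and support data mining and decision-making in information integration systems with very large datasets. Against this background, this study compares the classification performance of existing data models on the target data and proposes an improved random forest model. The improved model introduces weighting coefficients for the input features when computing each feature's information gain, and uses that gain as the criterion for choosing the optimal feature to split on. Simulation tests measure how much the improved model gains in classification performance over existing models. In addition, to address the single-node performance bottleneck in classifying massive datasets, the study proposes a classification and prediction model for large-scale student employment data based on a distributed version of the improved random forest algorithm. The MapReduce distributed computing framework is used to implement serialized writing and deserialized loading of the trained model between the local disk and the distributed file system, which extends the improved random forest model into a distributed model for large-scale data classification.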
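The abstract does not state the weighted information gain formula, so the following is only a minimal sketch of the general idea, under the assumption that each feature's ordinary information gain is multiplied by a per-feature coefficient before the best split is chosen. The function names, the `weights` dictionary, and the multiplicative form are all hypothetical illustrations, not the authors' equation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Standard information gain of splitting on a categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split_feature(rows, labels, weights):
    """Pick the feature with the largest weighted gain.

    ASSUMPTION: weighted gain = weights[f] * information_gain(f).
    The paper's actual weighting scheme is not given in the abstract.
    """
    return max(
        weights,
        key=lambda f: weights[f] * information_gain(rows, labels, f),
    )
```

With equal weights this reduces to the usual information gain criterion; lowering a feature's coefficient makes the tree less likely to split on it even when its raw gain is high.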

Keywords:machine learning  data classification model  big data processing  MapReduce
Received:2016-07-18

MapReduce based distributed improved random forest model for graduates career classification
QIAO Fei, GE Yanhao, KONG Weichang. MapReduce based distributed improved random forest model for graduates career classification[J]. Systems Engineering - Theory & Practice, 2017, 37(5): 1383-1392. DOI: 10.12011/1000-6788(2017)05-1383-10
Authors:QIAO Fei  GE Yanhao  KONG Weichang
Affiliation:CIMS Research Center, College of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
Abstract:Educational data mining (EDM) applies data mining technology to the education sector. In EDM research, data mining techniques are used to model sample datasets from the field of education, with the aim of learning from them and predicting on test data using effective statistical machine learning models. Combined with distributed computing frameworks, machine learning models in EDM can meet the needs of large-scale data processing, provide tailored data recommendations, and support future decision-making. Against this background, this study first trains and evaluates a range of data models in simulation, then proposes an improved model that raises classification performance by adjusting the model and by using a new information gain equation when selecting the optimal feature to split on. Building on the best-performing model from that comparison and on the application background of the "big data" era, we propose a random forest algorithm for classifying large-scale datasets on the MapReduce distributed computing framework. Using MapReduce, we design and implement a system that meets this requirement, in which a trained model can be serialized and deserialized between local disks and the distributed file system.
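The serialize-then-load pattern described in the abstract can be sketched in plain Python as follows. This is an illustration only: it uses `pickle` and a local file as a stand-in for the distributed file system, represents each tree as a hypothetical `(feature, branches, default)` tuple, and emulates a map task with a generator. None of these names or structures come from the paper, and no Hadoop API is used.

```python
import pickle
from collections import Counter


def tree_predict(tree, row):
    """Classify one record with a one-level tree (hypothetical layout)."""
    feature, branches, default = tree
    return branches.get(row.get(feature), default)


def forest_predict(forest, row):
    """Majority vote over all trees in the forest."""
    votes = Counter(tree_predict(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]


def save_model(forest, path):
    """Serialize the trained forest to a file."""
    with open(path, "wb") as fh:
        pickle.dump(forest, fh)


def load_model(path):
    """Deserialize a forest previously written by save_model."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def map_classify(model_path, records):
    """Map-style task: load the model once, then emit (key, label) pairs.

    ASSUMPTION: in a real MapReduce job the model file would be fetched
    from the distributed file system before the task starts.
    """
    forest = load_model(model_path)
    for key, row in records:
        yield key, forest_predict(forest, row)
```

Because every map task only reads the serialized model and classifies its own input split, the prediction step scales with the number of tasks, which is the property the abstract relies on for large datasets.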
Keywords:machine learning  data classification model  big data processing  MapReduce
This article is indexed in CNKI and other databases.