Journal of Northeastern University (Natural Science) ›› 2020, Vol. 41 ›› Issue (11): 1521-1527. DOI: 10.12068/j.issn.1005-3026.2020.11.001

• Information and Control •

A Multivariate Decision Tree Algorithm Applicable to Categorical Attribute Data

LIU Zhen-yu1,2, SONG Xiao-ying2

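  1. Software Center, Northeastern University, Shenyang 110819, Liaoning, China; 2. Key Laboratory of Network Security and Computing Technology, Dalian Neusoft University of Information, Dalian 116023, Liaoning, China
  • Received: 2019-10-24 Revised: 2019-10-24 Online: 2020-11-15 Published: 2020-11-16
  • Corresponding author: LIU Zhen-yu
  • About the author: LIU Zhen-yu (born 1978), male, from Huludao, Liaoning; doctoral candidate at Northeastern University.
  • Supported by:
    National Natural Science Foundation of China (61772101, 61602075); Key Research and Development Program of Liaoning Province (2018).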

An Applicable Multivariate Decision Tree Algorithm for Categorical Attribute Data

LIU Zhen-yu1,2, SONG Xiao-ying2   

  1. Software Center, Northeastern University, Shenyang 110819, China; 2. Key Laboratory of Network Security and Computing Technology, Dalian Neusoft University of Information, Dalian 116023, China.
  • Received:2019-10-24 Revised:2019-10-24 Online:2020-11-15 Published:2020-11-16
  • Contact: LIU Zhen-yu

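Abstract: Most multivariate decision trees can combine only numerical attributes and cannot directly classify datasets that contain categorical attributes. To address this problem, a multivariate decision tree algorithm that can combine attributes of multiple types (CMDT) is proposed. The algorithm uses the frequency distribution of each categorical attribute's values within each class or each cluster to define the center of a sample set on the categorical attributes, as well as the distance from a sample to that center. A weighted k-means algorithm is then used to split the non-terminal nodes of the decision tree. Decision trees built with this node-splitting method can be applied to numerical, categorical, and mixed data. Experimental results show that, on datasets of all these types, the classification models built by the algorithm achieve higher generalization accuracy and more concise tree structures than classic decision tree algorithms.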

Key words: decision tree, categorical attribute, multivariate decision tree, node split, k-means

Abstract: Most multivariate decision trees are applicable to numerical data only. To solve the classification problem on categorical attribute data, a multivariate decision tree algorithm (CMDT) applicable to such data is proposed. The center of a sample set on the categorical attributes, and the distance between a sample and a center, are defined using the frequency distribution of categorical attribute values in each class or each cluster. A weighted k-means algorithm is then used to split the nodes of the decision tree. The proposed multivariate decision tree is applicable to numerical data, categorical data, and mixed data. Experimental results show that, across these kinds of data, the classification model built by the proposed algorithm achieves a more concise tree structure and higher generalization accuracy than models built by classic decision tree algorithms.
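
The abstract does not give the exact formulas, so the following Python sketch only illustrates the general idea of a frequency-based center and distance for categorical attributes, followed by a k-means-style node split. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names (`categorical_center`, `distance_to_center`, `split_node`), the distance defined as a weighted sum of one minus the frequency of the sample's value, the reading of "weighted" as per-attribute weights, and the choice to start with one cluster per class.

```python
from collections import Counter


def categorical_center(samples):
    """Center of a sample set: per-attribute relative frequency of each value."""
    n = len(samples)
    n_attrs = len(samples[0])
    return [
        {value: count / n
         for value, count in Counter(row[j] for row in samples).items()}
        for j in range(n_attrs)
    ]


def distance_to_center(sample, center, weights=None):
    """Assumed distance: weighted sum over attributes of
    (1 - frequency of the sample's value in the center)."""
    if weights is None:
        weights = [1.0] * len(center)
    return sum(
        w * (1.0 - freq.get(value, 0.0))
        for value, freq, w in zip(sample, center, weights)
    )


def split_node(samples, labels, weights=None, max_iter=20):
    """k-means-style split (assumed scheme): start with one cluster per class,
    then reassign each sample to its nearest center until nothing changes."""
    classes = sorted(set(labels))
    assign = [classes.index(y) for y in labels]
    for _ in range(max_iter):
        # Recompute one frequency-based center per non-empty cluster.
        centers = []
        for c in range(len(classes)):
            members = [s for s, a in zip(samples, assign) if a == c]
            centers.append(categorical_center(members) if members else None)
        live = [c for c in range(len(classes)) if centers[c] is not None]
        new_assign = [
            min(live, key=lambda c: distance_to_center(s, centers[c], weights))
            for s in samples
        ]
        if new_assign == assign:
            break
        assign = new_assign
    return assign


if __name__ == "__main__":
    data = [("red", "small"), ("red", "small"),
            ("blue", "large"), ("blue", "large")]
    print(split_node(data, ["a", "a", "b", "b"]))
```

Each returned index identifies the child branch a sample would follow under this sketch. Numerical attributes could be handled alongside with an ordinary mean and squared difference, which is how a single split could cover mixed data.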

Key words: decision tree, categorical attribute, multivariate decision tree, node split, k-means
