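A dimensionality reduction method based on mixed CHI and PCA features for text classification
Cite this article: TANG Jiashan, DUAN Dandan. A dimensionality reduction method based on mixed CHI and PCA features for text classification[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2022, 34(1): 164-171. DOI: 10.3979/j.issn.1673-825X.202003020060
Authors: TANG Jiashan, DUAN Dandan
Affiliation: College of Science, Nanjing University of Posts and Telecommunications, Nanjing 210023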
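Abstract: Chinese text data is semi-structured or even unstructured, so classifying it involves high-dimensional features, and traditional single-method feature dimensionality reduction struggles to meet the text classification needs of the big data era. To address this, a hybrid feature dimensionality reduction method based on Chi-square statistics (CHI) and principal component analysis (PCA) is proposed (CHI-...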

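Keywords: Chinese text classification; feature dimensionality reduction; hybrid feature dimensionality reduction method (CHI-PCA); Chi-square statistics (CHI) method; principal component analysis (PCA)
Received: 2020-03-02
Revised: 2021-11-11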

Research on dimension reduction method based on mixed features of CHI and PCA in text classification
TANG Jiashan,DUAN Dandan. Research on dimension reduction method based on mixed features of CHI and PCA in text classification[J]. Journal of Chongqing University of Posts and Telecommunications, 2022, 34(1): 164-171. DOI: 10.3979/j.issn.1673-825X.202003020060
Authors:TANG Jiashan  DUAN Dandan
Affiliation:College of Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, P. R. China
Abstract: Chinese text data is semi-structured or even unstructured, so classifying it involves high-dimensional features. Traditional single-method feature dimensionality reduction struggles to meet the text classification needs of the big data era. To address this, a hybrid feature dimensionality reduction method (CHI-PCA) based on Chi-square statistics (CHI) and principal component analysis (PCA) is proposed. The method first uses CHI to screen out category-related feature words, and then applies PCA as a second stage of dimensionality reduction on the resulting feature word space. While reducing the feature dimension, it retains most of the information in the original feature space. The method is compared experimentally with the traditional feature dimensionality reduction methods document frequency (DF), information gain (IG), CHI, and PCA. The results show that, across different feature dimensions, the overall classification performance of the proposed method with Softmax regression, support vector machine (SVM), and k-nearest neighbor (KNN) classifiers is better than that of the comparison methods. The Macro-F1 value improves by up to 2.7%, and the classification performance in each category is also good. This shows that the two-stage feature dimensionality reduction method based on CHI-PCA is feasible: it improves classification performance while reducing the feature dimension.
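The two-stage idea in the abstract (CHI screening followed by PCA) can be sketched roughly as follows. This is a minimal illustration using NumPy, not the authors' implementation: the function names `chi2_scores`, `pca_fit`, and `chi_pca_fit_transform` are hypothetical, the input is assumed to be a binary term-presence matrix, and taking the maximum chi-square score over classes is one common aggregation chosen here as an assumption.

```python
import numpy as np


def chi2_scores(X, y):
    """Per-term chi-square score against the class labels.

    X: (n_docs, n_terms) term matrix, treated as binary presence/absence.
    y: (n_docs,) integer class labels.
    Assumption: the score of a term is its maximum over classes.
    """
    X = (np.asarray(X) > 0).astype(float)
    y = np.asarray(y)
    n = X.shape[0]
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        in_c = y == c
        A = X[in_c].sum(axis=0)       # docs in class c that contain the term
        B = X[~in_c].sum(axis=0)      # docs outside c that contain the term
        C = in_c.sum() - A            # docs in class c without the term
        D = (~in_c).sum() - B         # docs outside c without the term
        num = n * (A * D - B * C) ** 2
        denom = (A + C) * (B + D) * (A + B) * (C + D)
        chi = np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
        scores = np.maximum(scores, chi)
    return scores


def pca_fit(X, n_components):
    """Return the column means and the top principal axes (via SVD)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def chi_pca_fit_transform(X, y, k_chi, n_components):
    """Stage 1: keep the k_chi highest-scoring terms. Stage 2: PCA on them."""
    keep = np.argsort(chi2_scores(X, y))[::-1][:k_chi]
    X_sel = np.asarray(X, dtype=float)[:, keep]
    mean, components = pca_fit(X_sel, n_components)
    return (X_sel - mean) @ components.T, keep


# Toy example: terms 0 and 1 track the class, terms 2-4 are noise.
X_toy = np.array([
    [1, 0, 1, 0, 1],
    [1, 0, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 1, 1],
])
y_toy = np.array([0, 0, 0, 1, 1, 1])
Z_toy, kept_terms = chi_pca_fit_transform(X_toy, y_toy, k_chi=2, n_components=1)
```

On the toy data, the CHI stage keeps the two class-related terms and the PCA stage compresses them into a single component that separates the two classes. In practice the reduced matrix would be passed to a classifier such as Softmax regression, SVM, or KNN, and the selected terms and PCA axes fitted on training data would be reused for test documents.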
Keywords:Chinese text classification  feature reduction  Chi-square statistics-principal component analysis (CHI-PCA)  Chi-square statistics (CHI)  principal component analysis (PCA)
This article is indexed in Wanfang Data and other databases.