Non-Independent Term Selection for Chinese Text Categorization |
| |
Authors: | LI Jingyang SUN Maosong |
| |
Institution: | Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China |
| |
Abstract: | Chinese text categorization differs from English text categorization in its much larger term set (of words or character n-grams), which makes modern high-performance classifiers very slow to train and to run. This study assumes that this high-dimensionality problem is related to redundancy in the term set, which cannot be solved by traditional term selection methods. A greedy algorithm framework named "non-independent term selection" is presented, which reduces the redundancy according to string-level correlations. Several preliminary implementations of this idea are demonstrated. Experimental results show that a good tradeoff can be reached between classification performance and the size of the term set. |
| |
Keywords: | Chinese text categorization; term selection; dimensionality |
This article is indexed in CNKI, Wanfang Data, ScienceDirect, and other databases. |
|
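The abstract names a greedy framework that reduces redundancy according to string-level correlations but gives no details of it. The following is only a minimal sketch of what such a greedy pass could look like, not the authors' method. Everything in it is an assumption made for illustration: the function name `select_terms`, the precomputed per-term `scores`, the substring test used as the string-level relation, and the Jaccard overlap of document sets with a `max_overlap` threshold used as the redundancy check.

```python
def select_terms(scores, doc_sets, k, max_overlap=0.7):
    """Greedily pick up to k terms, skipping string-related redundant ones.

    scores:   dict term -> relevance score (hypothetical, e.g. from chi-square)
    doc_sets: dict term -> set of ids of the documents containing the term
    """
    selected = []
    # Visit terms from highest to lowest score; ties are broken alphabetically.
    for term in sorted(scores, key=lambda t: (-scores[t], t)):
        if len(selected) >= k:
            break
        redundant = False
        for kept in selected:
            # Assumed string-level relation: one term contains the other.
            if term in kept or kept in term:
                a, b = doc_sets[term], doc_sets[kept]
                union = a | b
                jaccard = len(a & b) / len(union) if union else 0.0
                # Assumed redundancy rule: nearly identical document coverage.
                if jaccard >= max_overlap:
                    redundant = True
                    break
        if not redundant:
            selected.append(term)
    return selected


# Toy data, invented purely for illustration.
scores = {"清华大学": 0.9, "清华": 0.85, "计算机": 0.7, "大学": 0.5}
doc_sets = {
    "清华大学": {1, 2, 3},
    "清华": {1, 2, 3, 4},
    "计算机": {2, 9},
    "大学": {1, 2, 3, 5, 6, 7, 8},
}
print(select_terms(scores, doc_sets, k=3))
```

In this toy run, "清华" is skipped because it is a substring of the already selected "清华大学" and the two occur in almost the same documents, while "大学" is kept because its document coverage differs substantially. A purely score-based top-3 selection would instead have kept both "清华大学" and "清华".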