
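Chinese Web Page Feature Selection Method Based on Sequential Data Mining
Cite this article: GU Feng, LIU Chen-xi, WU Yang-yang. Chinese Web page feature selection method based on sequential data mining [J]. Journal of Shandong University (Natural Science), 2006, 41(3): 95-99.
Authors: GU Feng, LIU Chen-xi, WU Yang-yang
Affiliation: Department of Computer Science, Huaqiao University, Quanzhou 362021, Fujian, China
Abstract: A method based on sequential data mining is proposed for selecting candidate features from Chinese Web pages, and it is applied to a Chinese Web page classification model. The method uses an improved PAT tree structure to mine strings that occur frequently within the same class of Chinese Web pages. By computing net frequency, it extracts meaningful words, phrases, and English words that occur frequently in Chinese Web pages, and it combines these with the CHI algorithm to obtain text features. Experiments show that the algorithm not only recovers the vast majority of the features selected by traditional methods, but also finds meaningful personal names, place names, new words, common expressions, and foreign words that are absent from the word segmentation system's lexicon and that reflect the characteristics of each class.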

Keywords: sequential data mining; PAT tree; net frequency; frequent string; Chinese Web page classification
Article ID: 1671-9352(2006)03-0097-04
Received: 2006-03-29
Revised: 2006-03-29

Chinese Web page feature selection method based on sequential data mining
GU Feng, LIU Chen-xi, WU Yang-yang. Chinese Web page feature selection method based on sequential data mining [J]. Journal of Shandong University, 2006, 41(3): 95-99.
Authors: GU Feng, LIU Chen-xi, WU Yang-yang
Institution: Department of Computer Science and Technology, Huaqiao University, Quanzhou 362021, Fujian, China
Abstract: A method is proposed for selecting candidate features from Chinese Web pages on the basis of sequential data mining, and it is applied to a Chinese Web page classification model. The method uses an improved PAT tree data structure to mine the strings that occur frequently within the same class of Chinese Web pages, computes their net frequency to extract frequent, meaningful words, phrases, and English words, and obtains text features with the help of the CHI algorithm. Experiments show that the algorithm not only finds most of the features selected by the traditional algorithm, but also finds meaningful personal names, place names, new words, phrases, and foreign words that the traditional algorithm misses.
Keywords: sequential data mining; PAT tree; net frequency; frequent string; Chinese Web page classification
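
The pipeline outlined in the abstract (mine strings that are frequent within one class, then rank them with CHI) can be illustrated with a small Python sketch. It is not the authors' implementation and rests on several assumptions: substrings are enumerated by brute force rather than with a PAT tree, the net-frequency step is omitted because the abstract does not define it, and the names `substrings`, `chi_square`, `select_features`, `min_df`, `max_len`, and `top_k` are hypothetical. Only the CHI formula itself is the standard chi-square statistic for a term/class contingency table.

```python
from collections import Counter


def substrings(text, max_len=4):
    """Return the set of distinct substrings of length 1..max_len in text."""
    return {text[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(text) - n + 1)}


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, target, min_df=2, max_len=4, top_k=5):
    """Rank substrings that are frequent in the target class by chi-square.

    Assumption: "frequent" means appearing in at least min_df documents
    of the target class (document frequency, not the paper's net frequency).
    """
    in_class = [substrings(d, max_len)
                for d, lab in zip(docs, labels) if lab == target]
    out_class = [substrings(d, max_len)
                 for d, lab in zip(docs, labels) if lab != target]
    df_in = Counter(s for doc in in_class for s in doc)
    df_out = Counter(s for doc in out_class for s in doc)

    scored = []
    for s, a in df_in.items():
        if a < min_df:
            continue  # not frequent enough in the target class
        b = df_out.get(s, 0)
        c = len(in_class) - a
        d = len(out_class) - b
        scored.append((chi_square(a, b, c, d), s))

    # Highest score first; ties go to the longer string, then lexical order.
    scored.sort(key=lambda x: (-x[0], -len(x[1]), x[1]))
    return [(s, score) for score, s in scored[:top_k]]


if __name__ == "__main__":
    docs = ["足球比赛精彩", "足球联赛开幕", "股票市场上涨", "股票基金下跌"]
    labels = ["sports", "sports", "finance", "finance"]
    for term, score in select_features(docs, labels, "sports"):
        print(term, score)
```

Because the candidates are raw substrings rather than segmenter output, strings missing from a segmentation lexicon (names, new words, foreign words) can still surface as features, which is the property the abstract emphasizes. A suffix-tree style index such as the PAT tree would replace the brute-force enumeration on realistic corpora.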
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.