Chinese Web page feature selection method based on sequential data mining



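Cite this article:	GU Feng, LIU Chen-xi, WU Yang-yang. Chinese Web page feature selection method based on sequential data mining[J]. Journal of Shandong University (Natural Science), 2006, 41(3): 95-99.

Authors:	GU Feng, LIU Chen-xi, WU Yang-yang

Affiliation:	Department of Computer Science, Huaqiao University, Quanzhou 362021, Fujian, China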

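Abstract:	This paper proposes a method, based on sequential data mining, for selecting candidate features from Chinese Web pages, and applies it to a Chinese Web page classification model. The method uses an improved PAT tree structure to mine strings that occur frequently within the same class of Chinese Web pages. By computing net frequency, it extracts the meaningful words, phrases, and English words that occur frequently in those pages, and it combines this with the CHI algorithm to obtain the text features. Experiments show that the algorithm not only finds the vast majority of the features selected by traditional methods, but also finds meaningful items that are absent from the lexicon of the word segmentation system and that reflect the characteristics of a class, such as personal names, place names, new words, common expressions, and foreign words.
Keywords:	sequential data mining; PAT tree; net frequency; frequent string; Chinese Web page classification
Article ID:	1671-9352(2006)03-0097-04
Received:	2006-03-29
Revised:	2006-03-29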
Chinese Web page feature selection method based on sequential data mining

GU Feng, LIU Chen-xi, WU Yang-yang. Chinese Web page feature selection method based on sequential data mining[J]. Journal of Shandong University, 2006, 41(3): 95-99.

Authors:	GU Feng, LIU Chen-xi, WU Yang-yang

Institution:	Department of Computer Science and Technology, Huaqiao University, Quanzhou 362021, Fujian, China

Abstract:	A method based on sequential data mining is proposed for selecting candidate features from Chinese Web pages, and it is applied to a Chinese Web page classification model. The method uses an improved PAT tree data structure to mine the strings that occur frequently within the same class of Chinese Web pages, calculates their net frequency to extract frequent, meaningful words, phrases, and English words, and obtains the text features with the help of the CHI algorithm. Experiments show that the algorithm not only finds most of the features selected by the traditional algorithm, but also finds additional meaningful personal names, place names, new words, phrases, and foreign words.

Keywords:	sequential data mining; PAT tree; net frequency; frequent string; Chinese Web page classification
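
The abstract says that candidate strings are combined with the CHI algorithm to obtain the final text features, but this record does not give the formula or the details. As a rough illustration only, the sketch below scores candidate strings with the standard chi-square statistic over a 2x2 term/class contingency table. The function names, the substring-based presence test, and the toy documents are assumptions made for this example; they are not taken from the paper, and the PAT tree and net-frequency steps are not modeled here.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard chi-square statistic for a term/class 2x2 table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (e.g. the term appears everywhere) carries no signal.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def score_candidates(docs, labels, candidates, target):
    """Rank candidate strings by chi-square against one target class.

    Presence is tested with plain substring matching, which is an
    assumption of this sketch rather than the paper's procedure.
    """
    scores = {}
    for s in candidates:
        a = b = c = d = 0
        for text, label in zip(docs, labels):
            present = s in text
            in_class = label == target
            if present and in_class:
                a += 1
            elif present:
                b += 1
            elif in_class:
                c += 1
            else:
                d += 1
        scores[s] = chi_square(a, b, c, d)
    # Highest score first; ties keep the order of `candidates`.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy corpus: two "sports" pages and two "finance" pages.
docs = ["世界杯 足球 比赛", "足球 联赛 进球", "股票 市场 上涨", "基金 股票 下跌"]
labels = ["sports", "sports", "finance", "finance"]
ranked = score_candidates(docs, labels, ["足球", "股票", "比赛"], "sports")
for term, score in ranked:
    print(term, round(score, 3))
```

Note that the statistic is symmetric: a string that appears only outside the target class scores as high as one that appears only inside it, so a practical pipeline would usually also check the direction of the association before keeping a feature.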
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.