Method of Web Content Mining based on XML

Cite this article:	ZHENG Xia, CHEN Jianguo. Method of Web Content Mining based on XML[J]. Journal of Shenyang University, 2012, 24(3): 52-55.

Authors:	ZHENG Xia, CHEN Jianguo

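Affiliations:	1. Department of Computer Science, Minjiang University, Fuzhou 350001, Fujian, China; 2. Software College, Fujian University of Technology, Fuzhou 350003, Fujian, China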

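Abstract:	Based on an analysis of the characteristics of Web content mining, this paper proposes a Web content mining model built on XML technology. The HITS algorithm is used to identify authoritative Web pages, and the HTML Tidy tool cleans non-XML files and converts them into well-formed XML documents. Using an automatic extraction system for traditional scientific papers on the Internet as an example, text clustering and classification techniques are applied to mine the XML document data. Experimental results show that the model works well and can extract Web page content automatically and effectively.
Keywords:	Web mining; data mining; text clustering; non-XML documents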
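
The abstract names HITS as the step that identifies authoritative Web pages. As a rough illustration only, and not the authors' implementation, the following Python sketch runs the standard hub/authority iteration on a small made-up link graph. The function names `hits` and `_normalize`, the dictionary input format, the iteration count, and the sample page names are all assumptions introduced here.

```python
from math import sqrt


def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries.

    links maps each page to the list of pages it links to
    (hypothetical input format chosen for this sketch).
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking in.
        new_auth = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: sum the authority scores of all pages linked to.
        new_hub = {
            page: sum(new_auth[dst] for dst in links.get(page, ()))
            for page in pages
        }

        auth = _normalize(new_auth)
        hub = _normalize(new_hub)

    return hub, auth


# Made-up graph: pages "a" and "b" both link to "c", and "c" links to "d".
sample_links = {"a": ["c"], "b": ["c"], "c": ["d"]}
hub_scores, auth_scores = hits(sample_links)
most_authoritative = max(auth_scores, key=auth_scores.get)
print(most_authoritative)
```

In this toy graph the page with the most in-links from good hubs, `c`, ends up with the highest authority score; a real crawl would build `links` from the hyperlinks found on retrieved pages.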
Method of Web Content Mining based on XML

ZHENG Xia, CHEN Jianguo. Method of Web Content Mining based on XML[J]. Journal of Shenyang University, 2012, 24(3): 52-55.

Authors:	ZHENG Xia, CHEN Jianguo

Institution:	1. Department of Computer Science, Minjiang University, Fuzhou 350001, China; 2. Software College, Fujian University of Technology, Fuzhou 350003, China

Abstract:	The characteristics of Web content mining are analyzed, and a Web content mining model based on XML is proposed. The HITS algorithm is used to identify authoritative Web pages, the HTML Tidy tool is used to clean non-XML documents and transform them into well-formed XML documents, and text clustering and classification techniques are used to mine the XML document data. Tested on an example system for automatically extracting traditional scientific papers from the Internet, the model is shown to work well and can extract Web page content automatically and effectively.

Keywords:	Web mining; data mining; text clustering; non-XML documents
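
The abstract describes cleaning non-XML pages and converting them into well-formed XML with HTML Tidy. The sketch below does not call HTML Tidy; it uses only Python's standard library (`html.parser` and `xml.etree.ElementTree`) to illustrate the general idea of closing unclosed tags and emitting well-formed XML. The class `XmlBuilder`, the function `html_to_xml`, the `document` wrapper element, and the sample markup are hypothetical names and choices made for this illustration, and the handling is far less thorough than a real cleaning tool.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class XmlBuilder(HTMLParser):
    """Collect HTML events into an ElementTree, closing tags as needed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []
        # Wrap everything in one root so the output has a single top element.
        self.builder.start("document", {})

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "checked") get their own name as value.
        attributes = {k: (v if v is not None else k) for k, v in attrs}
        self.builder.start(tag, attributes)
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close any elements left open inside the one being closed.
        while self.open_tags:
            current = self.open_tags.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def to_xml(self):
        # Close whatever is still open, then the wrapper element.
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("document")
        return ET.tostring(self.builder.close(), encoding="unicode")


def html_to_xml(html_text):
    parser = XmlBuilder()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return parser.to_xml()


# Sample input with an unclosed <b> and a void <br>.
cleaned = html_to_xml("<p>Hello<br><b>world</p>")
root = ET.fromstring(cleaned)  # parses only if the result is well-formed
print(cleaned)
```

Parsing the result with `ET.fromstring` is a simple well-formedness check; downstream clustering or classification would then operate on the text and element structure of the resulting tree.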
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
