
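Method of Web Content Mining based on XML
Cite this article: ZHENG Xia, CHEN Jianguo. Method of Web Content Mining based on XML[J]. Journal of Shenyang University, 2012, 24(3): 52-55.
Authors: ZHENG Xia  CHEN Jianguo
Affiliations: 1. Department of Computer Science, Minjiang University, Fuzhou, Fujian 350001
2. Software College, Fujian University of Technology, Fuzhou, Fujian 350003
Abstract: Based on an analysis of the characteristics of Web content mining, this paper proposes a Web content mining model built on XML technology. The HITS algorithm is used to identify authoritative Web pages, and the HTML Tidy tool is used to clean non-XML files and convert them into well-formed XML documents. Taking an automatic extraction system for traditional scientific papers on the Internet as an example, text clustering and classification techniques are applied to mine the XML document data. Experimental results show that the model works well and can extract web page content automatically and effectively.

Keywords: Web mining  data mining  text clustering  non-XML documents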

Method of Web Content Mining based on XML
ZHENG Xia,CHEN Jianguo.Method of Web Content Mining based on XML[J].Journal of Shenyang University,2012,24(3):52-55.
Authors:ZHENG Xia  CHEN Jianguo
Institution:1. Department of Computer Science, Minjiang University, Fuzhou 350001, China; 2. Software College, Fujian University of Technology, Fuzhou 350003, China
Abstract:The characteristics of Web content mining are analyzed, and a Web content mining model based on XML is proposed. The HITS algorithm is used to identify authoritative Web pages, the HTML Tidy tool is used to clean non-XML documents and convert them into well-formed XML documents, and text clustering and classification techniques are applied to mine the resulting XML document data. Using an automatic extraction system for traditional scientific papers on the Internet as an example, the model is shown to work well and to extract web page content automatically and effectively.
Keywords:Web Mining  data mining  text clustering  non-XML documents
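
The abstract names HITS as the step that selects authoritative pages but gives no details of how it is applied. As a rough illustration only, the following is a minimal sketch of the standard HITS hub/authority iteration, not the authors' implementation. The function names `hits` and `normalize`, the iteration count, and the toy link graph in the usage note are all assumptions made for this example.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Run the HITS iteration on a link graph.

    links: dict mapping each page to the list of pages it links to
           (a hypothetical input format chosen for this sketch).
    Returns (hub, authority) score dicts.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub score: sum of the updated authority scores of the pages linked to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth
```

For example, with the assumed graph `{"a": ["c"], "b": ["c"], "c": ["d"], "d": []}`, calling `hub, auth = hits(...)` and taking `max(auth, key=auth.get)` picks out the page that most hubs point to. In a pipeline like the one the abstract outlines, such a ranking would decide which pages go on to cleaning and XML conversion.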
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.