
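基于结构树的网页正文内容抽取方法 (A Structure-Tree-Based Method for Extracting the Main Content of Web Pages)
Cite this article: Wei Haiping. A Structure-Tree-Based Method for Extracting the Main Content of Web Pages [J]. Science Technology and Engineering (科学技术与工程), 2011(28).
Author: Wei Haiping (魏海平)
Affiliation: School of Computer and Communication Engineering, Liaoning Petrochemical University (辽宁石油化工大学)
Abstract: Web page text extraction is a data mining technique that is widely used on the Internet. Its main purpose is to extract the topical content of a web page and so provide good base data for Web data mining. This paper improves on an approach based on the tree structure of web pages: the page is first partitioned into blocks, each block is stored in a tree structure, and the topic information is then selected by computing a variance and a threshold over all blocks. Compared with traditional extraction methods based on regular expressions, the method is simple and practical. Experimental results show that its extraction precision exceeds 96%, so it has some practical value.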

Keywords: structure tree; information extraction; web page segmentation
Received: 7/5/2011 3:07:05 PM
Revised: 7/5/2011 3:07:05 PM

The Method of Content Extraction from Webpage based on Structure Tree
Wei Haiping. The Method of Content Extraction from Webpage based on Structure Tree [J]. Science Technology and Engineering, 2011(28).
Author: Wei Haiping
Abstract: Content extraction is a data mining technique that is widely used on the Internet. Its main purpose is to extract the topical content of a web page and provide data for Web data mining. This paper improves on a method based on the web page tree structure. The web page is first divided into blocks, and each block is stored in a tree structure. A variance and a threshold are then computed over all blocks to select the topic information. Compared with traditional methods based on regular expressions, this method is simpler and more practical. Experimental results show that the extraction precision is higher than 96%, so the method has practical value.
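
The pipeline the abstract describes (split the page into blocks, store the blocks in a tree, then select topic blocks by a variance and threshold computation) can be sketched roughly as follows. This is only a minimal illustration and not the paper's implementation: the abstract does not say which per-block quantity the variance is computed over or how the threshold is derived, so the sketch assumes the quantity is the length of the text directly inside each block and that the threshold is the mean length plus k standard deviations. The names Block, BlockTreeBuilder, extract_main_text, BLOCK_TAGS and the parameter k are all hypothetical.

```python
from html.parser import HTMLParser
from statistics import mean, pvariance

# Assumption: these tags delimit "blocks"; the paper's segmentation rule is not given.
BLOCK_TAGS = {"div", "td", "p", "article", "section"}


class Block:
    """One node of the block tree: a block-level element and its direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text fragments found directly inside this block

    def own_text(self):
        return " ".join(self.text)


class BlockTreeBuilder(HTMLParser):
    """Builds a tree of blocks while ignoring script and style content."""

    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.current = self.root
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            node = Block(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.current.tag and self.current.parent is not None:
            # Simplification: mismatched or unclosed tags are not repaired.
            self.current = self.current.parent

    def handle_data(self, data):
        stripped = data.strip()
        if stripped and self.skip_depth == 0:
            self.current.text.append(stripped)


def iter_blocks(node):
    """Depth-first traversal of every block below node."""
    for child in node.children:
        yield child
        yield from iter_blocks(child)


def extract_main_text(html, k=1.0):
    """Return the text of blocks whose length reaches mean + k * std (assumed rule)."""
    builder = BlockTreeBuilder()
    builder.feed(html)
    builder.close()
    blocks = [b for b in iter_blocks(builder.root) if b.text]
    if not blocks:
        return ""
    lengths = [len(b.own_text()) for b in blocks]
    threshold = mean(lengths) + k * pvariance(lengths) ** 0.5
    return "\n".join(b.own_text() for b, n in zip(blocks, lengths) if n >= threshold)
```

Calling extract_main_text(page_html) on a page with a short navigation bar, one long article block, and a short footer would keep only the long block under these assumptions. Lowering k makes the selection more permissive, and real pages would likely need a more careful block definition than a fixed tag set.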
Keywords: Structure Tree; Information Extraction; Page Segmentation