
Web内容抽取及其数据管理方法
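Cite this article:张成洪,肖军建,张诚.Web内容抽取及其数据管理方法[J].复旦学报(自然科学版),2001,40(2):177-183.
Authors:张成洪 (ZHANG Cheng-hong)  肖军建 (XIAO Jun-jian)  张诚 (ZHANG Cheng)
Affiliation:School of Management, Fudan University (复旦大学管理学院)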
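Abstract (translated from the Chinese):With the rapid development of the Internet and related technologies, the WWW has become the largest hub for collecting and distributing information, and the Web is gradually becoming the primary information source for enterprises and individuals alike. However, the sheer number of websites and the resulting information overload make useful information increasingly hard to obtain. Search engines only narrow down where to look; the specific content still has to be found by detailed manual searching. Moreover, Web page information is unstructured or semi-structured, so analysis tools cannot be applied to it directly. It is therefore necessary to provide a method that automatically extracts Web page content and structures Web page data, in order to simplify information acquisition and facilitate information analysis and processing.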

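Keywords (translated from the Chinese):data extraction  Web page wrapper  regular expression  pattern matching  Internet  WWW  Web data integration system  data management  Web page data structuring
Article ID:0427-7104(2001)02-0177-07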

Web Content Extraction & Its Data Management Method
ZHANG Cheng-hong,XIAO Jun-jian,ZHANG Cheng.Web Content Extraction & Its Data Management Method[J].Journal of Fudan University(Natural Science),2001,40(2):177-183.
Authors:ZHANG Cheng-hong  XIAO Jun-jian  ZHANG Cheng
Abstract:With the development of the Internet and related technologies, the WWW has become the largest information hub. For enterprises and individuals alike, the Web is gradually becoming the main information source. However, the sheer number of websites and the resulting information overload make it increasingly difficult to obtain useful information. Search engines only narrow the scope of a search; the specific information must still be looked up manually. Because Web information is unstructured or semi-structured, analysis tools cannot be applied to it directly. It is therefore necessary to propose a method that automatically extracts Web content and structures Web data, so as to simplify information acquisition and facilitate information analysis. This paper describes such a method in detail.
Keywords:data extraction  Web wrapper  regular expression  semi-structured  pattern matching
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.