Web Pages Data Extraction Based on Tree Automata

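Cite this article:	WANG Ru, SONG Han-tao, LU Yu-chang. Web Pages Data Extraction Based on Tree Automata[J]. Journal of Beijing Institute of Technology (Natural Science Edition), 2004, 24(9): 790-793.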

Authors:	WANG Ru, SONG Han-tao, LU Yu-chang

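Affiliations:	Department of Computer Science and Engineering, School of Information Science and Technology, Beijing Institute of Technology, Beijing 100081; State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing 100084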

Funding:	National Key Basic Research Development Program of China (973 Program)

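Abstract:	To extract data from HTML Web pages automatically, a tree automata inference approach is adopted. The core idea is to convert the sample pages into binary trees, construct a tree automaton that accepts these binary trees, and then extract data according to whether that automaton accepts or rejects the page to be processed. The method makes full use of the inherent tree structure of HTML documents and introduces a simple, convenient format for annotating sample pages. Experiments show that its extraction performance is better than that of some other data extraction methods in terms of recall and F-score.
Keywords:	data extraction; tree automata; Web pages; HTML
Article ID:	1001-0645(2004)09-0790-04
Received:	2003-10-10
Revised:	2003-10-10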
Web Pages Data Extraction Based on Tree Automata

WANG Ru, SONG Han-tao and LU Yu-chang. Web Pages Data Extraction Based on Tree Automata[J]. Journal of Beijing Institute of Technology (Natural Science Edition), 2004, 24(9): 790-793.

Authors:	WANG Ru, SONG Han-tao and LU Yu-chang

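Institution:	WANG Ru (1), SONG Han-tao (1), LU Yu-chang (2); 1. Department of Computer Science and Engineering, School of Information Science and Technology, Beijing Institute of Technology, Beijing 100081; 2. State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing 100084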

Abstract:	To extract data from HTML Web pages automatically, tree automata induction is applied to data extraction. The key idea is to transform the example pages into binary trees, construct a tree automaton that accepts the binary trees of the example pages, and use that automaton to extract data according to whether it accepts or rejects the page being processed. The method exploits the native tree structure of HTML documents and introduces a new, simple form of labeling the example pages. Experimental results on data sets show that the tree automata approach compares favorably with some other approaches in F-score and recall.

Keywords:	data extraction; tree automata; Web pages; HTML
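
The abstract describes two steps: turning a page's tag tree into a binary tree, and running a tree automaton that accepts or rejects it. The record does not give the paper's actual encoding or inference algorithm, so the following is only a minimal sketch under stated assumptions: it assumes the common first-child/next-sibling binary encoding, parses well-formed markup with Python's `xml.etree.ElementTree`, and uses a hypothetical `TreeAutomaton` class that merely memorizes the transitions seen in the sample pages. It does not perform the generalization that automata induction would, and it does not show how extracted fields are read off.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BinNode:
    label: str
    left: Optional["BinNode"] = None   # first child in the original tree
    right: Optional["BinNode"] = None  # next sibling in the original tree


def to_binary(siblings):
    """Encode a list of sibling elements as a first-child/next-sibling binary tree."""
    if not siblings:
        return None
    head, rest = siblings[0], siblings[1:]
    return BinNode(head.tag, to_binary(list(head)), to_binary(rest))


def page_to_binary(markup):
    """Parse well-formed markup and encode its tag tree (text content is ignored)."""
    return to_binary([ET.fromstring(markup)])


class TreeAutomaton:
    """Bottom-up deterministic tree automaton over binary trees (illustrative only)."""

    def __init__(self):
        self.delta = {}         # (label, left_state, right_state) -> state
        self.accepting = set()  # states allowed at the root

    def _state(self, node, learn):
        if node is None:
            return 0  # state 0 is reserved for the empty subtree
        left = self._state(node.left, learn)
        right = self._state(node.right, learn)
        if left is None or right is None:
            return None  # a subtree had no matching transition
        key = (node.label, left, right)
        if key not in self.delta:
            if not learn:
                return None
            self.delta[key] = len(self.delta) + 1
        return self.delta[key]

    def add_example(self, tree):
        """Record every transition of a sample tree and accept its root state."""
        self.accepting.add(self._state(tree, learn=True))

    def accepts(self, tree):
        state = self._state(tree, learn=False)
        return state is not None and state in self.accepting


automaton = TreeAutomaton()
automaton.add_example(page_to_binary("<table><tr><td>a</td><td>b</td></tr></table>"))

same_shape = page_to_binary("<table><tr><td>x</td><td>y</td></tr></table>")
other_shape = page_to_binary("<table><tr><td>x</td></tr></table>")
print(automaton.accepts(same_shape))   # True: identical tag structure
print(automaton.accepts(other_shape))  # False: no transition for a one-cell row
```

Because this sketch only memorizes, a row with three cells would also be rejected; an induced automaton would be expected to generalize over such repetitions, which is the part the abstract attributes to the inference step.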
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.