Web Information Extraction Based on Conditional Random Fields

Cite this article:	SHI Qing-wei, ZHAO Zheng, BAO Hu. Web information extraction based on conditional random fields[J]. Journal of Liaoning Technical University (Natural Science Edition), 2007, 26(4): 570-572.

Authors:	SHI Qing-wei, ZHAO Zheng, BAO Hu

Affiliations:	1. School of Computer Science and Technology, Tianjin University, Tianjin 300072; School of Software, Liaoning Technical University, Huludao, Liaoning 125105. 2. School of Computer Science and Technology, Tianjin University, Tianjin 300072. 3. School of Computer Science and Technology, Tianjin University, Tianjin 300072; Department of Electronic and Information Engineering, Naval Aeronautical Engineering Institute, Yantai, Shandong 264001.

Funding:	Supported by the Tianjin Science and Technology Development Plan Fund (07JCZDJC067007)

Abstract:	To obtain information hidden on the Internet, a Web information extraction method based on the conditional random field (CRF) model is proposed. The method labels each line of the sample web pages, determines the text features, and builds a CRF model. A quasi-Newton iterative method is used to train the model on the samples, and the learned conditional probability distribution is then used to extract web search results. Unlike the hidden Markov model (HMM), the CRF model supports the use of linguistic features of web page text, so it achieves higher extraction precision. Experimental results show that the precision obtained with the CRF model exceeds 90%, which is higher than that obtained with the HMM.

Keywords:	conditional random fields; information extraction; Web documents; quasi-Newton method
Article ID:	1008-0562(2007)04-0570-03
Revised:	2006-04-12
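
The abstract describes labeling each line of a page and extracting results with a learned conditional probability model. A minimal sketch of that idea for a linear-chain CRF follows. It is not the authors' implementation: the label set, the line features, and the weights are all hypothetical, and the weights are set by hand instead of being learned with a quasi-Newton method. The sketch only shows how a labeling of the lines is scored, normalized into a conditional probability, and decoded.

```python
import math

# Hypothetical label set for the lines of a search-result page (not from the paper).
LABELS = ["TITLE", "URL", "OTHER"]

def line_features(line):
    """Map one line of page text to a set of binary feature names (illustrative only)."""
    feats = set()
    if line.startswith(("http://", "https://")):
        feats.add("looks_like_url")
    if line.istitle():
        feats.add("title_case")
    if len(line.split()) <= 2:
        feats.add("short")
    return feats

# Hand-picked weights standing in for parameters that training would learn.
STATE_W = {
    ("looks_like_url", "URL"): 3.0,
    ("title_case", "TITLE"): 2.0,
    ("short", "OTHER"): 1.0,
}
TRANS_W = {("TITLE", "URL"): 1.5, ("URL", "OTHER"): 0.5}

def state_score(feats, label):
    """Sum of the weights of the features that fire for this label."""
    return sum(STATE_W.get((f, label), 0.0) for f in feats)

def trans_score(prev_label, label):
    return TRANS_W.get((prev_label, label), 0.0)

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(lines, labels):
    """Unnormalized log-score of one labeling of the whole page."""
    total = 0.0
    prev = None
    for line, label in zip(lines, labels):
        total += state_score(line_features(line), label)
        if prev is not None:
            total += trans_score(prev, label)
        prev = label
    return total

def log_partition(lines):
    """Forward algorithm: log of the summed exp(score) over all labelings."""
    alpha = {y: state_score(line_features(lines[0]), y) for y in LABELS}
    for line in lines[1:]:
        feats = line_features(line)
        alpha = {
            y: state_score(feats, y)
            + logsumexp([alpha[p] + trans_score(p, y) for p in LABELS])
            for y in LABELS
        }
    return logsumexp(list(alpha.values()))

def conditional_probability(lines, labels):
    """p(labels | lines) under the linear-chain model."""
    return math.exp(sequence_score(lines, labels) - log_partition(lines))

def viterbi(lines):
    """Return the highest-scoring label sequence for the page."""
    feats = [line_features(line) for line in lines]
    best = {y: state_score(feats[0], y) for y in LABELS}
    back = []
    for f in feats[1:]:
        new_best, ptr = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[p] + trans_score(p, y))
            new_best[y] = best[prev] + trans_score(prev, y) + state_score(f, y)
            ptr[y] = prev
        best = new_best
        back.append(ptr)
    path = [max(LABELS, key=lambda y: best[y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Usage: decode a made-up three-line page fragment.
page = ["Conditional Random Fields", "http://example.org/crf", "cached"]
decoded = viterbi(page)
print(decoded, conditional_probability(page, decoded))
```

Because the score is normalized over whole label sequences, each line's features can be arbitrary and overlapping. Training would replace the hand-set weights by maximizing the conditional log-likelihood of labeled sample pages, for example with a quasi-Newton optimizer as the abstract mentions.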
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
