An Algorithm for Information Extraction from Semi-Structured Web Pages Based on DOM

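Cite this article:	LI Wei-dong. Information extraction from semi-structured WEB page based on DOM tree [J]. Journal of The Hebei Academy of Sciences, 2009, 26(1): 21-24.

Author:	LI Wei-dong

Affiliation:	College of Information Technology, Hebei University of Economics and Business, Shijiazhuang, Hebei 050061

Funding:	Hebei Province Science and Technology Research and Development Program project; Hebei Province Science and Technology Development Plan project; Hebei Province Science and Technology Development Plan project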

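Abstract:	To extract data records automatically from different semi-structured web pages, this paper proposes discovering the record pattern based on the DOM and the maximum similarity of record subtrees. The approach filters information noise effectively and identifies records correctly even when their patterns differ to some extent. On this basis, the IESS algorithm for automatic extraction from multi-record web pages is implemented. The system can automatically retrieve result pages from multiple academic paper search websites and extract the records they contain. Experiments on common paper search websites show that the system is effective and accurate.
Keywords:	DOM; information extraction; semi-structured; information integration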
Information extraction from semi-structured WEB page based on DOM tree

LI Wei-dong. Information extraction from semi-structured WEB page based on DOM tree [J]. Journal of The Hebei Academy of Sciences, 2009, 26(1): 21-24.

Authors:	LI Wei-dong

Institution:	College of Information and Technology, Hebei University of Economics and Business, Shijiazhuang, Hebei 050061, China

Abstract:	To extract information automatically from semi-structured web pages, this paper puts forward a method named IESS for discovering the record model based on DOM and the Maximal Similar Sub Tree. The method can identify records automatically and correctly when there are some differences in the expression models of records that belong to the same type. Furthermore, the system can automatically extract information from the result pages of paper search websites. Application experiments showed that this system has high efficiency a...

Keywords:	DOM
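
The abstract describes finding records by comparing the similarity of DOM subtrees, but it does not give the details of IESS. The following is only a minimal sketch of that general idea, not the paper's algorithm: it builds a small DOM-like tree with Python's standard `html.parser`, flattens each subtree into its sequence of tag names, and treats a run of adjacent sibling subtrees whose sequences are similar enough as the record list. The similarity measure (`difflib.SequenceMatcher.ratio`), the `threshold` and `min_records` values, the sample page, and all class and function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None, data=""):
        self.tag, self.parent, self.data = tag, parent, data
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree; assumes reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(Node("#text", self.current, data.strip()))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def signature(node):
    """Pre-order list of tag names in the subtree rooted at node."""
    tags = [node.tag]
    for child in node.children:
        tags.extend(signature(child))
    return tags


def similarity(a, b):
    # Assumed stand-in for the paper's subtree similarity measure.
    return SequenceMatcher(None, signature(a), signature(b)).ratio()


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_records(root, threshold=0.7, min_records=3):
    """Return the run of similar adjacent siblings that covers the most nodes."""
    best, best_size = [], 0
    for parent in iter_nodes(root):
        elems = [c for c in parent.children if c.tag != "#text"]
        runs, run = [], elems[:1]
        for prev, cur in zip(elems, elems[1:]):
            if similarity(prev, cur) >= threshold:
                run.append(cur)
            else:
                runs.append(run)
                run = [cur]
        runs.append(run)
        for run in runs:
            size = sum(len(signature(n)) for n in run)
            if len(run) >= min_records and size > best_size:
                best, best_size = run, size
    return best


def text_of(node):
    """Text fragments of a subtree, in document order."""
    if node.tag == "#text":
        return [node.data]
    return [t for child in node.children for t in text_of(child)]


PAGE = """
<html><body>
  <div class="nav"><a href="/">Home</a> | <a href="/help">Help</a></div>
  <ul>
    <li><a href="/p/1">Paper A</a><span>2007</span></li>
    <li><a href="/p/2">Paper B</a><span>2008</span><em>cited</em></li>
    <li><a href="/p/3">Paper C</a><span>2009</span></li>
  </ul>
  <div class="footer">Contact</div>
</body></html>
"""

for record in find_records(parse(PAGE)):
    print(text_of(record))
```

In this sample page the second `<li>` carries an extra `<em>` element, so the three records are similar but not identical, which is the situation the abstract says the method should tolerate. The navigation bar and footer act as noise and fall below the threshold when compared with the list.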
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
