
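Dynamic Web Information Extraction Algorithm Based on Sequence Alignment
Cite this article: ZHAO Gang, GUO Dong-wei, LI Dan. Dynamic Web Information Extraction Algorithm Based on Sequence Alignment[J]. Journal of Jilin University (Science Edition), 2010, 48(3): 421-426.
Authors: ZHAO Gang  GUO Dong-wei  LI Dan
Affiliations: 1. College of Network Education, Jilin University, Changchun 130022; 2. College of Computer Science and Technology, Jilin University, Changchun 130012; 3. School of Continuing Education, Dalian University of Technology, Dalian 116011, Liaoning
Funding: Science and Technology Development Program of Jilin Province
Abstract: Based on a definition of the common framework of Deep Web pages, a method is proposed that adds a common framework detection phase to the information extraction algorithm and uses a sequence alignment algorithm to extract the common framework. Compared with the raw page data, the data-region information that remains after the common framework is removed is more favorable for template extraction. On collections of data-intensive Web pages from real websites, the effects of different parameter values in the sequence alignment algorithm, and of the common framework detection phase, on the information extraction algorithm were tested and compared in terms of data volume and extraction accuracy. The experimental results show that the algorithm is effective.

Keywords: Web information extraction    sequence alignment    common framework detection
Received: 2009-02-17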

Dynamic Web Information Extraction Based on Sequence Alignment
ZHAO Gang, GUO Dong-wei, LI Dan. Dynamic Web Information Extraction Based on Sequence Alignment[J]. Journal of Jilin University: Sci Ed, 2010, 48(3): 421-426.
Authors:ZHAO Gang  GUO Dong-wei  LI Dan
Institution:1. College of Network Education, Jilin University, Changchun 130022, China; 2. College of Computer Science and Technology, Jilin University, Changchun 130012, China; 3. School of Continuing Education, Dalian University of Technology, Dalian 116011, Liaoning Province, China
Abstract:Based on a “common framework”, defined as the information that is irrelevant to the core content of Web pages and shared by Web pages from the same source, sequence alignment was adopted in the information extraction algorithm to detect the common framework. After the common frameworks are eliminated from the Web pages, the remaining data fields are more suitable for information extraction. On data-intensive Web pages from real-world websites, the effects of the alignment parameter values on the extraction results, and the effects of the common framework detection phase on reducing data quantity and increasing extraction accuracy, were tested and evaluated. The experimental results demonstrate the validity of the approach.
Keywords:Web information extraction  sequence alignment  common framework detection  
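
The abstract does not give the alignment algorithm or its parameters in detail, so the following is only a minimal sketch of the general idea, under stated assumptions: each page is already split into a list of tokens (tags and text), a standard Needleman-Wunsch global alignment is used, and every token that aligns identically across two pages from the same source is treated as common framework. The function names (align, common_framework, strip_framework) and the scoring values (match, mismatch, gap) are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def align(a: List[str], b: List[str],
          match: int = 2, mismatch: int = -1, gap: int = -1) -> List[Tuple[int, int]]:
    """Needleman-Wunsch global alignment of two token sequences.

    Returns the index pairs (i, j) whose tokens are aligned AND equal.
    The scoring values are illustrative assumptions, not the paper's.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, collecting identical aligned tokens.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            if a[i - 1] == b[j - 1]:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def common_framework(a: List[str], b: List[str], **params: int) -> List[str]:
    """Tokens shared by both pages in aligned positions (assumed framework)."""
    return [a[i] for i, _ in align(a, b, **params)]


def strip_framework(page: List[str], other: List[str], **params: int) -> List[str]:
    """Tokens of `page` that are left once the assumed framework is removed."""
    matched = {i for i, _ in align(page, other, **params)}
    return [tok for i, tok in enumerate(page) if i not in matched]
```

For two pages that differ only in one data field, strip_framework would return just that field. A realistic detector would presumably compare more than two pages and keep the repeated markup inside the data region for template extraction; this sketch makes no such distinction.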
This article is indexed in databases including CNKI and Wanfang Data.