
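Research on Website Structure Tree Generation Methods for Information Extraction
Cite this article: ZHU Ying, QU You-li, CHEN Yi, SUN Yue-hong. Research on Website Structure Tree Generation Methods for Information Extraction[J]. Journal of Beijing Technology and Business University (Natural Science Edition), 2006, 24(5): 54-58.
Authors: ZHU Ying  QU You-li  CHEN Yi  SUN Yue-hong
Affiliations: 1. College of Computer Science, Beijing Technology and Business University, Beijing 100037
2. College of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
Abstract: With the development and spread of Internet technology, the volume of information on the Web has grown sharply, making information extraction more challenging. Starting from the topology of a website, this paper proposes an algorithm for generating the website structure tree used in information extraction. The algorithm first removes some of the backtracking edges from the website structure graph, according to the hierarchy of the directories in which the page nodes' URLs reside. It then discards duplicate nodes that have already been visited during a breadth-first traversal, producing the website structure tree. Finally, edit distance is introduced to evaluate how similar the generated structure tree is to the actual website structure tree; the similarity between the two trees is high, exceeding 90% in all cases. The generated structure tree can be used to cluster the website's content pages (the leaf nodes of the structure tree) before information extraction, which greatly improves the precision and recall of extraction.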

Keywords: information extraction  website  structure graph  structure tree  edit distance
Article ID: 1671-1513(2006)05-0054-05
Received: 2006-07-05
Revised: 2006-07-05

RESEARCH ON SPANNING TREE OF WEBSITE STRUCTURE FOR INFORMATION EXTRACTION
ZHU Ying, QU You-li, CHEN Yi, SUN Yue-hong. RESEARCH ON SPANNING TREE OF WEBSITE STRUCTURE FOR INFORMATION EXTRACTION[J]. Journal of Beijing Technology and Business University: Natural Science Edition, 2006, 24(5): 54-58.
Authors:ZHU Ying  QU You-li  CHEN Yi  SUN Yue-hong
Institution: 1. College of Computer Science and Technology, Beijing Technology and Business University, Beijing 100037, China; 2. College of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
Abstract: With the exponential growth of information on the Web, information extraction is becoming more and more challenging. Starting from the topology of a website, this paper presents an algorithm that generates a spanning tree of the website structure for information extraction. The algorithm first removes some of the backtracking edges from the website structure graph according to the hierarchy of URL directories, and then discards nodes that have already been visited during a breadth-first traversal. Finally, edit distance is introduced to evaluate the similarity between the structure tree generated by the algorithm and the actual structure tree; the similarity exceeds 90%. The generated tree can be used to cluster the content pages before extraction, improving precision and recall.
Keywords: information extraction  website  structure graph  structure tree  edit distance
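
The two steps named in the abstract (pruning backtracking edges by URL directory level, then a breadth-first traversal that skips already-visited nodes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not define exactly which edges count as backtracking edges, so the sketch assumes an edge is dropped when it points to a page in a strictly shallower directory. The names `dir_depth`, `build_structure_tree`, and the adjacency-dict input format are hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


def dir_depth(url):
    """Count the directory segments in a URL path (file name excluded)."""
    path = urlparse(url).path
    directory = path.rsplit("/", 1)[0]  # drop the final file-name component
    return len([seg for seg in directory.split("/") if seg])


def build_structure_tree(graph, root):
    """Turn a site link graph {url: [linked urls]} into a tree {url: [children]}.

    Assumption (not from the paper): a "backtracking edge" is a link whose
    target lies in a strictly shallower directory than its source.
    """
    # Step 1: remove edges that point back up the directory hierarchy.
    pruned = {
        src: [dst for dst in dsts if dir_depth(dst) >= dir_depth(src)]
        for src, dsts in graph.items()
    }

    # Step 2: breadth-first traversal; a node reached a second time is a
    # duplicate and is skipped, so every node ends up with one parent.
    tree = {root: []}
    visited = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in pruned.get(node, []):
            if child in visited:
                continue
            visited.add(child)
            tree[node].append(child)
            tree[child] = []
            queue.append(child)
    return tree
```

In the resulting dictionary, nodes with an empty child list are the leaves, which correspond to the content pages the abstract proposes to cluster.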
This article is indexed in CNKI, VIP (维普), Wanfang Data, and other databases.