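A new model for Web URL extraction
Cite this article: SU Hang, YAN Jianyuan. A new model for Web URL extraction [J]. Journal of Tsinghua University (Science and Technology), 2006, 46(Z1): 975-982
Authors: SU Hang, YAN Jianyuan
Affiliations: 1. EECS Department, Vanderbilt University, Nashville, TN 37235, USA
2. School of Business, Nankai University, Tianjin 300071, China
Abstract: Targeting the fault tolerance, correctness, completeness, efficiency, and extensibility required of a search engine's link extraction module, this paper proposes a design for a new link extraction model. The model divides link extraction into four stages: information extraction, information processing, information analysis, and information storage. Information extraction obtains the initial uniform resource identifier (URI) data from a document through HTML (hypertext markup language) grammar analysis; the information processing stage refines the initial data by applying a URI parsing algorithm; the information analysis stage then screens and filters the data further; finally, the results are stored in a flexible data structure. Comparative tests confirm that the new link extraction model has a clear advantage over traditional methods on every metric.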

Keywords: search engine; link extraction; uniform resource identifier (URI)
Article ID: 1000-0054(2006)S1-0975-08
Revised manuscript received: 2006-02-28

A new model for Web URL extraction
SU Hang, YAN Jianyuan. A new model for Web URL extraction [J]. Journal of Tsinghua University (Science and Technology), 2006, 46(Z1): 975-982
Authors: SU Hang, YAN Jianyuan
Abstract: This paper summarizes the basic design objectives for the uniform resource locator (URL) extractor module in web mining: robustness, correctness, completeness, efficiency, and extensibility. With these objectives in mind, the paper analyzes the weaknesses of the original design and proposes a new, more powerful design model. The new method divides URL extraction into four steps: information extraction, information refinement, information analysis, and information storage. During information extraction, the raw uniform resource identifier (URI) data are extracted from the HTML source by grammar-based (BNF) parsing of the HTML. The raw data are refined and normalized in the information refinement step, then validated and filtered in the information analysis step. Finally, the remaining correct and useful URI information is stored in a flexible and highly extensible data structure. A comparison test shows that the new model performs better than the traditional methods.
Keywords: search engine; uniform resource locator (URL) extractor; uniform resource identifier (URI)
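
The four steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give their parser, URI parsing algorithm, filter rules, or storage structure, so every name below (`RawLinkParser`, `refine`, `accept`, `LinkStore`, `extract_links`, the `URI_ATTRS` table) is a hypothetical stand-in, and the standard library's lenient `html.parser` and `urllib.parse` are swapped in for the paper's BNF-based parsing and URI parsing algorithm.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit

# Tag -> attribute holding a URI reference (illustrative subset, an assumption).
URI_ATTRS = {"a": "href", "link": "href", "img": "src",
             "script": "src", "frame": "src", "iframe": "src"}


class RawLinkParser(HTMLParser):
    """Step 1, information extraction: collect raw URI strings from tags."""

    def __init__(self):
        super().__init__()
        self.base = None   # first <base href="..."> seen, if any
        self.raw = []      # list of (tag, raw_uri) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and self.base is None and attrs.get("href"):
            self.base = attrs["href"]
        wanted = URI_ATTRS.get(tag)
        if wanted and attrs.get(wanted):
            self.raw.append((tag, attrs[wanted]))


def refine(raw_uri, base_url):
    """Step 2, information refinement: resolve to absolute, drop the fragment."""
    absolute = urljoin(base_url, raw_uri.strip())
    return urldefrag(absolute).url


def accept(url):
    """Step 3, information analysis: keep only crawlable http(s) URLs."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


@dataclass
class LinkStore:
    """Step 4, information storage: URL -> set of tags it was found in."""
    links: dict = field(default_factory=dict)

    def add(self, url, tag):
        self.links.setdefault(url, set()).add(tag)


def extract_links(html, page_url):
    parser = RawLinkParser()
    parser.feed(html)   # html.parser tolerates unclosed or malformed tags
    parser.close()
    base = urljoin(page_url, parser.base) if parser.base else page_url
    store = LinkStore()
    for tag, raw_uri in parser.raw:
        try:
            url = refine(raw_uri, base)
            if accept(url):
                store.add(url, tag)
        except ValueError:
            continue    # skip URIs that urllib cannot parse
    return store
```

Calling `extract_links(html, page_url)` on a page returns a `LinkStore` whose `links` dictionary holds each distinct absolute URL once, so duplicates and non-HTTP schemes such as `mailto:` are dropped before storage. Keeping the steps as separate functions mirrors the staged design the abstract describes, so that a filter rule or storage layout can be changed without touching the parser.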
This article is indexed in CNKI, Wanfang Data, and other databases.