Object Extraction Based on Spatial-Relation of Entities from the World Wide Web

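Cite this article:	HAO Jing-min, LIAO Le-jian, HE Di. Object Extraction Based on Spatial-Relation of Entities from the World Wide Web[J]. Journal of Beijing Institute of Technology (Natural Science Edition), 2010, 30(2): 188-192.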

Authors:	HAO Jing-min, LIAO Le-jian, HE Di

Affiliation:	Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081 (all three authors)

Funding:	Supported by the National Natural Science Foundation of China (60873237)

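Abstract:	On Web pages, the spatial distance between information components within the same object is smaller than the distance between information components of different objects. Exploiting this display characteristic, a new Web object extraction method is proposed. The method decides which information components belong to the same object by analyzing the spatial position relationships among the entities in a given page, independent of how the Web document is represented. The positional relationships among information components are obtained from the page's Document Object Model (DOM) and are then used to decide whether those components belong to the same object. Experimental results show that the method adapts well to Web documents of different structures across multiple domains. For pages with a regular design structure that contain multiple data objects, extraction accuracy can reach 100%.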
Keywords:	information retrieval; Web object; object extraction; spatial relation
Received:	2009-01-15
Object Extraction Based on Spatial-Relation of Entities from the World Wide Web

HAO Jing-min, LIAO Le-jian and HE Di. Object Extraction Based on Spatial-Relation of Entities from the World Wide Web[J]. Journal of Beijing Institute of Technology (Natural Science Edition), 2010, 30(2): 188-192.

Authors:	HAO Jing-min, LIAO Le-jian and HE Di

Institution:	Beijing Laboratory of Intelligent Information Technology, School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China

Abstract:	The spatial distance between components within one object is always less than that between different objects in Web pages. A novel method of object extraction from the World Wide Web is reported. The proposed method mainly considers the layout characteristics of Web content and is independent of the underlying document representation, such as HTML code. The method uses the document object model (DOM) to obtain the bounding box of various kinds of Web information, such as images, text or links. Then the distance ...

Keywords:	information retrieval; Web object; object extraction; spatial configuration
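
The grouping idea in the abstract (components of one object lie closer together than components of different objects) can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the distance measure or the decision rule, so the edge-to-edge gap, the fixed threshold, the single-linkage grouping, the Box type and the sample coordinates are all assumptions made for illustration. The bounding boxes are assumed to have been read from a rendered DOM beforehand.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import combinations
from math import hypot


@dataclass(frozen=True)
class Box:
    """Rendered bounding box of one information component (hypothetical type)."""
    label: str
    x: float  # left edge, in pixels
    y: float  # top edge, in pixels
    w: float  # width
    h: float  # height


def gap(a: Box, b: Box) -> float:
    """Distance between the closest edges of two boxes (0 if they overlap)."""
    dx = max(a.x - (b.x + b.w), b.x - (a.x + a.w), 0.0)
    dy = max(a.y - (b.y + b.h), b.y - (a.y + a.h), 0.0)
    return hypot(dx, dy)


def group_by_distance(boxes: list[Box], threshold: float) -> list[list[str]]:
    """Assumed rule: boxes whose gap is below `threshold` belong to one object.

    Groups are built transitively (single linkage) with a small union-find.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(boxes)), 2):
        if gap(boxes[i], boxes[j]) < threshold:
            parent[find(i)] = find(j)

    groups: dict[int, list[str]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box.label)
    return sorted(groups.values())


# Made-up layout: two product-like records stacked vertically.
page = [
    Box("img-1", 0, 0, 100, 100),
    Box("title-1", 110, 0, 200, 20),
    Box("price-1", 110, 30, 80, 20),
    Box("img-2", 0, 160, 100, 100),
    Box("title-2", 110, 160, 200, 20),
    Box("price-2", 110, 190, 80, 20),
]
groups = group_by_distance(page, threshold=30)
print(groups)
```

In this made-up layout the gaps inside each record are 10 pixels and the smallest gap between the two records is 60 pixels, so any threshold between those values separates them. A fixed threshold is the simplest possible choice; real pages would likely need one derived from the page itself.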
This article is indexed in CNKI, Wanfang Data and other databases.