A New Method for Extracting Main-Text Content from Web Pages

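Cite this article:	SONG Mingqiu, ZHANG Ruixue, WU Xintao, LI Wenli. A new approach to content extraction from web page[J]. Journal of Dalian University of Technology, 2009, 49(4): 594-597.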

Authors:	SONG Mingqiu (宋明秋), ZHANG Ruixue (张瑞雪), WU Xintao (吴新涛), LI Wenli (李文立)

Affiliation:	Institute of Systems Engineering, Dalian University of Technology, Dalian 116024, Liaoning, China

Funding:	National Natural Science Foundation of China (grant 70671016)

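Abstract:	Wrapper-based information extraction methods can handle only one specific information source and depend heavily on page structure. To address this, the paper proposes a web page analysis method that treats Chinese punctuation marks and the HTML tree structure as the key features for identifying the main text of a page. Counting Chinese punctuation marks identifies part of the main text, and the remaining main text is then identified by its structural similarity to the text already found. Experimental results show that the method effectively removes page noise and extracts the main text, with good generality and high accuracy.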
Keywords:	wrapper; HTML tree; web information extraction
A new approach to content extraction from web page

SONG Mingqiu, ZHANG Ruixue, WU Xintao, LI Wenli. A new approach to content extraction from web page[J]. Journal of Dalian University of Technology, 2009, 49(4): 594-597.

Authors:	SONG Mingqiu, ZHANG Ruixue, WU Xintao, LI Wenli

Institution:	Institute of Systems Engineering, Dalian University of Technology, Dalian 116024, China

Abstract:	Wrapper-based data extraction is limited to one specific information source and depends heavily on web page structure. A new web page analysis method is proposed that recognizes web page content from the number of Chinese punctuation marks and the HTML tree structure. It can effectively eliminate noise and extract content from web pages. Part of the content is identified by Chinese punctuation marks, while the rest is found through its structural similarity to the content already identified. Experimental results show that this m...

Keywords:	wrapper; HTML tree; web information extraction
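
The two-step idea in the abstract (seed the main text by counting Chinese punctuation, then extend it by structural similarity in the HTML tree) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives neither the punctuation threshold nor the similarity measure, so the punctuation set `CN_PUNCT`, the threshold `min_punct`, and the use of an identical root-to-node tag path as the "similarity" test are all assumptions made here for illustration, as are the names `TextBlockCollector` and `extract_content`.

```python
from html.parser import HTMLParser

# Assumed set of full-width marks that signal running Chinese text.
CN_PUNCT = set("，。、；：？！")
# Tags without a closing tag; they must not be pushed on the open-tag stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Text inside these tags is never page content.
SKIP_TAGS = {"script", "style"}


class TextBlockCollector(HTMLParser):
    """Collect (tag_path, text) pairs, one per non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.blocks = []  # (tag_path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (set(self.stack) & SKIP_TAGS):
            self.blocks.append(("/".join(self.stack), text))


def extract_content(html, min_punct=2):
    """Return text blocks judged to be main content (hypothetical helper)."""
    parser = TextBlockCollector()
    parser.feed(html)
    parser.close()
    # Step 1: blocks with enough Chinese punctuation become content seeds.
    seed_paths = {
        path for path, text in parser.blocks
        if sum(ch in CN_PUNCT for ch in text) >= min_punct
    }
    # Step 2: blocks sharing a tag path with any seed are accepted too
    # (assumed stand-in for the paper's structural-similarity criterion).
    return [text for path, text in parser.blocks if path in seed_paths]
```

Calling `extract_content(html_string)` returns the selected text blocks in document order. Navigation links and footers tend to be short and punctuation-free, so they are not seeded. A short paragraph without punctuation is still kept when it sits at the same tag path as a seeded paragraph.
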
This article is indexed in CNKI, Wanfang Data, and other databases.