Content extraction from web page based on the DOM tree and line-text statistical noise-elimination

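Cite this article:	LI Xia, JIANG Sheng-yi. Content extraction from web page based on the DOM tree and line-text statistical noise-elimination [J]. Journal of Shandong University (Natural Science), 2012, 47(3): 38-42.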

Authors:	LI Xia, JIANG Sheng-yi

Affiliation:	Cisco School of Informatics, Guangdong University of Foreign Studies, Guangzhou 510006, Guangdong, China

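Funding:	National Natural Science Foundation of China (61070061); Youth Fund for Humanities and Social Sciences Research, Ministry of Education (11YJCZH086); Guangzhou Social Science Youth Fund (11Q20)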

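Abstract:	The source text of each web page is first converted to a uniform UTF encoding; the HTML document is then converted to an XML document and parsed into a DOM tree. Noise nodes in the DOM tree are filtered out according to features of the XML language and noise-feature rules. The main content of the page is then extracted using Chinese punctuation statistics, and noise remaining in the extracted content is removed using line-text statistics, which yields the final main text. Experiments on 2,000 web pages from mainstream and non-mainstream Chinese and English news sites with completely different structures show that the proposed method achieves high extraction accuracy, generalizes well, and is simple to implement, which makes it suitable for automatically collecting news text from different web sites on the Internet.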
Keywords:	web page text extraction; DOM tree; line-text statistics; punctuation statistics
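
The first two stages described in the abstract (dropping noise nodes from the DOM tree, then choosing the content block by punctuation statistics) can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual rules, so everything specific here is an assumption made for illustration: the `NOISE_TAGS`, `CONTAINER_TAGS`, and `BLOCK_BREAK_TAGS` sets, the punctuation set, and the rule "pick the container whose own text holds the most punctuation marks". The input is assumed to be an already decoded string with balanced tags, which is roughly what the paper's HTML-to-XML conversion would provide.

```python
from html.parser import HTMLParser

# All of these sets are illustrative assumptions, not the paper's rules.
NOISE_TAGS = {"script", "style", "noscript", "iframe", "form", "nav", "footer"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "area", "base",
             "col", "embed", "source", "track", "wbr"}
CONTAINER_TAGS = {"body", "div", "td", "article", "section"}
BLOCK_BREAK_TAGS = {"p", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
PUNCTUATION = set("，。；：！？、,.;:!?")


class Node:
    """A tiny DOM node; children holds text strings and child Nodes in order."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree and skips every subtree rooted at a noise tag."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0  # greater than 0 while inside a noise subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in NOISE_TAGS:
            self.skip_depth += 1
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.children.append(data)


def own_text(node):
    """Text of a node, excluding text that belongs to nested containers."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.tag not in CONTAINER_TAGS:
            parts.append(own_text(child))
            if child.tag in BLOCK_BREAK_TAGS:
                parts.append("\n")
    return "".join(parts)


def iter_nodes(node):
    """Yield a node and all of its element descendants in document order."""
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from iter_nodes(child)


def punctuation_count(text):
    return sum(ch in PUNCTUATION for ch in text)


def extract_main_text(html):
    """Return the own text of the container with the most punctuation marks."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [n for n in iter_nodes(builder.root) if n.tag in CONTAINER_TAGS]
    if not candidates:
        return own_text(builder.root).strip()
    best = max(candidates, key=lambda n: punctuation_count(own_text(n)))
    return own_text(best).strip()
```

Calling `extract_main_text(page)` on a page with a link menu, a script, and an article `div` would return only the article text under these assumptions, because menus and footers typically contain few sentence-level punctuation marks. Scoring each container on its own text rather than on all descendant text keeps the outermost `body` element from always winning.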
Content extraction from web page based on the DOM tree and line-text statistical noise-elimination

LI Xia, JIANG Sheng-yi. Content extraction from web page based on the DOM tree and line-text statistical noise-elimination [J]. Journal of Shandong University, 2012, 47(3): 38-42.

Authors:	LI Xia, JIANG Sheng-yi

Institution:	Cisco School of Informatics, Guangdong University of Foreign Studies, Guangzhou 510006, Guangdong, China

Abstract:	Because web pages use different character encodings, each HTML page is first converted to UTF-8 and then translated into an XML document, which is parsed into a DOM tree. After noise nodes are removed from the DOM tree according to features of the XML language and rules describing noise characteristics, the text content is extracted from the tree using punctuation statistics, and remaining noise is then eliminated from the extracted content using line-text statistics. Experiments on 2000 web pages from different web sites show that the method achieves high accuracy, generalizes well, and is simple to implement, and that it can be used to extract the main content automatically from different web sites.

Keywords:	content extraction from web pages; DOM tree; line-text statistics; punctuation statistics
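
The final stage, eliminating noise from already extracted content with line-text statistics, could look roughly like the Python sketch below. The abstract does not state which per-line statistics the authors use, so the rule here is a hypothetical stand-in: a line is kept if it contains sentence punctuation or if its length is at least `min_ratio` times the mean line length. The function name `denoise_lines`, the `min_ratio` parameter, and its default of 0.5 are all assumptions.

```python
# Hypothetical line-level filter; the rule and threshold are assumptions.
PUNCTUATION = set("，。；：！？、,.;:!?")


def denoise_lines(text, min_ratio=0.5):
    """Drop short lines without punctuation from extracted page text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ""
    mean_len = sum(len(ln) for ln in lines) / len(lines)
    kept = []
    for ln in lines:
        has_punct = any(ch in PUNCTUATION for ch in ln)
        # Keep real sentences, and long lines even if they lack punctuation.
        if has_punct or len(ln) >= min_ratio * mean_len:
            kept.append(ln)
    return "\n".join(kept)
```

A filter of this kind removes leftover widget labels such as "Share" or "Print" that survive inside the content block. It also drops short unpunctuated subheadings, so a real system would need to tune the threshold or handle headings separately.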
This article is indexed in CNKI, Wanfang Data, and other databases.