基于分块的网页主题信息自动提取算法 (An automatic extraction algorithm of Web pages topical information based on blocks)

Cite this article:	殷贤亮, 李猛. 基于分块的网页主题信息自动提取算法[J]. 华中科技大学学报(自然科学版), 2007, 35(10): 39-41.

Authors:	殷贤亮 (Yin Xianliang), 李猛 (Li Meng)

Affiliation:	School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China

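Abstract:	For the large number of template-based Web pages on the Internet, and exploiting their semi-structured nature, a Web page segmentation and automatic topic information extraction algorithm is proposed. The algorithm segments a page into blocks using its HTML tags, improves on the traditional text feature selection method, represents each block as a feature vector, and identifies the topic content blocks according to an ordered tag set. The algorithm is used to improve the preprocessing stage of Web page classification, raising classification speed and accuracy. Experiments show that extracting topic information before classification improves the recall and precision of the classification system.
Keywords:	Web page segmentation; topic information; automatic extraction; feature selection; Web page classification
Article ID:	1671-4512(2007)10-0039-03
Revised:	2006-08-30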
An automatic extraction algorithm of Web pages topical information based on blocks

Yin Xianliang, Li Meng. An automatic extraction algorithm of Web pages topical information based on blocks[J]. JOURNAL OF HUAZHONG UNIVERSITY OF SCIENCE AND TECHNOLOGY. NATURE SCIENCE, 2007, 35(10): 39-41.

Authors:	Yin Xianliang, Li Meng

Institution:	School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China

Abstract:	Based on the semi-structured nature of template-based Web pages on the Internet, an algorithm that identifies topic content blocks is proposed. The algorithm segments a Web page according to its HTML tags and represents each block as a feature vector, improving on the traditional text feature selection method. Applying the algorithm in the preprocessing stage of Web page classification improves both the speed and the accuracy of classification. Experiments show that performing topic content extraction before classification improves the precision and recall of the classifier.

Keywords:	Web-page segmentation; topic content information; automatic extraction; feature selection; Web page classification
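
The pipeline outlined in the abstract (segment a page by its HTML tags, describe each block by features, then pick the topic content block) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not list the ordered tag set or the improved feature selection method, so `BLOCK_TAGS`, the feature names, and the length-times-link-density score below are all assumptions chosen for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: tags treated as block boundaries. The paper's actual
# "ordered tag set" is not given in the abstract.
BLOCK_TAGS = {"table", "tr", "td", "div", "p", "ul", "li", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}  # their content is never page text


class BlockSegmenter(HTMLParser):
    """Splits a page into text blocks at every BLOCK_TAGS boundary."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks
        self._text = []       # text pieces of the block being built
        self._link_chars = 0  # characters of the current block inside <a>
        self._in_link = 0
        self._skip = 0

    def flush(self):
        text = " ".join(" ".join(self._text).split())  # normalise whitespace
        if text:
            self.blocks.append({"text": text, "link_chars": self._link_chars})
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def segment(html):
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text
    return parser.blocks


def features(block):
    """Hypothetical feature vector: term counts, length, link density."""
    text = block["text"]
    return {
        "terms": Counter(re.findall(r"\w+", text.lower())),
        "length": len(text),
        "link_density": min(1.0, block["link_chars"] / len(text)),
    }


def score(block):
    # Assumed heuristic: long blocks with little link text are topical.
    f = features(block)
    return f["length"] * (1.0 - f["link_density"])


def topic_block(html):
    blocks = segment(html)
    return max(blocks, key=score) if blocks else None


SAMPLE = """<html><head><style>p{color:red}</style></head><body>
<div><a href="/">Home</a> <a href="/news">News</a></div>
<div><p>Template-based pages wrap the article text in the same navigation and footer blocks.</p></div>
<div><a href="/about">About</a> | <a href="/contact">Contact</a></div>
</body></html>"""

if __name__ == "__main__":
    print(topic_block(SAMPLE)["text"])
```

In this sketch the navigation and footer blocks score low because nearly all of their text sits inside links, so only the article paragraph would be passed on to a downstream classifier.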
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
