A Study and Implement of Intranet Gather Information Technology Based on Web Page

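Cite this article:	MA Zhi-qiang, ZHAO Xi, JIA Peng. A Study and Implement of Intranet Gather Information Technology Based on Web Page [J]. Journal of Inner Mongolia University (Natural Science Edition), 2009, 40(2).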

Author names:	MA Zhi-qiang, ZHAO Xi, JIA Peng

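Affiliations:	1. School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010051; 2. Northeastern University at Qinhuangdao, Qinhuangdao 066004; 3. Baotou West Locomotive Depot, Hohhot Railway Bureau, Baotou 014011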

Funding:	Scientific research project of Inner Mongolia University of Technology

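Abstract:	Automatic information collection is a key step in implementing an in-site search engine. The in-site collection technique analyzes the HTML code of web pages to obtain the hyperlinks within the site, then uses a breadth-first search algorithm and an incremental storage algorithm to automatically and continuously analyze links, fetch files, and process and save data. On subsequent runs the system applies attribute comparison, which to some extent avoids analyzing and collecting the same pages again and improves both the update speed and the recall of the collected information.
Keywords:	information collection; breadth-first search algorithm; incremental storage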
A Study and Implement of Intranet Gather Information Technology Based on Web Page

MA Zhi-qiang, ZHAO Xi, JIA Peng. A Study and Implement of Intranet Gather Information Technology Based on Web Page [J]. Acta Scientiarum Naturalium Universitatis Neimongol, 2009, 40(2).

Authors:	MA Zhi-qiang, ZHAO Xi, JIA Peng

Institution:	1. School of Information Engineering, Inner Mongolia University of Technology, Hohhot 010051, China; 2. Northeastern University at Qinhuangdao, Qinhuangdao 066004, China; 3. Baotou West Locomotive Depot, Hohhot Railway Bureau, Baotou 014011, China

Abstract:	A key step in implementing an intranet search engine is gathering information automatically. The intranet information gathering system analyzes HTML code to extract hyperlinks and uses a breadth-first search algorithm and an incremental storage algorithm to continuously analyze hyperlinks, crawl files, and process and store data. When the system runs again, it applies attribute comparison, which improves the update speed and the recall rate.

Keywords:	information gathering; breadth-first search; incremental storage
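
The abstract describes the pipeline only in outline, so the following Python sketch is an illustration under stated assumptions rather than the authors' implementation. It shows a breadth-first traversal of in-site hyperlinks combined with one possible form of attribute comparison: a page whose attributes match the previous run is neither fetched nor parsed again, and its stored links are reused. The names `crawl_site`, `LinkExtractor`, `probe`, and `fetch` are hypothetical, as is the choice of what counts as a page "attribute" (for example a last-modified time or a size).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_site(start_url, probe, fetch, previous=None):
    """Breadth-first crawl limited to the host of start_url.

    probe(url) returns the page attributes, or None if the page is
    unreachable; fetch(url) returns the HTML text. Both are supplied by
    the caller (assumption: the paper does not specify them).
    previous maps url -> {"attrs": ..., "links": [...]} from an earlier run.
    Returns (store, changed): the new store and the URLs that were new
    or modified, in the order they were visited.
    """
    previous = previous or {}
    host = urlparse(start_url).netloc
    queue = deque([start_url])   # FIFO queue gives breadth-first order
    seen = {start_url}
    store, changed = {}, []

    while queue:
        url = queue.popleft()
        attrs = probe(url)
        if attrs is None:
            continue
        old = previous.get(url)
        if old is not None and old["attrs"] == attrs:
            # Attributes unchanged: reuse stored links, skip fetch and parse.
            links = old["links"]
        else:
            parser = LinkExtractor()
            parser.feed(fetch(url))
            # Resolve relative links and drop #fragments.
            links = [urldefrag(urljoin(url, h)).url for h in parser.links]
            changed.append(url)
        store[url] = {"attrs": attrs, "links": links}
        for link in links:
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return store, changed
```

Passing the `store` returned by one run as `previous` on the next run makes the second pass incremental: only new or modified pages are downloaded and analyzed, while unchanged pages still contribute their known links to the traversal.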
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.