
Algorithm of Weeding Replicated Entertainment News by Keyword Extraction (基于关键词提取的娱乐新闻文档去重算法)

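Cite this article:	SHA Yun, ZHANG Guo-ying, MENG Fan-liang. Algorithm of Weeding Replicated Entertainment News by Keyword Extraction[J]. Journal of Guangxi Normal University (Natural Science Edition), 2007, 25(2): 30-33.

Authors:	SHA Yun (沙芸), ZHANG Guo-ying (张国英), MENG Fan-liang (孟凡亮)

Institution:	Department of Computer, Beijing Institute of Petrochemical Technology, Beijing 102617

Funding:	National ministry pre-research project; Beijing Municipal Education Commission research project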

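Abstract:	Removing news items with identical or near-identical content is one of the key techniques for improving search engines. This paper proposes a news de-duplication algorithm based on keyword extraction. By building lexical chains with the title as seed points, the algorithm finds words that are not high-frequency but contribute strongly to the topic, and so extracts a complete keyword set for each document; the method can also identify new words from a small-scale corpus. To improve the speed and quality of web page de-duplication, an inverted file for de-duplication is built on the keywords. Experimental results show that, compared with traditional methods, the exclusion error rate is reduced by 5% and de-duplication time is shortened by 20%-30%.
Keywords:	keyword extraction; new word recognition; document similarity
Article ID:	1001-6600(2007)02-0030-04
Received:	2006-12-15
Revised:	2006-12-15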
Algorithm of Weeding Replicated Entertainment News by Keyword Extraction

SHA Yun, ZHANG Guo-ying, MENG Fan-liang. Algorithm of Weeding Replicated Entertainment News by Keyword Extraction[J]. Journal of Guangxi Normal University (Natural Science Edition), 2007, 25(2): 30-33.

Authors:	SHA Yun, ZHANG Guo-ying, MENG Fan-liang

Institution:	Department of Computer, Beijing Institute of Petrochemical Technology, Beijing 102617, China

Abstract:	Weeding out duplicated or near-duplicated news is one of the key techniques for improving search engines. A new algorithm for weeding duplicated news, based on keyword extraction, is proposed. The algorithm uses the title as seed points to build lexical chains, which lets it pick up important but low-frequency words and so obtain a complete keyword set, and it recognizes unknown words from a small-scale corpus. To improve the speed and quality of weeding, an inverted file is built from the extracted keywords. Experimental results show that, compared with classical algorithms, the exclusive error rate of this algorithm is 5% lower and the time for weeding duplicated news drops by 20%-30%.

Keywords:	keyword extraction; unknown word recognition; document similarity
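
The pipeline described in the abstract (title-seeded keyword selection followed by a keyword-based inverted file for de-duplication) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the lexical-chain construction, the similarity measure, or any thresholds, so the sketch substitutes a simple co-occurrence window around title words for the lexical chain and Jaccard overlap of keyword sets for document similarity. The names `extract_keywords`, `DedupIndex`, `window`, `top_k`, and `threshold` are hypothetical, and input text is assumed to be already segmented into tokens.

```python
from collections import Counter, defaultdict


def extract_keywords(title_tokens, body_tokens, window=5, top_k=5):
    """Title-seeded keyword selection (simplified stand-in for lexical chains).

    A body word is scored by how often it appears within `window` tokens of
    a title word, so a low-frequency word close to the seeds can be chosen.
    """
    seeds = set(title_tokens)
    scores = Counter()
    for i, tok in enumerate(body_tokens):
        if tok in seeds:
            lo = max(0, i - window)
            hi = min(len(body_tokens), i + window + 1)
            for neighbor in body_tokens[lo:hi]:
                if neighbor not in seeds:
                    scores[neighbor] += 1
    chained = {word for word, _ in scores.most_common(top_k)}
    return seeds | chained


class DedupIndex:
    """Inverted file from keyword to document ids, used to find duplicates."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold          # assumed Jaccard cut-off
        self.postings = defaultdict(set)    # keyword -> ids of kept documents
        self.keywords = {}                  # document id -> keyword set

    def find_duplicate(self, kws):
        """Return the id of a kept document similar to `kws`, else None."""
        # Only documents sharing at least one keyword are compared.
        candidates = set()
        for kw in kws:
            candidates |= self.postings.get(kw, set())
        for doc_id in sorted(candidates):
            other = self.keywords[doc_id]
            jaccard = len(kws & other) / len(kws | other)
            if jaccard >= self.threshold:
                return doc_id
        return None

    def add(self, doc_id, kws):
        """Index the document unless it duplicates a kept one.

        Returns the id of the matching kept document, or None if indexed.
        """
        dup = self.find_duplicate(kws)
        if dup is None:
            self.keywords[doc_id] = set(kws)
            for kw in kws:
                self.postings[kw].add(doc_id)
        return dup
```

The point of the inverted file in this sketch is that a new item is compared only against kept documents sharing at least one keyword, rather than against the whole collection, which is one plausible reading of how a keyword index could shorten de-duplication time.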
This article is indexed in databases including VIP and Wanfang Data.
