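A Keyword-Extraction-Based Deduplication Algorithm for Entertainment News Documents
Cite this article:SHA Yun, ZHANG Guo-ying, MENG Fan-liang. A Keyword-Extraction-Based Deduplication Algorithm for Entertainment News Documents[J]. Journal of Guangxi Normal University (Natural Science Edition), 2007, 25(2): 30-33
Authors:SHA Yun (沙芸)  ZHANG Guo-ying (张国英)  MENG Fan-liang (孟凡亮)
Affiliation:Department of Computer, Beijing Institute of Petrochemical Technology, Beijing 102617 (all three authors)
Funding:National ministry-level pre-research project; research project of the Beijing Municipal Education Commission
Abstract:Removing news items with identical or near-identical content is one of the key techniques for improving search engines. This paper proposes a news deduplication algorithm based on keyword extraction. By building lexical chains with the title as the seed, the method finds words that are not high-frequency but contribute strongly to the topic, and so extracts a complete keyword set for each document; it can also recognize new words from a small-scale corpus. To improve the speed and quality of web page deduplication, an inverted index for deduplication is built on the extracted keywords. Experimental results show that, compared with traditional methods, the exclusive error rate (排斥错误率) falls by 5% and deduplication time is shortened by 20%~30%.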

Keywords:keyword extraction  new word recognition  document similarity
Article ID:1001-6600(2007)02-0030-04
Received:2006-12-15
Revised:2006-12-15
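
The title-seeded keyword extraction described in the abstract can be illustrated with a small Python sketch. The abstract does not say how words are linked into a lexical chain, so this sketch assumes a simple stand-in: a word joins the chain when it shares a sentence with a title word. The function name `extract_keywords`, the `top_k` and `stopwords` parameters, and the scoring rule are all hypothetical, and the input is assumed to be already segmented into words.

```python
from collections import Counter


def extract_keywords(title_tokens, sentences, top_k=5, stopwords=frozenset()):
    """Pick keywords by growing a chain outward from the title words.

    title_tokens: words of the title, used as seeds.
    sentences: list of sentences, each a list of already-segmented words.
    Sentence-level co-occurrence is an ASSUMED stand-in for the
    relatedness measure of the paper, which the abstract does not give.
    """
    seeds = set(title_tokens)
    link_strength = Counter()
    for sentence in sentences:
        words = set(sentence) - stopwords
        shared = len(words & seeds)
        if shared == 0:
            continue  # this sentence is not attached to the chain
        for word in words - seeds:
            link_strength[word] += shared
    # Strongest links first; alphabetical order makes ties deterministic.
    ranked = sorted(link_strength.items(), key=lambda kv: (-kv[1], kv[0]))
    chained = [word for word, _ in ranked[:top_k]]
    # Title words come first, without repeats, then the chained words.
    return list(dict.fromkeys(title_tokens)) + chained
```

With this scoring, a word that appears only once or twice but next to a title word can be selected, while a word that is frequent only in sentences unrelated to the title is not. That is the behaviour the abstract attributes to the lexical chain, though the real method may weight links differently.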

Algorithm of Weeding Replicated Entertainment News by Keyword Extraction
SHA Yun,ZHANG Guo-ying,MENG Fan-liang. Algorithm of Weeding Replicated Entertainment News by Keyword Extraction[J]. Journal of Guangxi Normal University(Natural Science Edition), 2007, 25(2): 30-33
Authors:SHA Yun  ZHANG Guo-ying  MENG Fan-liang
Affiliation:Department of Computer,Beijing Institute of Petrochemical Technology,Beijing 102617,China
Abstract:Removing duplicated news is an important search engine technique. A new algorithm for removing duplicated news based on keyword extraction is proposed. The algorithm uses the title as the seed for building lexical chains, obtains a complete keyword set by picking out words that are important but occur infrequently, and recognizes unknown words from a small-scale corpus. To improve the speed and quality of deduplication, an inverted index is built from the extracted keywords. Experimental results show that the exclusive error rate of this algorithm is 5% lower than that of classical algorithms, and that the time needed to remove duplicated news drops by 20%-30%.
Keywords:keyword extraction  unknown word recognition  document similarity
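
The keyword-based inverted index mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `KeywordDeduplicator`, the use of Jaccard similarity over keyword sets, and the 0.6 threshold are assumptions, since the abstract does not specify the similarity measure or cut-off.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two keyword sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class KeywordDeduplicator:
    """Keeps an inverted index (keyword -> document ids) of accepted documents."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold     # assumed cut-off, not taken from the paper
        self.index = defaultdict(set)  # keyword -> ids of documents containing it
        self.keywords = {}             # document id -> keyword set

    def add(self, doc_id, keywords):
        """Return the id of an existing near-duplicate, or None if the document is new."""
        kws = set(keywords)
        # Only documents sharing at least one keyword can be duplicates,
        # so the index narrows the comparison to those candidates.
        candidates = set()
        for kw in kws:
            candidates |= self.index.get(kw, set())
        for other in sorted(candidates):
            if jaccard(kws, self.keywords[other]) >= self.threshold:
                return other  # duplicate: do not index this document
        self.keywords[doc_id] = kws
        for kw in kws:
            self.index[kw].add(doc_id)
        return None
```

Because a new document is compared only with documents that share a keyword with it, most of the collection is never touched, which is one plausible source of the time saving the abstract reports.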
This article is indexed in VIP (维普), Wanfang Data (万方数据), and other databases.