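An Automatic Document Keyphrase Extraction Method Based on a Word Co-occurrence Graph
Cite this article: Geng Huan-Tong, Cai Qing-Sheng, Yu Kun, Zhao Peng. An Automatic Document Keyphrase Extraction Method Based on a Word Co-occurrence Graph[J]. Journal of Nanjing University (Natural Sciences), 2006, 42(2): 156-162.
Authors: Geng Huan-Tong  Cai Qing-Sheng  Yu Kun  Zhao Peng
Affiliations: Department of Computer Science and Technology, University of Science and Technology of China, Hefei 230027; Department of Computer Science, Anhui Normal University, Wuhu 241000
Funding: National High Technology Research and Development Program of China (863 Program); Natural Science Foundation of the Anhui Provincial Department of Education
Abstract: Keyphrase extraction is a fundamental task in automatic text processing. Building on an in-depth study of existing keyphrase extraction methods, this paper proposes a method for automatically extracting document keyphrases based on a word co-occurrence graph. Starting from word frequency statistics, the method uses the topic information formed in the word co-occurrence graph, together with the linkage features between different topics, to extract keyphrases from a document automatically. It aims to find words that are not high-frequency yet contribute substantially to the topic. Experiments show that the keyphrases extracted by this method match the author's intended topics more accurately.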

Keywords: natural language processing  word co-occurrence graph  keyphrase
Received: 2005-04-26

A Kind of Automatic Text Keyphrase Extraction Method Based on Word Co-occurrence
Geng Huan-Tong, Cai Qing-Sheng, Yu Kun, Zhao Peng. A Kind of Automatic Text Keyphrase Extraction Method Based on Word Co-occurrence[J]. Journal of Nanjing University: Nat Sci Ed, 2006, 42(2): 156-162.
Authors: Geng Huan-Tong  Cai Qing-Sheng  Yu Kun  Zhao Peng
Institution: 1. Department of Computer Science and Technology, University of Science and Technology of China, Hefei, 230027, China; 2. Department of Computer Science, Anhui Normal University, Wuhu, 241000, China
Abstract: Advances in high-volume storage media have led to an explosion in the amount of machine-readable text. Keyphrase extraction is one of the fundamental tasks of natural language processing. In this paper, a novel automatic text keyphrase extraction method based on word co-occurrence is proposed, building on a study of existing keyphrase extraction methods. The method starts from word frequency statistics and uses the text subject information formed in a word co-occurrence graph, together with the linkage information between different text subjects. Our goal is to extract keyphrases whose content most accurately matches the specific and unique interests of the user. The algorithm extracts keyphrases that represent the asserted main point of a document, without relying on external devices such as natural language processing tools or a document corpus. It is based on segmenting a graph, which represents the co-occurrence between terms in a document, into clusters. Each cluster corresponds to a concept on which the author's idea is based, and the terms ranked highest by a statistic based on each term's relationship to these clusters are selected as keyphrases. The experimental results show that the terms extracted in this way match the author's point quite accurately, even though the method does not use the average frequency of each term in a corpus; that is, the method is a content-sensitive, domain-independent indexing device. Its purpose is to find words that are not high-frequency but contribute greatly to the text subject. The greatest benefit is the extraction of such words, which carry the point of the document, i.e., the concepts presented by the author. This merit can help satisfy search engine users with unique interests or ideas.
Keywords: TFIDF
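
The pipeline outlined in the abstract (frequency statistics, a word co-occurrence graph, segmentation into clusters, and ranking terms by their relationship to those clusters) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: sentence-level co-occurrence, connected components as clusters, the scoring rule, and the names `extract_keyphrases`, `top_n`, `min_cooc`, `k`, and `sample` are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def extract_keyphrases(sentences, top_n=4, min_cooc=2, k=3):
    """Return up to k candidate keyphrases from a list of token lists.

    Assumed scoring rule (not taken from the paper): a term ranks higher
    the more clusters it co-occurs with, then by total co-occurrence count.
    """
    # Step 1: the top_n most frequent terms become the graph nodes.
    freq = Counter(w for s in sentences for w in s)
    nodes = {w for w, _ in freq.most_common(top_n)}

    # Step 2: count sentence-level co-occurrence for every pair of terms.
    cooc = Counter()
    for s in sentences:
        for a, b in combinations(sorted(set(s)), 2):
            cooc[(a, b)] += 1

    # Step 3: link two nodes when they co-occur at least min_cooc times.
    adj = defaultdict(set)
    for (a, b), c in cooc.items():
        if a in nodes and b in nodes and c >= min_cooc:
            adj[a].add(b)
            adj[b].add(a)

    # Step 4: connected components of the graph stand in for topic clusters.
    cluster_of = {}
    for start in sorted(nodes):
        if start in cluster_of:
            continue
        cluster_of[start] = start
        stack = [start]
        while stack:
            cur = stack.pop()
            for nxt in adj[cur]:
                if nxt not in cluster_of:
                    cluster_of[nxt] = start
                    stack.append(nxt)

    # Step 5: score every term (frequent or not) by the clusters it touches.
    touched = defaultdict(set)
    weight = Counter()
    for (a, b), c in cooc.items():
        for term, other in ((a, b), (b, a)):
            if other in nodes:
                touched[term].add(cluster_of[other])
                weight[term] += c
    ranked = sorted(touched, key=lambda w: (-len(touched[w]), -weight[w], w))
    return ranked[:k]


# Toy document: two topics plus one infrequent term that links them.
sample = [
    ["graph", "cluster", "edge"],
    ["graph", "cluster"],
    ["graph", "cluster", "bridge"],
    ["index", "search", "query"],
    ["index", "search"],
    ["index", "search", "bridge"],
]
print(extract_keyphrases(sample))
```

In this toy input, `bridge` is not among the four most frequent terms, yet it is ranked first because it co-occurs with both clusters. That mirrors the stated aim of surfacing words that are not high-frequency but contribute to the subject.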
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.