
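Information Extraction from Chinese Research Papers Based on Conditional Random Fields
Cite this article: Yu Jiang-de, Fan Xiao-zhong, Yin Ji-hao. Information Extraction from Chinese Research Papers Based on Conditional Random Fields[J]. Journal of South China University of Technology (Natural Science Edition), 2007, 35(9): 90-94, 106.
Authors: Yu Jiang-de  Fan Xiao-zhong  Yin Ji-hao
Affiliation: School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081
Funding: Specialized Research Fund for the Doctoral Program of Higher Education
Abstract: The header and citation information of research papers is indispensable for field-based paper retrieval, statistics and citation analysis. Because the hidden Markov model cannot make full use of the context features that are useful for extraction, this paper proposes a method based on conditional random fields for extracting header and citation information from Chinese research papers. The key to the method lies in model parameter estimation and feature selection. In the experiments, the L-BFGS algorithm is used to learn the model parameters, and four categories of features (local, layout, lexicon and state transition) are selected as the model's feature set. During extraction, format information such as separators and specific identifiers is first used to split the text into blocks, and conditional random fields are then applied to the blocks to extract the specified fields. Experiments show that the method clearly outperforms the method based on the hidden Markov model, and that adding different feature sets improves extraction performance to different degrees.

Keywords: information extraction  conditional random field  citation information  paper header information
Article ID: 1000-565X(2007)09-0090-05
Revised: 2006-11-27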

Information Extraction from Chinese Research Papers Based on Conditional Random Fields
Yu Jiang-de, Fan Xiao-zhong, Yin Ji-hao. Information Extraction from Chinese Research Papers Based on Conditional Random Fields[J]. Journal of South China University of Technology (Natural Science Edition), 2007, 35(9): 90-94, 106.
Authors:Yu Jiang-de  Fan Xiao-zhong  Yin Ji-hao
Institution: School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
Abstract: Header and citation information in research papers is necessary for many applications, such as field-based paper search, paper statistics and citation analysis. Because the hidden Markov model (HMM) cannot make full use of the context features that are useful for extraction, a method based on conditional random fields (CRFs) is proposed to extract header and citation information from Chinese research papers. The key to the method is parameter estimation and feature selection: in the experiments, the L-BFGS algorithm is used to estimate the model parameters, and four categories of features (local, layout, lexicon and state transition) are selected as the feature set. During extraction, format information such as separators and special identifiers is first used to segment the text into blocks, and CRFs are then applied to the blocks to extract the specified fields. Experimental results show that the proposed method clearly outperforms the HMM-based method, and that the performance improvement varies with the feature set that is added.
Keywords:information extraction  conditional random field  citation information  paper header information
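
The block-segmentation step described in the abstract can be illustrated with a minimal sketch. It assumes a particular separator set, a document-type marker pattern such as `[J]`, and a tiny journal word list; none of these are given in the source, and the function names `segment` and `block_features` are hypothetical. The CRF itself is not shown: the per-block feature dictionaries are only the kind of input a sequence labeler might consume.

```python
import re

# Assumed separator set; the abstract only mentions "separators and special identifiers".
SEP = re.compile(r"([.,:;])")
# Assumed document-type marker, e.g. "[J]" at the end of a title block.
DOC_TYPE = re.compile(r"\[([A-Z])\]$")
# Hypothetical mini-lexicon of words that hint at a journal name.
JOURNAL_WORDS = {"journal", "transactions", "proceedings"}


def segment(citation):
    """Split a citation string into (block_text, preceding_separator) pairs."""
    parts = SEP.split(citation)  # alternates: text, separator, text, ...
    blocks = []
    prev_sep = ""
    for i in range(0, len(parts), 2):
        text = parts[i].strip()
        if text:
            blocks.append((text, prev_sep))
        prev_sep = parts[i + 1] if i + 1 < len(parts) else ""
    return blocks


def block_features(blocks, i):
    """Build an illustrative feature dictionary for block i."""
    text, prev_sep = blocks[i]
    marker = DOC_TYPE.search(text)
    words = re.findall(r"[A-Za-z]+", text)
    return {
        # surface cues of the block itself
        "all_digits": text.isdigit(),
        "has_digit": any(ch.isdigit() for ch in text),
        "doc_type": marker.group(1) if marker else "",
        # position and formatting cues
        "position": i,
        "prev_sep": prev_sep,
        # lexicon cue
        "in_lexicon": any(w.lower() in JOURNAL_WORDS for w in words),
    }


EXAMPLE = (
    "Yu Jiang-de,Fan Xiao-zhong,Yin Ji-hao.Information Extraction from Chinese "
    "Research Papers Based on Conditional Random Fields[J].Journal of South China "
    "University of Technology(Natural Science Edition),2007,35(9):90-94,106."
)

for idx, (text, _) in enumerate(segment(EXAMPLE)):
    print(text, block_features(segment(EXAMPLE), idx))
```

Splitting on punctuation alone would break on abbreviations such as "Tech." or on decimal numbers, which is one reason a learned model over the blocks is still needed after this step.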
This article is indexed in CNKI, VIP, Wanfang Data and other databases.