
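A method of potato breeding literature information extraction based on part of speech tagging rules
Cite this article: Wang Tengyang, Zhao Xiaodan, Hu Lin. A method of potato breeding literature information extraction based on part of speech tagging rules[J]. Science Technology and Engineering, 2023, 23(27): 11562-11569.
Authors: Wang Tengyang, Zhao Xiaodan, Hu Lin
Affiliation: Agricultural Information Institute, Chinese Academy of Agricultural Sciences
Funding: Major Science and Technology Special Project of the Inner Mongolia Autonomous Region (2021SZD0026)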
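Abstract: The field of potato breeding has accumulated a large body of breeding literature that has not yet been structured, and manually collating the germplasm resource data in it is time-consuming and laborious. To extract germplasm resource data from breeding literature quickly and accurately, a method based on part-of-speech tagging rules and preset words is used. The literature is in PDF format; where the document text cannot be obtained directly, the run-length smoothing algorithm and optical character recognition (OCR) are used to obtain the text content. A keyword library that users can build flexibly stores the extraction items. Regular expressions retrieve the sentences containing each keyword, a natural language processing tool performs word segmentation and part-of-speech tagging on those sentences, and target words are extracted according to rules. An information extraction method based on the distance between keywords and preset words is used alongside these rules, so that breeding literature is converted from free text into structured data. Information extraction was carried out on 1490 extraction items from 115 papers. The experiments show a precision of 82.97%, a recall of 99.72%, and an F value of 90.58%. The method extracts germplasm resources from potato breeding literature with relatively high precision and recall and can provide a data foundation for building a potato genetics and breeding database.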
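
The extraction steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the keyword library, the tagging tool, or the concrete rules, so everything named here is an assumption made for illustration: the library entries, the sentence delimiters, the tags `m` (numeral) and `q` (measure word) from a common Chinese tag set, and the three helper functions. Tokens are passed in already tagged, standing in for the output of whichever segmentation and tagging tool is used.

```python
import re

# Hypothetical keyword library: extraction item -> preset candidate values.
KEYWORD_LIBRARY = {
    "薯形": ["圆形", "椭圆形", "长椭圆形"],  # tuber shape
    "皮色": ["黄", "白", "红", "紫"],        # skin colour
}

SENTENCE_END = "。！？；"  # assumed sentence delimiters


def sentences_with_keyword(text, keyword):
    """Return every sentence of `text` that contains `keyword`."""
    pattern = re.compile(
        "[^" + SENTENCE_END + "]*" + re.escape(keyword) + "[^" + SENTENCE_END + "]*"
    )
    return [m.group(0).strip() for m in pattern.finditer(text)]


def nearest_preset(sentence, keyword, presets):
    """Pick the preset word that occurs closest to the keyword.

    Distance is the number of characters between the keyword and the
    preset occurrence; ties go to the longer preset word.
    """
    k = sentence.find(keyword)
    if k < 0:
        return None
    k_end = k + len(keyword)
    best = None  # (distance, -length, word)
    for word in presets:
        for m in re.finditer(re.escape(word), sentence):
            if m.start() >= k_end:
                dist = m.start() - k_end      # preset follows the keyword
            else:
                dist = max(k - m.end(), 0)    # preset precedes (or overlaps) it
            cand = (dist, -len(word), word)
            if best is None or cand < best:
                best = cand
    return best[2] if best else None


def numeric_value_after(tagged, keyword):
    """Rule sketch: first numeral ('m') after the keyword, plus a following
    measure word ('q') if there is one. `tagged` is a list of (word, tag)."""
    for i, (word, _tag) in enumerate(tagged):
        if word != keyword:
            continue
        for j in range(i + 1, len(tagged)):
            value, tag = tagged[j]
            if tag == "m":
                has_unit = j + 1 < len(tagged) and tagged[j + 1][1] == "q"
                return value + (tagged[j + 1][0] if has_unit else "")
        return None
    return None


text = "该品种株高65厘米。薯形椭圆形，皮色黄，芽眼浅。"
for item, presets in KEYWORD_LIBRARY.items():
    for sentence in sentences_with_keyword(text, item):
        print(item, nearest_preset(sentence, item, presets))
```

Categorical items such as tuber shape suit the preset-word distance approach, while numeric items such as plant height suit a tag-based rule like `numeric_value_after`. A real system would need many more rules than this sketch shows.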

Keywords: potato; part-of-speech tagging; information extraction; natural language processing
Received: 2023-01-04
Revised: 2023-07-06
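
The three evaluation figures in the abstract can be cross-checked with a few lines of Python. This sketch assumes the first figure is precision in the usual sense and that the F value is the balanced F-measure (the harmonic mean of precision and recall); the abstract does not state the formula, and the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure; beta=1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures reported in the abstract.
f1 = f_measure(0.8297, 0.9972)
print(f"{f1:.2%}")  # 90.58%
```

Under these assumptions the computed value agrees with the reported F value of 90.58%.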

A method of potato breeding literature information extraction based on part of speech tagging rules
Wang Tengyang, Zhao Xiaodan, Hu Lin. A method of potato breeding literature information extraction based on part of speech tagging rules[J]. Science Technology and Engineering, 2023, 23(27): 11562-11569.
Authors: Wang Tengyang, Zhao Xiaodan, Hu Lin
Institution: Agricultural Information Institute of CAAS
Abstract: The potato breeding field has accumulated a large body of unstructured literature text, and manually collating germplasm resource data from this literature is time-consuming and labor-intensive. To extract germplasm resource data from breeding literature quickly and accurately, a method based on part-of-speech tagging rules and preset words is employed. The documents are in PDF format; where the text cannot be obtained directly, the run-length smoothing algorithm and Optical Character Recognition (OCR) are used to obtain the text content. A user-configurable keyword library stores the extraction items. Regular expressions retrieve the sentences containing the keywords, natural language processing tools tokenize and part-of-speech tag those sentences, and target words are extracted according to rules. An information extraction method based on the distance between keywords and preset words is applied as well. Together these steps convert breeding literature from free text into structured data. Information extraction on 1490 extraction items from 115 articles shows that the method achieves a precision of 82.97%, a recall of 99.72%, and an F value of 90.58%. The method extracts germplasm resource data from potato breeding literature with high precision and recall and provides a data basis for constructing potato genetic breeding databases.
Keywords: Potato; Part-of-speech tagging; Information extraction; Natural language processing
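
For the scanned-PDF case mentioned in the abstract, run-length smoothing is commonly used to merge nearby ink pixels into solid blocks so that text regions can be located before OCR. The abstract does not say how the authors apply it, so the following is only a minimal sketch of one common formulation on a binary image (1 = ink, 0 = background): smooth horizontally and vertically, then AND the two results. The function names and thresholds are illustrative, and this variant fills only gaps bounded by ink on both sides.

```python
def smooth_row(row, threshold):
    """Fill background runs of length <= threshold lying between two ink pixels."""
    out = list(row)
    last_ink = None
    for i, value in enumerate(row):
        if value == 1:
            gap = i - last_ink - 1 if last_ink is not None else 0
            if 0 < gap <= threshold:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def rlsa(image, h_threshold, v_threshold):
    """Horizontal and vertical smoothing combined with a pixel-wise AND."""
    horizontal = [smooth_row(row, h_threshold) for row in image]
    # Smooth each column by transposing, smoothing, and transposing back.
    smoothed_columns = [smooth_row(col, v_threshold) for col in zip(*image)]
    vertical = [list(row) for row in zip(*smoothed_columns)]
    return [
        [h & v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(horizontal, vertical)
    ]


print(smooth_row([1, 0, 0, 1, 0, 0, 0, 0, 1], 2))
# [1, 1, 1, 1, 0, 0, 0, 0, 1]
```

In the printed row the two-pixel gap is filled while the four-pixel gap is left alone, which is how closely spaced characters fuse into one block while column gutters stay open. Thresholds would need tuning to the scan resolution.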