
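A Method for Resolving Class Label Imbalance in Named Entity Recognition Datasets
Cite this article: Xu Lidan, Liu Jiayong, He Xiang. A method for resolving class label imbalance in named entity recognition datasets[J]. Journal of Sichuan University (Natural Science Edition), 2020, 57(1): 82-88.
Authors: Xu Lidan, Liu Jiayong, He Xiang
Affiliations: College of Cybersecurity, Sichuan University, Chengdu 610065; College of Electronics and Information Engineering, Sichuan University, Chengdu 610065
Funding: Open Project Fund of the Key Laboratory of Network Assessment Technology, Chinese Academy of Sciences, "Construction of a threat intelligence knowledge graph for unstructured data" (NST 18 001)
Abstract: Public datasets commonly used in named entity recognition research generally suffer from imbalanced class labels, which limits further performance gains for methods based on statistical learning models. To address this problem, a class label balancing method based on a genetic algorithm is proposed. Starting from the labeled data already present in the original dataset, the method modifies the fitness function and the gene combination rules of the genetic algorithm to synthesize text with a balanced class distribution. This text is used to augment the original dataset, reducing the label imbalance and thereby improving named entity recognition. To verify the method's effectiveness, comparison experiments against balanced undersampling and random oversampling were designed using a Bi-LSTM-CRF model on the CoNLL 2003 and JNLPBA datasets. The experiments show that the proposed method raises model recall by 3.26% and the F1 score by 1.70% on CoNLL 2003, and raises recall by 2.44% and the F1 score by 1.03% on JNLPBA. The results indicate that the proposed method effectively alleviates class label imbalance and so improves named entity recognition.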

Keywords: named entity recognition; class imbalance; data synthesis; statistical learning model; genetic algorithm
Received: 2019/5/11
Revised: 2019/8/30

Method for solving class imbalance of named entity recognition dataset
Xu Lidan, Liu Jiayong, He Xiang. Method for solving class imbalance of named entity recognition dataset[J]. Journal of Sichuan University (Natural Science Edition), 2020, 57(1): 82-88.
Authors: Xu Lidan, Liu Jiayong and He Xiang
Institution: College of Cybersecurity, Sichuan University; College of Electronics and Information Engineering, Sichuan University
Abstract: Public datasets used in named entity recognition research often have imbalanced class labels, which limits further performance improvement of methods based on statistical learning models. To address this problem, a class label balancing method based on a genetic algorithm is proposed. The method modifies the fitness function and the gene combination rules to generate new samples that augment the original dataset and balance its class distribution. To verify its validity, the proposed method was compared with balanced undersampling and random oversampling using the Bi-LSTM-CRF model on the CoNLL 2003 and JNLPBA datasets. The results show that the proposed method increased recall by 3.26% and the F1 score by 1.70% on the CoNLL 2003 dataset, and increased recall by 2.44% and the F1 score by 1.03% on the JNLPBA dataset. The experimental results demonstrate that the proposed method effectively alleviates class imbalance and improves named entity recognition performance.
Keywords: Named entity recognition; Class imbalance; Data synthesis; Statistical learning model; Genetic algorithm
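
The abstract does not give the modified fitness function or the gene combination rules, so the following is only a minimal Python sketch of the general idea, under assumptions that are not taken from the paper: a gene is one labeled sentence, an individual is a fixed-length list of sentence indices, fitness is the normalized entropy of the entity-type counts after the individual is added to the data, and recombination is plain one-point crossover with random mutation. The toy sentences and all function names (`entity_counts`, `fitness`, `evolve`) are hypothetical.

```python
import math
import random
from collections import Counter

# Toy BIO-tagged sentences (hypothetical data): lists of (token, tag) pairs.
SENTENCES = [
    [("Alice", "B-PER"), ("visited", "O"), ("Paris", "B-LOC")],
    [("Bob", "B-PER"), ("met", "O"), ("Carol", "B-PER")],
    [("Dave", "B-PER"), ("sings", "O")],
    [("IBM", "B-ORG"), ("hired", "O"), ("Eve", "B-PER")],
    [("Rome", "B-LOC"), ("is", "O"), ("old", "O")],
    [("UN", "B-ORG"), ("met", "O"), ("in", "O"), ("Oslo", "B-LOC")],
]

def entity_counts(indices, sentences):
    """Count entity mentions per type (B- tags only) in the chosen sentences."""
    counts = Counter()
    for i in indices:
        for _, tag in sentences[i]:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts

def fitness(individual, sentences, base_counts):
    """Assumed fitness: normalized entropy of entity types after augmentation.
    Close to 1.0 means near-uniform classes; lower means more imbalance."""
    total = base_counts + entity_counts(individual, sentences)
    n = sum(total.values())
    if n == 0 or len(total) < 2:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in total.values())
    return entropy / math.log(len(total))

def evolve(sentences, base_counts, size=4, pop_size=20, generations=30, seed=0):
    """Generic GA loop: truncation selection, elitism, one-point crossover, mutation."""
    rng = random.Random(seed)
    n = len(sentences)

    def score(ind):
        return fitness(ind, sentences, base_counts)

    population = [[rng.randrange(n) for _ in range(size)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        parents = population[: pop_size // 2]   # keep the fitter half as parents
        children = [population[0][:]]           # elitism: best survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, size)        # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:              # mutation: swap in a random sentence
                child[rng.randrange(size)] = rng.randrange(n)
            children.append(child)
        population = children
    best = max(population, key=score)
    history.append(score(best))
    return best, history

base = entity_counts(range(len(SENTENCES)), SENTENCES)
best, history = evolve(SENTENCES, base)
# Concatenate the selected sentences into one synthetic labeled sample.
synthetic = [pair for i in best for pair in SENTENCES[i]]
```

Here `synthetic` would be appended to the training data before fitting the tagger. The random oversampling baseline named in the abstract corresponds to drawing indices uniformly at random without any fitness-guided search, so the difference between the two in this sketch is only the selection pressure toward balanced counts.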
This article is indexed in CNKI, Wanfang Data, and other databases.