Extraction of bilingual parallel sentence pairs constrained by consistency of structural features
Cite this article: MAO Cunli, GAO Xu, YU Zhengtao, WANG Zhenhan, GAO Shengxiang, MAN Zhibo. Extraction of bilingual parallel sentence pairs constrained by consistency of structural features[J]. Journal of Chongqing University (Natural Science Edition), 2021, 44(1): 46-56.
Authors: MAO Cunli, GAO Xu, YU Zhengtao, WANG Zhenhan, GAO Shengxiang, MAN Zhibo
Affiliations: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, P. R. China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, P. R. China
Funding: National Natural Science Foundation of China; Yunnan Provincial Reserve Talent Project for Young and Middle-aged Academic and Technical Leaders; Key Program of the National Natural Science Foundation of China; Key Project of the Yunnan Applied Basic Research Program
Abstract: Parallel sentence pair extraction is an effective way to alleviate the shortage of parallel corpora for low-resource neural machine translation. The core of Siamese-network-based extraction methods is to judge whether two sentences are parallel by their cross-lingual semantic similarity, which works remarkably well on similar language pairs. However, for English-Southeast Asian language sentence pair extraction, the two languages differ substantially in both language space and sentence length; considering only cross-lingual semantic similarity while ignoring sentence length causes the model to misjudge sentence pairs that merely stand in a semantic inclusion relation as parallel. This paper proposes a bilingual parallel sentence pair extraction method constrained by the consistency of structural features, which extends the Siamese-network-based extraction model. First, a multilingual BERT pre-trained language model encodes the two languages into the same semantic space at the embedding layer, narrowing the cross-lingual gap in that space. Second, the length feature of each sentence is encoded separately and fused with the sentence semantic vector produced by the Siamese encoder, strengthening the representation of parallel sentence pairs in both semantic and structural features and reducing misjudgments of semantically similar but non-parallel pairs. Experiments on an English-Burmese dataset show that, compared with the baseline model, the proposed method improves precision by 4.64%, recall by 2.52%, and F1 by 3.51%.

Keywords: bilingual parallel sentence pairs; low-resource languages; BERT pre-training; Siamese network; structural features
Received: 2020-09-10

Extraction of bilingual parallel sentence pairs constrained by consistency of structural features
MAO Cunli, GAO Xu, YU Zhengtao, WANG Zhenhan, GAO Shengxiang, MAN Zhibo. Extraction of bilingual parallel sentence pairs constrained by consistency of structural features[J]. Journal of Chongqing University (Natural Science Edition), 2021, 44(1): 46-56.
Authors:MAO Cunli  GAO Xu  YU Zhengtao  WANG Zhenhan  GAO Shengxiang  MAN Zhibo
Institution:Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, P. R. China;Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, P. R. China
Abstract: Parallel sentence pair extraction is an effective way to alleviate the shortage of parallel corpora for low-resource neural machine translation. Methods based on Siamese neural networks judge whether two sentences are parallel through cross-lingual semantic similarity and have achieved remarkable results on similar language pairs. However, English-Southeast Asian language pairs differ greatly not only in language space but also in sentence length, so considering only cross-lingual semantic similarity while ignoring sentence length features leads to misjudgment of non-parallel sentence pairs that are only in a semantic inclusion relation. Therefore, this paper proposes a parallel sentence pair extraction method constrained by the consistency of structural features, which extends the Siamese-network-based model. Firstly, multilingual BERT embeds the two languages into the same semantic space at the embedding layer, reducing the language differences in that space. Secondly, the length feature of each sentence is encoded and fused with the sentence semantic vector produced by the Siamese network, enhancing the representation of parallel sentence pairs in both semantic and structural features and mitigating the misjudgment problem. Experiments on the English-Burmese dataset show that precision is increased by 4.64%, recall by 2.52%, and the F1 value by 3.51% compared with the baseline.
Keywords: parallel sentence pairs; low-resource languages; BERT pre-training; Siamese network; structural features
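
The architecture summarized in the abstract (a weight-sharing Siamese encoder over multilingual BERT, with a sentence-length feature fused into each semantic vector before the parallel/non-parallel decision) can be illustrated with a minimal sketch. The sketch below assumes a PyTorch / Hugging Face Transformers setup; the class name, mean pooling, length-bucketing scheme, and layer sizes are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the idea in the abstract: a weight-sharing (Siamese) encoder
# on top of multilingual BERT, with a sentence-length feature fused into each
# sentence vector before the parallel / non-parallel decision.
# Class name, pooling, length bucketing and layer sizes are illustrative
# assumptions, not the authors' released implementation.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class LengthAwareSiameseEncoder(nn.Module):
    def __init__(self, num_length_bins=64, length_dim=32):
        super().__init__()
        # Shared multilingual BERT maps both languages into one semantic space.
        self.bert = BertModel.from_pretrained("bert-base-multilingual-cased")
        hidden = self.bert.config.hidden_size
        # Structural feature: bucketed sentence length mapped to a learned vector.
        self.length_embedding = nn.Embedding(num_length_bins, length_dim)
        # Binary classifier over the fused pair representation.
        self.classifier = nn.Sequential(
            nn.Linear(2 * (hidden + length_dim), 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def encode(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool token states into a single sentence vector (simplified pooling).
        mask = attention_mask.unsqueeze(-1).float()
        semantic = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        # Token count, clipped into a fixed number of bins, as the length feature.
        length_bin = attention_mask.sum(1).clamp(max=self.length_embedding.num_embeddings - 1)
        return torch.cat([semantic, self.length_embedding(length_bin.long())], dim=-1)

    def forward(self, src, tgt):
        # Both sides pass through the same encoder, so the weights are shared.
        h_src = self.encode(src["input_ids"], src["attention_mask"])
        h_tgt = self.encode(tgt["input_ids"], tgt["attention_mask"])
        return self.classifier(torch.cat([h_src, h_tgt], dim=-1)).squeeze(-1)


# Example usage on a hypothetical English-Burmese candidate pair.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = LengthAwareSiameseEncoder()
src = tokenizer(["How are you?"], return_tensors="pt", padding=True)
tgt = tokenizer(["နေကောင်းလား"], return_tensors="pt", padding=True)
score = torch.sigmoid(model(src, tgt))  # probability that the pair is parallel
```

Bucketing the raw token count into a learned embedding is only one simple way to expose sentence length to the classifier; the paper's own structural-feature encoding and fusion may differ in detail.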