Extraction of bilingual parallel sentence pairs constrained by consistency of structural features
Cite this article: MAO Cunli, GAO Xu, YU Zhengtao, WANG Zhenhan, GAO Shengxiang, MAN Zhibo. Extraction of bilingual parallel sentence pairs constrained by consistency of structural features[J]. Journal of Chongqing University (Natural Science Edition), 2021, 44(1): 46-56.
Authors: MAO Cunli, GAO Xu, YU Zhengtao, WANG Zhenhan, GAO Shengxiang, MAN Zhibo
Affiliations: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, P. R. China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, P. R. China
Funding: National Natural Science Foundation of China; Yunnan Provincial Reserve Talent Project for Young and Middle-aged Academic and Technical Leaders; Key Program of the National Natural Science Foundation of China; Key Project of the Yunnan Applied Basic Research Program
Abstract: Parallel sentence pair extraction is an effective way to alleviate the shortage of parallel corpora for low-resource neural machine translation. The core of Siamese-network-based extraction methods is to judge whether two sentences are parallel by their cross-lingual semantic similarity, which works remarkably well on similar language pairs. However, for English-Southeast Asian language sentence pair extraction, the two languages differ substantially in both language space and sentence length; considering only cross-lingual semantic similarity while ignoring sentence length causes the model to misjudge sentence pairs that merely stand in a semantic inclusion relation as parallel. This paper proposes a bilingual parallel sentence pair extraction method constrained by the consistency of structural features, which extends the Siamese-network-based extraction model. First, a multilingual BERT pre-trained language model encodes the two languages into the same semantic space at the embedding layer, narrowing the cross-lingual gap in that space. Second, the length feature of each sentence is encoded separately and fused with the sentence semantic vector produced by the Siamese encoder, strengthening the representation of parallel sentence pairs in both semantic and structural features and reducing misjudgments of semantically similar but non-parallel pairs. Experiments on an English-Burmese dataset show that, compared with the baseline model, the proposed method improves precision by 4.64%, recall by 2.52%, and F1 by 3.51%.

Keywords: bilingual parallel sentence pairs; low-resource languages; BERT pre-training; Siamese network; structural features
Received: 2020-09-10

Extraction of bilingual parallel sentence pairs constrained by consistency of structural features
MAO Cunli, GAO Xu, YU Zhengtao, WANG Zhenhan, GAO Shengxiang, MAN Zhibo. Extraction of bilingual parallel sentence pairs constrained by consistency of structural features[J]. Journal of Chongqing University (Natural Science Edition), 2021, 44(1): 46-56.
Authors:MAO Cunli  GAO Xu  YU Zhengtao  WANG Zhenhan  GAO Shengxiang  MAN Zhibo
Institution:Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, P. R. China;Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, P. R. China
Abstract: Parallel sentence pair extraction is an effective way to alleviate the shortage of parallel corpora for low-resource neural machine translation. Methods based on Siamese neural networks judge whether two sentences are parallel through cross-lingual semantic similarity and have achieved remarkable results on similar language pairs. However, English-Southeast Asian language pairs differ greatly not only in language space but also in sentence length, so considering only cross-lingual semantic similarity while ignoring sentence length features leads to misjudgment of non-parallel sentence pairs that are only in a semantic inclusion relation. Therefore, this paper proposes a parallel sentence pair extraction method constrained by the consistency of structural features, which extends the Siamese-network-based model. Firstly, multilingual BERT embeds the two languages into the same semantic space at the embedding layer, reducing the language differences in that space. Secondly, the length feature of each sentence is encoded and fused with the sentence semantic vector produced by the Siamese network, enhancing the representation of parallel sentence pairs in both semantic and structural features and mitigating the misjudgment problem. Experiments on the English-Burmese dataset show that precision is increased by 4.64%, recall by 2.52%, and the F1 value by 3.51% compared with the baseline.
Keywords: parallel sentence pairs; low-resource languages; BERT pre-training; Siamese network; structural features
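
The architecture summarized in the abstract (a weight-sharing Siamese encoder over multilingual BERT, with a sentence-length feature fused into each semantic vector before the parallel/non-parallel decision) can be illustrated with a minimal sketch. The sketch below assumes a PyTorch / Hugging Face Transformers setup; the class name, mean pooling, length-bucketing scheme, and layer sizes are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the idea in the abstract: a weight-sharing (Siamese) encoder
# on top of multilingual BERT, with a sentence-length feature fused into each
# sentence vector before the parallel / non-parallel decision.
# Class name, pooling, length bucketing and layer sizes are illustrative
# assumptions, not the authors' released implementation.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class LengthAwareSiameseEncoder(nn.Module):
    def __init__(self, num_length_bins=64, length_dim=32):
        super().__init__()
        # Shared multilingual BERT maps both languages into one semantic space.
        self.bert = BertModel.from_pretrained("bert-base-multilingual-cased")
        hidden = self.bert.config.hidden_size
        # Structural feature: bucketed sentence length mapped to a learned vector.
        self.length_embedding = nn.Embedding(num_length_bins, length_dim)
        # Binary classifier over the fused pair representation.
        self.classifier = nn.Sequential(
            nn.Linear(2 * (hidden + length_dim), 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def encode(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool token states into a single sentence vector (simplified pooling).
        mask = attention_mask.unsqueeze(-1).float()
        semantic = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        # Token count, clipped into a fixed number of bins, as the length feature.
        length_bin = attention_mask.sum(1).clamp(max=self.length_embedding.num_embeddings - 1)
        return torch.cat([semantic, self.length_embedding(length_bin.long())], dim=-1)

    def forward(self, src, tgt):
        # Both sides pass through the same encoder, so the weights are shared.
        h_src = self.encode(src["input_ids"], src["attention_mask"])
        h_tgt = self.encode(tgt["input_ids"], tgt["attention_mask"])
        return self.classifier(torch.cat([h_src, h_tgt], dim=-1)).squeeze(-1)


# Example usage on a hypothetical English-Burmese candidate pair.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = LengthAwareSiameseEncoder()
src = tokenizer(["How are you?"], return_tensors="pt", padding=True)
tgt = tokenizer(["နေကောင်းလား"], return_tensors="pt", padding=True)
score = torch.sigmoid(model(src, tgt))  # probability that the pair is parallel
```

Bucketing the raw token count into a learned embedding is only one simple way to expose sentence length to the classifier; the paper's own structural-feature encoding and fusion may differ in detail.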