
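Research on the Construction and Application of Paraphrase Parallel Corpus
Cite this article: WANG Yasong, LIU Mingtong, ZHANG Yujie, XU Jin'an, CHEN Yufeng. Research on the Construction and Application of Paraphrase Parallel Corpus[J]. 北京大学学报(自然科学版), 2021, 57(1): 68-74.
Authors: WANG Yasong  LIU Mingtong  ZHANG Yujie  XU Jin'an  CHEN Yufeng
Affiliation: School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
Abstract: Taking Chinese as the object of study, this paper proposes a method for constructing a large-scale, high-quality Chinese paraphrase parallel corpus. Paraphrase data are augmented with translation engines, which transfer an English paraphrase parallel corpus into Chinese, and a Chinese paraphrase evaluation dataset is built manually. Using the constructed Chinese paraphrase data, the effectiveness of the corpus construction and application methods is verified on paraphrase recognition and natural language inference tasks. First, a paraphrase recognition dataset is generated from the paraphrase corpus and used to pre-train an attention-based neural sentence-matching model so that the model captures paraphrase information. The pre-trained model is then applied to the natural language inference task to improve its performance. Evaluation on a public natural language inference dataset shows that the constructed corpus can be applied effectively to paraphrase recognition and that the model can learn paraphrase knowledge. When applied to natural language inference, paraphrase knowledge improves the accuracy of the inference model, which confirms its usefulness for downstream semantic understanding tasks. The proposed construction method does not depend on the language: it can generate high-quality paraphrase data and provide more training data for other languages and domains, improving the performance of other tasks.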

Keywords: paraphrase corpus construction  data augmentation  transfer learning  paraphrase recognition  natural language inference
Received: 2020-06-07

Research on the Construction and Application of Paraphrase Parallel Corpus
WANG Yasong,LIU Mingtong,ZHANG Yujie,XU Jin'an,CHEN Yufeng.Research on the Construction and Application of Paraphrase Parallel Corpus[J].Acta Scientiarum Naturalium Universitatis Pekinensis,2021,57(1):68-74.
Authors:WANG Yasong  LIU Mingtong  ZHANG Yujie  XU Jin'an  CHEN Yufeng
Institution:School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
Abstract:Taking Chinese as the research object, the authors propose a method for constructing large-scale, high-quality Chinese paraphrase parallel corpora. Paraphrase data are augmented with translation engines, which transfer an English paraphrase parallel corpus into Chinese, and a Chinese paraphrase evaluation dataset is annotated manually. Using the constructed Chinese paraphrase data, the effectiveness of the corpus construction and application methods is verified on the paraphrase recognition and natural language inference tasks. First, a paraphrase recognition dataset is generated from the constructed paraphrase corpus, and an attention-based neural sentence-matching model is pre-trained on it to capture paraphrase information. The pre-trained model is then applied to the natural language inference task to improve its performance. Experimental results on a public natural language inference dataset show that the constructed paraphrase corpus can be applied effectively to paraphrase recognition and that the model can learn paraphrase knowledge. When applied to natural language inference, paraphrase knowledge improves the accuracy of the inference model, which confirms the usefulness of paraphrase knowledge for downstream semantic understanding tasks. The proposed construction method is also language-independent: it can generate high-quality paraphrase data and provide more training data for other languages and domains, further improving the performance of other tasks.
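The two data-construction steps named in the abstract (transferring English paraphrase pairs through a translation engine, then deriving a paraphrase recognition dataset from the resulting pairs) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: `translate` is a hypothetical stand-in for whatever translation engine is used, the filter that drops pairs collapsing into identical sentences is an assumption, and the random mismatching used to create non-paraphrase examples is likewise an assumption, since the abstract does not say how negative examples are obtained.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]          # (sentence, paraphrase)
Example = Tuple[str, str, int]  # (sentence_a, sentence_b, label); 1 = paraphrase


def transfer_pairs(pairs: List[Pair], translate: Callable[[str], str]) -> List[Pair]:
    """Translate both sides of each source-language paraphrase pair.

    `translate` is a hypothetical callable wrapping a translation engine.
    """
    transferred = []
    for a, b in pairs:
        ta, tb = translate(a), translate(b)
        # Assumed filter: skip empty outputs and pairs that become identical.
        if ta and tb and ta != tb:
            transferred.append((ta, tb))
    return transferred


def build_identification_set(pairs: List[Pair], seed: int = 0) -> List[Example]:
    """Turn paraphrase pairs into labeled recognition examples.

    Positives are the pairs themselves. Negatives (an assumption, not taken
    from the abstract) pair each sentence with the paraphrase of a different,
    randomly chosen pair.
    """
    rng = random.Random(seed)
    examples: List[Example] = [(a, b, 1) for a, b in pairs]
    if len(pairs) > 1:
        for i, (a, _) in enumerate(pairs):
            j = rng.randrange(len(pairs) - 1)
            if j >= i:  # shift to skip index i, so j != i always holds
                j += 1
            examples.append((a, pairs[j][1], 0))
    rng.shuffle(examples)
    return examples
```

Under these assumptions, the labeled examples would serve as pre-training data for a sentence-matching model, whose weights are then reused for natural language inference as the abstract describes.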
Keywords:paraphrase corpus construction  data augmentation  transfer learning  paraphrase recognition  natural language inference
  
This article is indexed in CNKI, Wanfang Data, and other databases.