
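Chinese-Japanese multi-word phrase extraction and alignment based on multi-strategy filtering
Cite this article:TANG Liang, LI Qian, XU Hong-bo, YI Mian-zhu. Chinese-Japanese multi-word phrase extraction and alignment based on multi-strategy filtering[J]. Journal of Shandong University (Natural Science), 2015, 50(9): 21-28. DOI: 10.6040/j.issn.1671-9352.3.2014.016
Authors:TANG Liang  LI Qian  XU Hong-bo  YI Mian-zhu
Affiliation:1. Department of Language Engineering, Luoyang University of Foreign Languages, Luoyang 471003, Henan, China;
2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100049, China
Funding:National Basic Research Program of China (973 Program) (2014CB340400, 2012CB316303); Key Program of the National Natural Science Foundation of China (61232010); General Program of the National Natural Science Foundation of China (61173064); National Key Technology R&D Program of China (2012BAH39B04)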
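Abstract:In cross-language text analysis tasks, multi-word phrases are less ambiguous than single words and express meaning more precisely, which helps improve the accuracy of text understanding. Existing methods focus mainly on cross-language alignment of single words. This work merges multi-word phrase extraction with cross-language alignment and proposes a Chinese-Japanese multi-word phrase extraction and alignment method based on multi-strategy filtering. Starting from one language, candidates are extracted and filtered by repeated strings, left-right adjacent entropy, internal association strength, multi-word nesting, stop words, and similar criteria to obtain multi-word phrases with complete semantics; a parallel corpus is then used to compute the similarity between Chinese and Japanese multi-word phrases and so align them across languages. Throughout the process, Japanese language rules and characteristics can be incorporated, and the filtering thresholds can be adjusted dynamically according to corpus size and domain, which improves the domain applicability of the extracted phrases. Experimental results show that the method extracts Chinese-Japanese multi-word phrases effectively and aligns them accurately; with multi-word phrases as the alignment unit, the semantic expression is more complete and the practical value is greater.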

Keywords:parallel corpus  multi-word phrase  word alignment
Received:2015-03-03

Chinese-Japanese multi-word phrase extraction and alignment based on multi-strategy filtering
TANG Liang,LI Qian,XU Hong-bo,YI Mian-zhu. Chinese-Japanese multi-word phrase extraction and alignment based on multi-strategy filtering[J]. Journal of Shandong University, 2015, 50(9): 21-28. DOI: 10.6040/j.issn.1671-9352.3.2014.016
Authors:TANG Liang  LI Qian  XU Hong-bo  YI Mian-zhu
Affiliation:1. Department of Language Engineering, Luoyang University of Foreign Languages, Luoyang 471003, Henan, China;
2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100049, China
Abstract:In cross-language text analysis, a multi-word phrase is less ambiguous and semantically more precise than a single word, which helps improve the accuracy of text understanding. Existing methods mainly focus on cross-language alignment of single words. This paper combines multi-word phrase extraction with cross-language alignment and presents a Chinese-Japanese multi-word phrase extraction and alignment method based on multi-strategy filtering. First, starting from one language, candidate phrases are extracted and filtered using repeated strings, left-right adjacent entropy, internal association strength, multi-word nesting, and stop words, yielding multi-word phrases with complete semantics. Second, a parallel corpus is used to compute the similarity between Chinese and Japanese multi-word phrases and thereby align them across languages. Throughout the process, the rules and characteristics of the Japanese language can be taken into account, and the filtering thresholds are adjusted dynamically according to corpus size and domain, which improves the domain applicability of the extracted phrases. The experimental results show that the method extracts Chinese-Japanese multi-word phrases effectively and aligns them accurately; using multi-word phrases as the alignment unit gives more complete semantic expression and greater practical value.
Keywords:parallel corpus  multi-word phrase  word alignment  
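
As a rough illustration of two steps named in the abstract, the sketch below computes the left or right adjacent entropy of a candidate phrase and a co-occurrence similarity between a Chinese and a Japanese phrase over a parallel corpus. The abstract does not specify the formulas or thresholds used in the paper, so the Dice coefficient, the function names `adjacent_entropy` and `dice_similarity`, the substring matching, and the toy data are all assumptions made for illustration only.

```python
import math
from collections import Counter


def adjacent_entropy(tokens, phrase, side):
    """Entropy (in bits) of the tokens seen immediately to the left or
    right of `phrase` in `tokens`. A high value suggests the phrase
    boundary is free, i.e. the phrase is not a fragment of a longer unit."""
    phrase = tuple(phrase)
    n = len(phrase)
    neighbours = Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == phrase:
            j = i - 1 if side == "left" else i + n
            if 0 <= j < len(tokens):
                neighbours[tokens[j]] += 1
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return sum(c / total * math.log2(total / c) for c in neighbours.values())


def dice_similarity(zh_phrase, ja_phrase, sentence_pairs):
    """Dice coefficient over the aligned sentence pairs in which each
    phrase occurs (assumed similarity measure, not taken from the paper).
    Phrases are matched as plain substrings of space-joined sentences."""
    zh_hits = {k for k, (zh, _) in enumerate(sentence_pairs) if zh_phrase in zh}
    ja_hits = {k for k, (_, ja) in enumerate(sentence_pairs) if ja_phrase in ja}
    if not zh_hits and not ja_hits:
        return 0.0
    return 2 * len(zh_hits & ja_hits) / (len(zh_hits) + len(ja_hits))


# Toy data, invented purely for illustration.
tokens = "a b c a b d a b c".split()
pairs = [
    ("机器 翻译 系统", "機械 翻訳 システム"),
    ("机器 翻译 很 难", "機械 翻訳 は 難しい"),
    ("今天 天气 好", "今日 は 天気 が いい"),
]

right_h = adjacent_entropy(tokens, ["a", "b"], "right")
left_h = adjacent_entropy(tokens, ["a", "b"], "left")
score = dice_similarity("机器 翻译", "機械 翻訳", pairs)
```

In a filter of this kind, a candidate would be kept only if both entropies exceed a threshold, and a phrase pair would be aligned only if its similarity exceeds another; the abstract states that such thresholds are tuned to corpus size and domain, so no fixed values are suggested here.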
This article is indexed in 万方数据 (Wanfang Data) and other databases.