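Building a Russian-Chinese comparable corpus based on Wikipedia and its comparability calculation
Cite this article: YUAN Wei, YI Mian-zhu. Building a Russian-Chinese comparable corpus based on Wikipedia and its comparability calculation[J]. Journal of Shandong University (Natural Science), 2017, 52(9): 1-6. DOI: 10.6040/j.issn.1671-9352.0.2017.095
Authors: YUAN Wei (原伟)  YI Mian-zhu (易绵竹)
Affiliations: 1. Post-Doctoral Research Station, Shanghai International Studies University, Shanghai 200083, China; 2. Language Engineering Department, PLA University of Foreign Languages, Luoyang 471003, Henan, China
Funding: National Social Science Fund of China (14CYY051); China Postdoctoral Science Foundation General Program (2017M610268)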
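Abstract: Owing to their inherent advantages and wide range of uses, comparable corpora have become one of the active directions in corpus research, yet Russian-Chinese comparable corpora have so far received no attention from researchers in China. After reviewing related work in China and abroad, this paper designs an approach for building a Russian-Chinese comparable corpus from Wikipedia, develops a system for automatic text acquisition, and builds a Russian-Chinese comparable corpus aligned at the document level, with a total size on the order of a million characters (words). Finally, the comparability of the Russian and Chinese texts is computed with a cross-language similarity method. The results show that the method effectively acquires Russian-Chinese texts with high comparability, and that the resulting corpus can be used in Russian-Chinese translation, discourse analysis and computational linguistics research.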

Keywords: comparable corpus  Russian  Wikipedia
Received: 2017-03-16

Building a Russian-Chinese comparable corpus based on Wikipedia and its comparability calculation
YUAN Wei,YI Mian-zhu. Building a Russian-Chinese comparable corpus based on Wikipedia and its comparability calculation[J]. Journal of Shandong University, 2017, 52(9): 1-6. DOI: 10.6040/j.issn.1671-9352.0.2017.095
Authors:YUAN Wei  YI Mian-zhu
Affiliation:1. Post-Doctoral Research Station of Shanghai International Studies University, Shanghai 200083, China;2. Language Engineering Department, PLA University of Foreign Languages, Luoyang 471003, Henan, China
Abstract:Russian and Chinese corpus research urgently needs new breakthroughs in data sources, research angles and applications. Comparable corpora are one of the research hotspots in corpus linguistics and natural language processing, yet so far there has been no study of Russian-Chinese comparable corpora in China. This paper reviews the existing achievements in this area, designs a method for constructing a Russian-Chinese comparable corpus based on Wikipedia, develops a system for automatically acquiring comparable texts, and builds a Russian-Chinese comparable corpus that contains more than a million words. Finally, the comparability of the corpus is evaluated using cross-language similarity calculation methods. The results demonstrate that the method can effectively obtain Russian-Chinese comparable texts with high comparability, and that the corpus can be used for translation, discourse analysis and computational linguistics studies.
Keywords:Russian  comparable corpora  Wikipedia  
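
The abstract does not state which cross-language similarity measure was used, so the following is only an illustrative sketch of one common approach, not the authors' method: Russian tokens are mapped into Chinese through a bilingual lexicon and the two bags of words are compared by cosine similarity. The function names, the toy lexicon and the assumption that both texts are already tokenized are all hypothetical.

```python
import math
from collections import Counter


def translate_tokens(tokens, lexicon):
    """Map source tokens to the target language; tokens missing from the lexicon are dropped."""
    return [lexicon[t] for t in tokens if t in lexicon]


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of words (0.0 if either is empty)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def comparability(ru_tokens, zh_tokens, ru_zh_lexicon):
    """Score a Russian-Chinese document pair in [0, 1] (assumed measure, see lead-in)."""
    return cosine(translate_tokens(ru_tokens, ru_zh_lexicon), zh_tokens)


# Toy lexicon for illustration only; a real system would need a full bilingual dictionary.
TOY_LEXICON = {"москва": "莫斯科", "город": "城市", "россия": "俄罗斯"}

if __name__ == "__main__":
    score = comparability(["москва", "город"], ["莫斯科", "北京"], TOY_LEXICON)
    print(f"comparability: {score:.2f}")
```

A document pair linked through Wikipedia's interlanguage links could be scored this way and kept only if the score exceeds a chosen threshold; the threshold itself would have to be tuned on real data.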
This article is indexed in CNKI and other databases.