
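Research on a Multi-Strategy Fusion Method for Extracting Terms from Russian Texts
Cite this article:TANG Juxiang, SUN Yihui, LIAO Xiao, LIU Jianguo, YU Juan. Research on a Multi-Strategy Fusion Method for Extracting Terms from Russian Texts[J]. 中国科技术语 (Chinese Science and Technology Terms Journal), 2021, 23(3): 59-67.
Authors:TANG Juxiang  SUN Yihui  LIAO Xiao  LIU Jianguo  YU Juan
Affiliations:1. School of Economics and Management, Fuzhou University, Fuzhou, Fujian 350108; 2. School of Internet Finance and Information Engineering, Guangdong University of Finance, Guangzhou, Guangdong 510521; 3. Institute of Accounting and Finance, Shanghai University of Finance and Economics, Shanghai 200433
Funding:National Natural Science Foundation of China project "Research on Organizational Heterogeneous Data Fusion Methods Based on Ontology Learning and Ontology Mapping" (71771054)
Abstract:Russian is one of the working languages of the United Nations and an official language of several countries, including Russia. As the Belt and Road Initiative advances and globalization accelerates, Russian text data has become an important source of information for the management decisions of the organizations concerned, and Russian text mining has accordingly become an important decision-support method. Research on Russian text mining methods, however, is still far from mature. In particular, its key foundation, the extraction of words and phrases from Russian text, performs poorly, which limits the accuracy of Russian text modeling. The article therefore proposes a multi-strategy fusion method for Russian term extraction that combines Russian part-of-speech analysis, grammatical rules, and string frequency statistics to automatically extract Russian terms, covering both single words and phrases. Experiments on the United Nations Parallel Corpus and the Taiga Corpus show that the proposed method achieves accuracy above 85% while maintaining high recall, significantly outperforming the commonly used n-gram method, and that it can supply an effective lexicon for Russian text mining applications such as topic discovery, text classification, and text clustering.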

Keywords:Russian text mining  term extraction  part-of-speech tagging  frequent word strings
Received:2021-05-11

Extracting Terms from Russian Texts Based on Multi Strategies
TANG Juxiang,SUN Yihui,LIAO Xiao,LIU Jianguo,YU Juan.Extracting Terms from Russian Texts Based on Multi Strategies[J].Chinese Science and Technology Terms Journal,2021,23(3):59-67.
Authors:TANG Juxiang  SUN Yihui  LIAO Xiao  LIU Jianguo  YU Juan
Abstract:Russian is one of the working languages of the United Nations and an official language of several countries, including Russia. With the advancement of the Belt and Road Initiative and the acceleration of globalization, Russian text data has become an important information source for the managerial decision-making of related organizations, and Russian text mining has thus become an important decision-support method. However, Russian text mining methods are still far from mature; in particular, the low performance of Russian term extraction, their key foundation, limits the accuracy of Russian text modeling. This paper proposes a Russian term extraction method that combines multiple strategies, including Russian POS analysis, grammatical rules, and string frequency statistics, to automatically extract Russian words and multiword expressions. Experiments on the United Nations Parallel Corpus and the Taiga Corpus show that the proposed method achieves an accuracy above 85% while maintaining high recall, significantly outperforming the commonly used n-gram method. The proposed method can be used to create lexicons for Russian text mining applications such as text topic discovery, text classification, and text clustering.
Keywords:Russian text mining  term extraction  POS tag  frequent word-string  
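
The abstract names three ingredients: POS analysis, grammatical rules, and string frequency statistics. The sketch below illustrates how such ingredients might be combined in general; it is not the authors' implementation. It assumes the text has already been tokenized and POS-tagged (the tags in `SAMPLE` were assigned by hand), and the pattern set `PATTERNS`, the function name `extract_terms`, and the thresholds `min_freq` and `max_len` are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical POS patterns for candidate terms (not taken from the paper):
# a single noun, adjective + noun, and noun + noun.
PATTERNS = {("NOUN",), ("ADJ", "NOUN"), ("NOUN", "NOUN")}


def extract_terms(tagged_sentences, min_freq=2, max_len=2):
    """Return {term: count} for word strings that match a POS pattern
    and occur at least min_freq times across all sentences."""
    counts = Counter()
    for sentence in tagged_sentences:
        for n in range(1, max_len + 1):
            for i in range(len(sentence) - n + 1):
                window = sentence[i:i + n]
                tags = tuple(tag for _, tag in window)
                # Grammatical filter: keep only windows whose tag sequence
                # matches one of the assumed patterns.
                if tags in PATTERNS:
                    term = " ".join(word.lower() for word, _ in window)
                    counts[term] += 1
    # Frequency filter: keep only strings seen often enough.
    return {term: c for term, c in counts.items() if c >= min_freq}


# Hand-tagged toy input; a real pipeline would obtain tags from a POS tagger.
SAMPLE = [
    [("Генеральная", "ADJ"), ("ассамблея", "NOUN"),
     ("принимает", "VERB"), ("резолюцию", "NOUN")],
    [("Генеральная", "ADJ"), ("ассамблея", "NOUN"),
     ("обсуждает", "VERB"), ("доклад", "NOUN")],
]

if __name__ == "__main__":
    for term, count in sorted(extract_terms(SAMPLE).items()):
        print(term, count)
```

Unlike a plain n-gram count, the POS filter here discards windows such as verb + noun before counting. The sketch counts surface forms only; because Russian is highly inflected, a practical system would likely need to lemmatize or otherwise normalize word forms first.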
This article is indexed in CNKI, Wanfang Data, and other databases.