
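A method for extracting Chinese multiword expressions from Web text
Cite this article: GONG Shuang-shuang, CHEN Yu-feng, XU Jin-an, ZHANG Yu-jie. A method for extracting Chinese multiword expressions from Web text[J]. Journal of Shandong University (Natural Science), 2018, 53(9): 40-48.
Authors: GONG Shuang-shuang  CHEN Yu-feng  XU Jin-an  ZHANG Yu-jie
Affiliation: School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
Funding: National Natural Science Foundation of China (61473294, 61370130); Beijing Natural Science Foundation (4172047); Fundamental Research Funds for the Central Universities (2015JBM033)
Abstract: Multiword expressions (MWEs) are fixed or semi-fixed collocational units in natural language. They occur especially often in Web text, where they pose a major challenge for word segmentation and downstream text understanding. This paper therefore proposes a double-layer extraction strategy for recognizing MWEs in Web text. In the first layer, an algorithm that combines left and right entropy with enhanced mutual information performs the initial extraction of MWEs. In the second layer, starting from the candidate list produced by the first layer, an SVM classifier built on context and word-vector features separates MWEs from non-MWEs and thereby filters the candidate list further. In experiments on a corpus of 5,000 Weibo posts, the first layer reaches an F value of 84.92% for MWEs and the second layer reaches 89.58%, a large improvement over the baseline system. The results show that the double-layer strategy extracts MWEs from Web text effectively and also improves word segmentation results.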
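The first-layer statistics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the enhanced mutual information formula, so ordinary pointwise mutual information (PMI) stands in for it here, and the names `branching_entropy` and `score_bigrams`, the boundary markers `<s>` and `</s>`, and the restriction to two-word candidates are all assumptions made for illustration. Input is assumed to be already segmented into words.

```python
import math
from collections import Counter


def branching_entropy(neighbors):
    """Shannon entropy (in bits) of a Counter of neighboring tokens."""
    total = sum(neighbors.values())
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


def score_bigrams(sentences):
    """Score every adjacent word pair in pre-segmented sentences.

    Returns {(w1, w2): (pmi, left_entropy, right_entropy)}.
    PMI is a stand-in for the paper's enhanced mutual information.
    """
    unigrams = Counter()
    bigrams = Counter()
    left = {}   # pair -> Counter of tokens seen immediately before the pair
    right = {}  # pair -> Counter of tokens seen immediately after the pair
    for tokens in sentences:
        # Hypothetical boundary markers so that every pair has two neighbors.
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(tokens)
        for i in range(1, len(padded) - 2):
            pair = (padded[i], padded[i + 1])
            bigrams[pair] += 1
            left.setdefault(pair, Counter())[padded[i - 1]] += 1
            right.setdefault(pair, Counter())[padded[i + 2]] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        pmi = math.log2(p_pair / (p_w1 * p_w2))
        scores[(w1, w2)] = (
            pmi,
            branching_entropy(left[(w1, w2)]),
            branching_entropy(right[(w1, w2)]),
        )
    return scores


if __name__ == "__main__":
    # Toy corpus with placeholder tokens instead of real Weibo text.
    corpus = [["a", "b", "c"], ["d", "b", "c"], ["a", "b", "e"]]
    for pair, (pmi, h_left, h_right) in sorted(score_bigrams(corpus).items()):
        print(pair, round(pmi, 3), round(h_left, 3), round(h_right, 3))
```

A candidate list would then keep pairs whose association score and whose left and right entropies all exceed chosen thresholds: high association suggests the words belong together, and high entropy on both sides suggests the pair occurs in varied contexts as a unit. The thresholds are tuning parameters that the abstract does not specify.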

Keywords: multiword expressions; left and right entropy; word segmentation; enhanced mutual information; SVM
Received: 2017-12-12

Extraction of Chinese multiword expressions based on Web text
GONG Shuang-shuang, CHEN Yu-feng, XU Jin-an, ZHANG Yu-jie. Extraction of Chinese multiword expressions based on Web text[J]. Journal of Shandong University (Natural Science), 2018, 53(9): 40-48.
Authors: GONG Shuang-shuang  CHEN Yu-feng  XU Jin-an  ZHANG Yu-jie
Institution: College of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
Abstract: A multiword expression (MWE) is a fixed or semi-fixed collocation in natural language. MWEs appear especially often in Web text, which poses a great challenge to word segmentation and subsequent text comprehension. We therefore propose a double-layer extraction strategy for recognizing MWEs. In the first layer, we use the LRE+EMI algorithm (left and right entropy combined with enhanced mutual information) for the initial extraction of MWEs. In the second layer, starting from the candidate list obtained in the first layer, we use an SVM classifier with context and word-vector features to classify candidates as MWEs or non-MWEs and thereby filter the list further. In experiments on 5,000 Weibo posts, the F value for MWEs reached 84.92% in the first layer and 89.58% in the second layer, a large improvement over the baseline system. The experimental results show that the double-layer extraction strategy extracts MWEs effectively and improves segmentation results.
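The abstract reports F values without defining them. A minimal sketch of the usual computation follows, assuming the balanced F-measure (harmonic mean of precision and recall) over sets of extracted and gold-standard MWEs; the function name `f_measure` and the example strings are hypothetical, and the paper may evaluate differently.

```python
def f_measure(extracted, gold, beta=1.0):
    """Return (precision, recall, F) for extracted items against a gold set.

    beta=1.0 gives the balanced F-measure; this choice is an assumption,
    since the abstract only says "F value".
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    if true_positives == 0:
        return 0.0, 0.0, 0.0
    precision = true_positives / len(extracted)
    recall = true_positives / len(gold)
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


if __name__ == "__main__":
    # Placeholder candidates; real input would be extracted and annotated MWEs.
    system_output = ["mwe_1", "mwe_2", "not_mwe_1", "not_mwe_2"]
    annotated = ["mwe_1", "mwe_2", "mwe_3"]
    print(f_measure(system_output, annotated))
```

Under this reading, a second-layer filter that discards false candidates raises precision, which is one way the F value can rise from the first layer to the second.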
Keywords: SVM; MWEs; left and right entropy; enhanced mutual information; word segmentation
This article is indexed in CNKI and other databases.