
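A Study of Consistency Checking for Chinese Word Segmentation Based on Contextual Similarity
Cite this article:LIU Wei,HUANG Kaiyu,YU Hao,HUANG Degen.A Study of Consistency Checking for Chinese Word Segmentation Based on Contextual Similarity[J].Acta Scientiarum Naturalium Universitatis Pekinensis,2022,58(1):99-105.
Authors:LIU Wei  HUANG Kaiyu  YU Hao  HUANG Degen
Affiliation:School of Computer Science and Technology, Dalian University of Technology, Dalian 116023
Funding:Supported by the National Natural Science Foundation of China (U1936109, 61672127)
Abstract:This paper proposes a consistency-checking method for Chinese word segmentation based on contextual similarity. Classification rules based on word formation, part of speech, and dependency syntax are first designed from lexical and syntactic features. Pretrained word embeddings are then used to encode the semantic information of the contexts in which inconsistent strings occur, and the inconsistent strings are classified by the semantic similarity between those contexts. Consistency checking on a manually constructed segmentation corpus of 360,000 characters shows that the method effectively improves the accuracy of consistency checking for Chinese word segmentation. Furthermore, three mainstream Chinese word segmentation models were retrained and tested on the corpus after its consistency had been corrected. The results show that the method effectively improves the quality of the segmentation corpus: the F1 scores of the three models rose by 1.18%, 1.25%, and 1.04%, respectively.

Keywords:Chinese word segmentation  consistency check  corpus construction  contextual similarity
Received:2021-06-08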

Consistency Check for Chinese Word Segmentation via Contextual Similarity
LIU Wei,HUANG Kaiyu,YU Hao,HUANG Degen.Consistency Check for Chinese Word Segmentation via Contextual Similarity[J].Acta Scientiarum Naturalium Universitatis Pekinensis,2022,58(1):99-105.
Authors:LIU Wei  HUANG Kaiyu  YU Hao  HUANG Degen
Institution:School of Computer Science and Technology, Dalian University of Technology, Dalian 116023
Abstract:The authors propose a consistency-check method for Chinese word segmentation based on contextual similarity. First, classification rules based on word formation, part of speech, and dependency syntax are designed from morphological and syntactic features. Then, the semantic information of the contexts in which inconsistent strings occur is encoded with pretrained word embeddings, and the inconsistent strings are classified by the semantic similarity between those contexts. Experimental results show that the proposed method effectively improves the accuracy of consistency checking for Chinese word segmentation. Further, three mainstream Chinese word segmentation models are retrained and tested on the revised segmentation corpus. The results show that the proposed method effectively improves the quality of the corpus: the F1 scores of the three models improve by 1.18%, 1.25%, and 1.04%, respectively.
Keywords:Chinese word segmentation  consistency check  corpus construction  contextual similarity
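
The similarity step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a context is encoded by averaging the embeddings of its tokens, that contexts are compared by cosine similarity, and that two occurrences of the same string with different segmentations are flagged when their contexts are highly similar. The function names, the threshold of 0.8, and the toy three-dimensional vectors are all hypothetical.

```python
import math

def context_vector(tokens, embeddings, dim):
    """Average the embeddings of the context tokens (assumed encoding scheme)."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # skip out-of-vocabulary tokens
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total  # all-zero vector when no token is known
    return [x / count for x in total]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def classify_pair(ctx_a, ctx_b, embeddings, dim, threshold=0.8):
    """Label two differently segmented occurrences of the same string.

    Assumption: similar contexts suggest the string should have been segmented
    the same way (a likely annotation inconsistency); dissimilar contexts
    suggest the difference may be legitimate. The threshold is hypothetical.
    """
    sim = cosine(context_vector(ctx_a, embeddings, dim),
                 context_vector(ctx_b, embeddings, dim))
    label = "likely_inconsistent" if sim >= threshold else "likely_legitimate"
    return label, sim

# Toy embeddings for illustration only; real use would load pretrained vectors.
TOY_EMBEDDINGS = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    print(classify_pair(["a"], ["b"], TOY_EMBEDDINGS, 3))
    print(classify_pair(["a"], ["c"], TOY_EMBEDDINGS, 3))
```

In practice the rule-based stage mentioned in the abstract (word formation, part of speech, dependency syntax) would run before this step, and the threshold would need to be tuned on annotated data.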