
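Character Vector Representation Method Combining Chinese Character Glyph and Character Semantics
Cite this article:Tang Shancheng,Zhang Xue,Zhang Puyue,Wang Hanbo,Chen Ming.Character Vector Representation Method Combining Chinese Character Glyph and Character Semantics[J].Science Technology and Engineering,2021,21(32):13787-13792.
Authors:Tang Shancheng  Zhang Xue  Zhang Puyue  Wang Hanbo  Chen Ming
Affiliation:School of Communication and Information Engineering, Xi'an University of Science and Technology
Funding:National Key Research and Development Program of China (2018YFC0808300); Shaanxi Province Science and Technology Program, Key Industry Innovation Chain (Cluster) Project (2020ZDLGY15-07); Xi'an Science and Technology Program, Science and Technology Innovation Guidance Project (201805036YD14CG20(4))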
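Abstract:The quality of character vector representations has an important influence on Chinese text processing methods. The commonly used Chinese character vector representation methods Word2Vec and GloVe perform well in many tasks, but they have several problems: vector quality depends on the training dataset, stability is poor, the semantic information implied by the overall glyph structure of Chinese characters is not considered, and the linguistic knowledge contained in dictionaries is not used. To overcome these shortcomings, this paper first uses a glyph autoencoder to automatically capture the semantics carried by character glyphs, then uses a meaning autoencoder to extract the stable character meaning information contained in a dictionary, and proposes GnM2Vec (Glyph and Meaning to Vector), a character vector representation method that fuses Chinese character glyphs and meanings. The results show that GnM2Vec achieves good results on three tasks: nearest-neighbor character computation, Chinese named entity recognition, and Chinese word segmentation. In named entity recognition, its F1 value is higher than that of GloVe, Word2Vec, and G2Vec (based on glyph vectors) by 2.25, 0.05, and 0.3 respectively; in Chinese word segmentation, its F1 value is higher by 0.3, 0.14, and 0.33 respectively. The stability of the character vectors is also improved.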

Keywords:character vector representation    character glyph    character semantics    convolutional autoencoder    natural language processing
Received:2021/4/25
Revised:2021/9/21

Character Vector Representation Method Combining Chinese Character Glyph and Character Semantics
Tang Shancheng,Zhang Xue,Zhang Puyue,Wang Hanbo,Chen Ming.Character Vector Representation Method Combining Chinese Character Glyph and Character Semantics[J].Science Technology and Engineering,2021,21(32):13787-13792.
Authors:Tang Shancheng  Zhang Xue  Zhang Puyue  Wang Hanbo  Chen Ming
Institution:School of Communication and Information Engineering, Xi'an University of Science and Technology
Abstract:The quality of character vector representations has an important influence on Chinese text processing methods. Commonly used Chinese character vector representation methods such as Word2Vec and GloVe perform well in many tasks, but they have several problems: vector quality depends on the training dataset, stability is poor, the semantic information implied by the overall glyph structure of Chinese characters is not considered, and the linguistic knowledge contained in dictionaries is not used. To overcome these shortcomings, this paper first uses a glyph autoencoder to automatically capture the semantics carried by character glyphs, then uses a meaning autoencoder to extract the stable character meaning information contained in a dictionary, and proposes GnM2Vec (Glyph and Meaning to Vector), a character vector representation method that fuses Chinese character glyphs and meanings. The results show that GnM2Vec achieves good results on three tasks: nearest-neighbor character computation, Chinese named entity recognition, and Chinese word segmentation. Compared with GloVe, Word2Vec, and G2Vec (based on glyph vectors), its F1 value in named entity recognition is higher by 2.25, 0.05, and 0.3 respectively; in Chinese word segmentation, its F1 value is higher by 0.3, 0.14, and 0.33 respectively. The stability of the character vectors is also improved.
Keywords:character vector representation    character glyph    character semantics    convolutional autoencoder    natural language processing
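
The abstract describes combining a glyph-derived vector with a dictionary-derived meaning vector and evaluating the result on nearest-neighbor character computation. The sketch below illustrates that idea only in outline. It assumes, without confirmation from the abstract, that the two vectors are fused by concatenating their L2-normalized forms and that neighbors are ranked by cosine similarity. The function names (`fuse`, `nearest_neighbors`) and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(glyph_vec, meaning_vec):
    """Assumed fusion rule: concatenate the normalized glyph and meaning vectors."""
    return l2_normalize(glyph_vec) + l2_normalize(meaning_vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def nearest_neighbors(query, vectors, k=3):
    """Return the k characters most similar to `query`, best match first."""
    scored = [
        (ch, cosine(vectors[query], vec))
        for ch, vec in vectors.items()
        if ch != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy, hand-written vectors for illustration only (not outputs of any autoencoder).
glyph = {"河": [0.9, 0.1], "湖": [0.8, 0.2], "树": [0.1, 0.9]}
meaning = {"河": [0.7, 0.3, 0.0], "湖": [0.6, 0.4, 0.0], "树": [0.0, 0.2, 0.9]}

fused = {ch: fuse(glyph[ch], meaning[ch]) for ch in glyph}
neighbors = nearest_neighbors("河", fused, k=2)
```

With these toy inputs, `neighbors` ranks 湖 ahead of 树 for the query 河, because both its glyph and meaning components point in nearly the same direction. Normalizing each part before concatenation keeps either source from dominating the similarity score purely because of its scale; a real system might instead weight or learn the combination.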