
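Chinese Character Representation Based on the 相码 Model
Cite this article: Fan Xiaoming, Wang Binjun. Chinese Character Representation Based on the 相码 Model [J]. Science Technology and Engineering (科学技术与工程), 2021, 21(5): 1937-1947.
Authors: Fan Xiaoming, Wang Binjun
Affiliations: School of Information Technology and Cyber Security, People's Public Security University of China, Beijing 100038; Department of Cyber Security Protection, Beijing Police College, Beijing 102202; School of Information Technology and Cyber Security, People's Public Security University of China, Beijing 100038
Funding: Beijing Social Science Fund project; bureau-level research project of the Beijing Municipal Public Security Bureau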
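Abstract: To address the out-of-vocabulary (OOV) word problem in Chinese natural language processing tasks, fine-grained features of Chinese characters such as strokes, radicals, and pinyin are often used to improve a model's learning ability. To find the best combination of such features, this paper uses statistical methods to study the syllable, first stroke, radical, tone, word frequency, stroke count, and other features of Chinese characters, and proposes a cross-quadrant mnemonic mapping model that can fuse multiple character features, namely the 相码 model. The model automatically provides a reversible mapping between Chinese characters and words on one side and alphabetic codes on the other. In text classification experiments with a character-level model, the results were satisfactory. In addition, the codes generated by the model are of moderate length and remain readable, so they can be used for text annotation in special settings and can also supply an equal amount of parallel corpus data for Chinese text. The 相码 model is therefore a fairly good auxiliary model for natural language processing.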

Keywords: Chinese character representation; mnemonic; encoding; mapping
Received: 2020/5/6
Revised: 2021/2/5

Characterization of Chinese Characters Based on Cross Quadrant Mnemonic Mapping Model
Fan Xiaoming, Wang Binjun. Characterization of Chinese Characters Based on Cross Quadrant Mnemonic Mapping Model [J]. Science Technology and Engineering, 2021, 21(5): 1937-1947.
Authors: Fan Xiaoming, Wang Binjun
Abstract: To address the out-of-vocabulary (OOV) problem in Chinese natural language processing, fine-grained features of Chinese characters such as strokes, radicals, and Pinyin are often used to improve a model's learning ability. To find the best combination of these features, this paper uses statistical methods to study the syllable, first stroke, radical, tone, word frequency, stroke count, and other features of Chinese characters, and proposes a cross-quadrant mnemonic mapping model that can integrate multiple character features. The model automatically provides a reversible mapping between Chinese characters and words and codes written with the 26 Latin letters. In a text classification experiment with a character-level model, the results were satisfactory. In addition, the codes the model generates are of moderate length and remain readable. They can be used for text annotation in special settings and can also provide an equal amount of parallel corpus data for Chinese text. The model is therefore a fairly good auxiliary model for natural language processing.
Keywords: characterization of Chinese characters; mnemonic; encoding; mapping
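
The abstract describes a reversible mapping between Chinese text and Latin-letter codes built from character features, but does not spell out the coding scheme. The following Python sketch is a toy illustration of that general idea only, not the paper's method: the feature table `FEATURES`, the tone letters `TONE_LETTERS`, the first-stroke letters, and the capital-letter boundary convention are all assumptions made for this example.

```python
import re

# Hypothetical feature table (an assumption for this sketch):
# character -> (toneless pinyin syllable, tone 1-4, first-stroke letter)
# First-stroke letters used here: d = dot, h = horizontal, p = left-falling.
FEATURES = {
    "汉": ("han", 4, "d"),
    "字": ("zi", 4, "d"),
    "表": ("biao", 3, "h"),
    "征": ("zheng", 1, "p"),
}

# Assumed convention: tones 1-4 are written as the letters a-d.
TONE_LETTERS = "abcd"


def encode_char(ch: str) -> str:
    """Build a letter code from syllable, tone, and first stroke."""
    syllable, tone, stroke = FEATURES[ch]
    # Capitalizing the first letter marks where each character's code starts.
    return (syllable + TONE_LETTERS[tone - 1] + stroke).capitalize()


ENCODE = {ch: encode_char(ch) for ch in FEATURES}
DECODE = {code: ch for ch, code in ENCODE.items()}

# The mapping is reversible only if no two characters share a code.
if len(DECODE) != len(ENCODE):
    raise ValueError("code collision: mapping is not reversible")


def encode(text: str) -> str:
    """Map a string of known characters to its letter code."""
    return "".join(ENCODE[ch] for ch in text)


def decode(code: str) -> str:
    """Split the code at capital letters and map each piece back."""
    return "".join(DECODE[tok] for tok in re.findall(r"[A-Z][a-z]*", code))


print(encode("汉字表征"))
print(decode(encode("汉字表征")))
```

With only three features, homophones that share a tone and a first stroke would collide, and the collision check above would reject the table. A practical scheme would need further distinguishing features, such as the radical, stroke count, or frequency that the abstract lists.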
This article is indexed in databases including CNKI and 万方数据 (Wanfang Data).