Algorithm for solving 3-character crossing ambiguities in Chinese word segmentation


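Cite this article:	SUN Maosong, ZUO Zhengping, HUANG Changning. Algorithm for solving 3-character crossing ambiguities in Chinese word segmentation [J]. Journal of Tsinghua University (Science and Technology), 1999, 39(5).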

Authors:	SUN Maosong, ZUO Zhengping, HUANG Changning

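Affiliations:	1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084; 2. State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing 100084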

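Abstract:	Automatic Chinese word segmentation plays a very important role in practical applications of Chinese information processing. Three-character crossing ambiguities are one of the main types of segmentation ambiguity and occur quite frequently in real text. This paper proposes a disambiguation algorithm for this type of ambiguity that avoids part-of-speech information, which is relatively expensive to train, and uses only word probability information together with certain sets of common characters that have specific properties. All 5367 distinct three-character crossing ambiguities extracted from a 600,000-character Chinese corpus serve as test samples. Experimental results show that the algorithm achieves a disambiguation accuracy of 92.07%, which largely meets the needs of practical Chinese information processing systems.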
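Keywords:	computational linguistics; Chinese information processing; automatic Chinese word segmentation; crossing segmentation ambiguity; segmentation disambiguation algorithm
Revised manuscript received:	1998-10-18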
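
The abstract refers to three-character crossing ambiguities without spelling out how they are located in text. As a rough illustration only, the Python sketch below assumes the common definition: a span ABC in which both AB and BC are dictionary words. The function name `find_three_char_crossing_ambiguities` and the toy lexicon are hypothetical and are not taken from the paper, which may apply further conditions.

```python
def find_three_char_crossing_ambiguities(text, lexicon):
    """Return (start_index, span) for each 3-character span ABC
    where both AB and BC appear in the lexicon.

    Assumption: this is the textbook definition of a crossing
    ambiguity; the paper's exact criteria are not given in the abstract.
    """
    hits = []
    for i in range(len(text) - 2):
        span = text[i:i + 3]
        # AB is the first two characters, BC the last two.
        if span[:2] in lexicon and span[1:] in lexicon:
            hits.append((i, span))
    return hits


# Hypothetical two-word lexicon, for illustration only.
toy_lexicon = {"的确", "确定"}
ambiguities = find_three_char_crossing_ambiguities("这的确定了", toy_lexicon)
print(ambiguities)  # [(1, '的确定')]
```

Here the span 的确定 can be cut as 的确 | 定 or as 的 | 确定, which is the kind of case the algorithm is meant to resolve.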
Algorithm for solving 3-character crossing ambiguities in Chinese word segmentation

SUN Maosong, ZUO Zhengping, HUANG Changning. Algorithm for solving 3-character crossing ambiguities in Chinese word segmentation [J]. Journal of Tsinghua University (Science and Technology), 1999, 39(5).

Authors:	SUN Maosong, ZUO Zhengping, HUANG Changning

Abstract:	Chinese word segmentation plays an important role in many applications of Chinese information processing. Crossing ambiguities of length 3 characters are one of the major types of segmentation ambiguity and occur frequently in running Chinese text. This paper proposes an algorithm aimed at this type of ambiguity: instead of using part-of-speech statistics, which are comparatively costly to train, the algorithm relies only on word frequency information and on some subsets of common Chinese characters with defined properties. A preliminary experiment on 5367 examples, extracted from a Chinese corpus of 0.6 million characters, shows that the segmentation precision of the algorithm reaches 92.07%, which is satisfactory for practical Chinese information processing systems.

Keywords:	computational linguistics; Chinese information processing; Chinese word segmentation; crossing ambiguities in Chinese word segmentation; disambiguation algorithm for Chinese word segmentation
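
The abstract says the algorithm uses word frequency information, but it does not state the decision rule. The Python sketch below shows one plausible reading, offered as an assumption and not as the authors' method: for a span ABC, compare the two segmentations AB | C and A | BC under a smoothed unigram model and keep the more probable one. The function name, the smoothing scheme, and all counts are hypothetical, and the character-subset component mentioned in the abstract is not modeled.

```python
import math


def resolve_crossing_ambiguity(span, word_counts, total, smoothing=1.0):
    """Pick AB|C or A|BC for a 3-character span using unigram log-probabilities.

    Assumption: a plain unigram comparison with add-one style smoothing;
    the paper's actual rule is not described in the abstract.
    """
    if len(span) != 3:
        raise ValueError("span must be exactly 3 characters long")

    vocab_size = len(word_counts) + 1  # +1 reserves mass for unseen words

    def log_prob(word):
        count = word_counts.get(word, 0)
        return math.log((count + smoothing) / (total + smoothing * vocab_size))

    left_cut = [span[:2], span[2]]   # AB | C
    right_cut = [span[0], span[1:]]  # A | BC
    left_score = sum(log_prob(w) for w in left_cut)
    right_score = sum(log_prob(w) for w in right_cut)
    # Ties fall back to AB | C; this tie-break is arbitrary.
    return left_cut if left_score >= right_score else right_cut


# Invented counts for illustration; not taken from any real corpus.
toy_counts = {"的确": 50, "确定": 400, "的": 30000, "定": 20}
segmentation = resolve_crossing_ambiguity("的确定", toy_counts, total=1_000_000)
print(segmentation)  # ['的', '确定']
```

A context-free rule like this always makes the same choice for a given span, so any case whose correct cut depends on the surrounding sentence would need additional evidence.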
This article is indexed in databases including CNKI and Wanfang Data.
