
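A Reverse-Order Maximum Word-Formation Segmentation Algorithm for Natural Language with Multilingual Support
Cite this article: WANG Zhi-hui, JIANG Jian-guo, ZHANG Qiu-liang. A Reverse-Order Maximum Word-Formation Segmentation Algorithm for Natural Language with Multilingual Support [J]. Science Technology and Engineering, 2007, 7(17): 4311-4315.
Authors: WANG Zhi-hui  JIANG Jian-guo  ZHANG Qiu-liang
Affiliations: 1. School of Computer Science, Xidian University, Xi'an 710071
2. School of Computer Science, North China Electric Power University, Baoding 071003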
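Abstract: A word segmentation algorithm that supports multiple languages is proposed. The algorithm can be understood in three stages. First, source lexicon files in different encodings are converted into Unicode-encoded source lexicon files. Next, a Unicode lexicon index is generated from the Unicode-encoded lexicon files. Finally, the natural-language sentence to be segmented is converted to Unicode and segmented in reverse order according to the index. The algorithm has been implemented in C++. The analysis system built on it automatically detects updates to the lexicon and determines whether the index needs to be updated, and it supports multiple encodings. Its encoding-conversion and segmentation code is platform-independent. Segmentation throughput is above 9 MB/s and accuracy is above 90%.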

Keywords: multilingual; index tree; reverse-order segmentation; maximum word-formation algorithm
Article ID: 1671-1819(2007)17-4311-05
Revision date: 2007-04-26

Maximum Term Segmentation Algorithm in Reverse Order for Multiform Natural Language
WANG Zhi-hui, JIANG Jian-guo, ZHANG Qiu-liang. Maximum Term Segmentation Algorithm in Reverse Order for Multiform Natural Language [J]. Science Technology and Engineering, 2007, 7(17): 4311-4315.
Authors: WANG Zhi-hui  JIANG Jian-guo  ZHANG Qiu-liang
Abstract: A word segmentation algorithm that supports multiple languages is proposed. The algorithm works in three stages. First, source lexicon files in different encodings are converted into Unicode lexicon files. Then, a Unicode lexicon index is generated from the Unicode lexicon files. Finally, the natural-language sentence is converted to Unicode and segmented in reverse order according to the Unicode lexicon index. The algorithm has been implemented in C++, and the resulting system automatically detects changes to the source lexicon files to determine whether the Unicode lexicon index needs to be updated. The system supports a variety of encodings, and its code conversion and word segmentation code is platform-independent. Segmentation throughput is above 9 MB/s, and the accuracy rate is above 90%.
Keywords: multi-language; file index; segmentation in reverse; maximum term
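
The third stage described in the abstract, segmenting a Unicode sentence in reverse order against an index, can be illustrated with a minimal C++ sketch. The abstract does not give the index layout or the exact matching rule, so everything here is an assumption: the index is modeled as a trie over code points that stores each lexicon word reversed, the matching rule is the longest word ending at the current position, and the names `ReverseLexiconIndex`, `longestWordEndingAt`, and `segmentReverse` are hypothetical. Encoding conversion is assumed to have happened already, so input arrives as `std::u32string`.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical index tree (trie) keyed by Unicode code points.
// Each word is stored reversed so a lookup can walk a sentence right to left.
class ReverseLexiconIndex {
public:
    ReverseLexiconIndex() : nodes_(1) {}  // node 0 is the root

    // Insert one lexicon word, last code point first.
    void add(const std::u32string& word) {
        std::size_t node = 0;
        for (auto it = word.rbegin(); it != word.rend(); ++it) {
            auto found = nodes_[node].children.find(*it);
            if (found == nodes_[node].children.end()) {
                nodes_.emplace_back();
                const std::size_t child = nodes_.size() - 1;
                nodes_[node].children[*it] = child;
                node = child;
            } else {
                node = found->second;
            }
        }
        nodes_[node].is_word = true;
    }

    // Length of the longest lexicon word that ends just before index `end`
    // in `text`; returns 0 when no lexicon word ends there.
    std::size_t longestWordEndingAt(const std::u32string& text,
                                    std::size_t end) const {
        std::size_t node = 0;
        std::size_t best = 0;
        for (std::size_t len = 1; len <= end; ++len) {
            auto found = nodes_[node].children.find(text[end - len]);
            if (found == nodes_[node].children.end()) break;
            node = found->second;
            if (nodes_[node].is_word) best = len;
        }
        return best;
    }

private:
    struct Node {
        std::map<char32_t, std::size_t> children;
        bool is_word = false;
    };
    std::vector<Node> nodes_;
};

// Segment `text` from its end toward its start, always taking the longest
// lexicon word that ends at the current position. Code points that match no
// word are emitted as single-character tokens (an assumption of this sketch).
std::vector<std::u32string> segmentReverse(const ReverseLexiconIndex& index,
                                           const std::u32string& text) {
    std::vector<std::u32string> tokens;
    std::size_t end = text.size();
    while (end > 0) {
        std::size_t len = index.longestWordEndingAt(text, end);
        if (len == 0) len = 1;
        tokens.push_back(text.substr(end - len, len));
        end -= len;
    }
    std::reverse(tokens.begin(), tokens.end());  // restore reading order
    return tokens;
}
```

With a lexicon of 研究, 研究生, 生命, 的, and 起源, `segmentReverse` splits 研究生命的起源 into 研究 / 生命 / 的 / 起源, whereas a forward longest match would take 研究生 first and strand 命. Storing the words reversed lets each lookup stop as soon as the trie has no matching child, so the scan never tests candidate lengths that cannot be words.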
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.