A Dictionary-Free Chinese Word Segmentation Method Based on Suffix Arrays
Cite this article: ZHANG Chang-li, HE Feng-ling, ZUO Wan-li. A dictionary-free Chinese word segmentation method based on suffix arrays[J]. Journal of Jilin University (Science Edition), 2004, 42(4): 548-553.
Authors: ZHANG Chang-li  HE Feng-ling  ZUO Wan-li
Institution: College of Computer Science and Technology, Jilin University, Changchun 130012, China
Funding: National Natural Science Foundation of China (Grant No. 60373099)
Abstract: A dictionary-free Chinese word segmentation algorithm based on suffix arrays is proposed. The algorithm obtains character co-occurrence patterns by combining a suffix array with a hash table, and filters word candidates by confidence. Experiments show that, without requiring a dictionary or a corpus, the algorithm quickly and accurately extracts medium- and high-frequency words from documents. It is suitable for Chinese information processing tasks that are sensitive to term frequency and demand high computational speed.

Keywords: Chinese information processing  automatic Chinese word segmentation  suffix array  hash table
Article ID: 1671-5489(2004)04-0548-06
Received: 2004-02-25
Revised: 2004-02-25

An automatic and dictionary-free Chinese word segmentation method based on suffix arrays
ZHANG Chang-li, HE Feng-ling, ZUO Wan-li. An automatic and dictionary-free Chinese word segmentation method based on suffix arrays[J]. Journal of Jilin University (Science Edition), 2004, 42(4): 548-553.
Authors: ZHANG Chang-li  HE Feng-ling  ZUO Wan-li
Institution: College of Computer Science and Technology, Jilin University, Changchun 130012, China
Abstract: An automatic, dictionary-free Chinese word segmentation method based on suffix arrays is proposed. The algorithm obtains co-occurrence patterns of Chinese characters by means of a suffix array and a hash table, and filters word candidates by confidence. Experimental results show that the algorithm extracts medium- and high-frequency lexical items effectively and efficiently, without the help of a dictionary or a corpus. The method is particularly suitable for lexical-frequency-sensitive and time-critical Chinese information processing applications.
Keywords: Chinese information processing  automatic Chinese word segmentation  suffix array  hash table
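
The abstract describes the pipeline only at a high level. As a rough illustration of the idea, the Python sketch below builds a suffix array, uses it to count candidate character n-grams stored in a hash table, and keeps candidates that clear a confidence threshold. The naive suffix-array construction, the prefix-based confidence formula, and all thresholds are assumptions made for this sketch, not the paper's actual implementation.

```python
# Minimal sketch: suffix array + hash table for dictionary-free word
# extraction. Confidence formula and thresholds are illustrative
# assumptions. (Requires Python 3.10+ for the `key` argument to bisect.)
from bisect import bisect_left, bisect_right

def suffix_array(text):
    # Start offsets of all suffixes of `text`, in lexicographic order.
    # Naive O(n^2 log n) construction; enough for a sketch.
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    # Suffixes beginning with `pattern` are contiguous in the suffix
    # array, so two binary searches yield the occurrence count.
    key = lambda i: text[i:i + len(pattern)]
    return bisect_right(sa, pattern, key=key) - bisect_left(sa, pattern, key=key)

def extract_words(text, max_len=4, min_freq=3, min_conf=0.6):
    sa = suffix_array(text)
    freq = {}  # hash table: candidate n-gram -> frequency in the document
    for i in range(len(text) - 1):
        for n in range(2, max_len + 1):
            gram = text[i:i + n]
            if len(gram) == n and gram not in freq:
                freq[gram] = occurrences(text, sa, gram)
    words = []
    for gram, f in freq.items():
        if f < min_freq:
            continue
        # Assumed confidence: how strongly the (n-1)-character prefix
        # predicts the full n-gram.
        if f / occurrences(text, sa, gram[:-1]) >= min_conf:
            words.append((gram, f))
    return sorted(words, key=lambda w: -w[1])
```

On a real corpus one would also need to suppress substrings of accepted words and handle punctuation boundaries; the sketch omits both for brevity.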