A Dictionary-Free Chinese Word Segmentation Method Based on Suffix Arrays
Cite this article: ZHANG Chang-li, HE Feng-ling, ZUO Wan-li. A dictionary-free Chinese word segmentation method based on suffix arrays[J]. Journal of Jilin University (Science Edition), 2004, 42(4): 548-553.
Authors: ZHANG Chang-li  HE Feng-ling  ZUO Wan-li
Institution: College of Computer Science and Technology, Jilin University, Changchun 130012, China
Funding: National Natural Science Foundation of China (Grant No. 60373099)
Abstract: A dictionary-free Chinese word segmentation algorithm based on suffix arrays is proposed. The algorithm obtains character co-occurrence patterns by combining a suffix array with a hash table, and filters word candidates by confidence. Experiments show that, without requiring a dictionary or a corpus, the algorithm quickly and accurately extracts medium- and high-frequency words from documents. It is suitable for Chinese information processing tasks that are sensitive to term frequency and demand high computational speed.

Keywords: Chinese information processing  automatic Chinese word segmentation  suffix array  hash table
Article ID: 1671-5489(2004)04-0548-06
Received: 2004-02-25
Revised: 2004-02-25

An automatic and dictionary-free Chinese word segmentation method based on suffix arrays
ZHANG Chang-li, HE Feng-ling, ZUO Wan-li. An automatic and dictionary-free Chinese word segmentation method based on suffix arrays[J]. Journal of Jilin University (Science Edition), 2004, 42(4): 548-553.
Authors: ZHANG Chang-li  HE Feng-ling  ZUO Wan-li
Institution: College of Computer Science and Technology, Jilin University, Changchun 130012, China
Abstract: An automatic, dictionary-free Chinese word segmentation method based on suffix arrays is proposed. The algorithm obtains co-occurrence patterns of Chinese characters by means of a suffix array and a hash table, and filters word candidates by confidence. Experimental results show that the algorithm extracts medium- and high-frequency lexical items effectively and efficiently, without the help of a dictionary or a corpus. The method is particularly suitable for lexical-frequency-sensitive and time-critical Chinese information processing applications.
Keywords: Chinese information processing  automatic Chinese word segmentation  suffix array  hash table
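
The abstract describes the pipeline only at a high level. As a rough illustration of the idea, the Python sketch below builds a suffix array, uses it to count candidate character n-grams stored in a hash table, and keeps candidates that clear a confidence threshold. The naive suffix-array construction, the prefix-based confidence formula, and all thresholds are assumptions made for this sketch, not the paper's actual implementation.

```python
# Minimal sketch: suffix array + hash table for dictionary-free word
# extraction. Confidence formula and thresholds are illustrative
# assumptions. (Requires Python 3.10+ for the `key` argument to bisect.)
from bisect import bisect_left, bisect_right

def suffix_array(text):
    # Start offsets of all suffixes of `text`, in lexicographic order.
    # Naive O(n^2 log n) construction; enough for a sketch.
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    # Suffixes beginning with `pattern` are contiguous in the suffix
    # array, so two binary searches yield the occurrence count.
    key = lambda i: text[i:i + len(pattern)]
    return bisect_right(sa, pattern, key=key) - bisect_left(sa, pattern, key=key)

def extract_words(text, max_len=4, min_freq=3, min_conf=0.6):
    sa = suffix_array(text)
    freq = {}  # hash table: candidate n-gram -> frequency in the document
    for i in range(len(text) - 1):
        for n in range(2, max_len + 1):
            gram = text[i:i + n]
            if len(gram) == n and gram not in freq:
                freq[gram] = occurrences(text, sa, gram)
    words = []
    for gram, f in freq.items():
        if f < min_freq:
            continue
        # Assumed confidence: how strongly the (n-1)-character prefix
        # predicts the full n-gram.
        if f / occurrences(text, sa, gram[:-1]) >= min_conf:
            words.append((gram, f))
    return sorted(words, key=lambda w: -w[1])
```

On a real corpus one would also need to suppress substrings of accepted words and handle punctuation boundaries; the sketch omits both for brevity.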