
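A Method for Automatic Tibetan Sentence Segmentation Combining Statistics and Rules
Cite this article: XU Tao (徐涛), JIA Yang-ji (加羊吉), YU Hong-zhi (于洪志). A method for automatic Tibetan sentence segmentation combining statistics and rules [J]. Journal of Yunnan University (Natural Sciences), 2012, 0(6): 653-657, 663
Authors: XU Tao (徐涛)  JIA Yang-ji (加羊吉)  YU Hong-zhi (于洪志)
Affiliation: Northwest University for Nationalities; Key Laboratory of China's National Linguistic Information Technology (State Ethnic Affairs Commission and Ministry of Education)
Funding: National Natural Science Foundation of China (61032008, 60970071); Natural Science Foundation of Gansu Province (1107RJZA157)
Abstract: Sentence segmentation is one of the difficult problems in Tibetan information processing, and it is an important foundation for tasks such as Tibetan-Chinese machine translation and Tibetan text classification. This paper proposes an automatic Tibetan sentence segmentation method that combines statistics and rules in order to resolve the functional ambiguity of Tibetan punctuation marks. Experimental results show that the method performs well, with an F1 score above 98%. In the rule component, an empirical method first identifies uncertain Tibetan sentences as candidate sentences; a compound-sentence analysis based on connective words then merges clauses to form further candidate sentences; finally, a maximum entropy method segments the further candidate sentences. The empirical method and the compound-sentence analysis effectively handle the corpus sparsity and clause problems that the maximum entropy algorithm alone cannot reach.

Keywords: automatic Tibetan sentence segmentation  compound-sentence analysis  further candidate sentences  maximum entropy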
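
For reference, the F1 score quoted in the abstract is the harmonic mean of precision and recall. The abstract does not say how the two are counted; the block below assumes boundary-level counting, and the symbols N_correct, N_predicted and N_reference are introduced here for illustration only.

```latex
% Assumption: counts are taken over sentence boundaries.
% N_correct   = predicted boundaries that match the reference
% N_predicted = all boundaries output by the system
% N_reference = all boundaries in the gold-standard annotation
\[
P = \frac{N_{\text{correct}}}{N_{\text{predicted}}}, \qquad
R = \frac{N_{\text{correct}}}{N_{\text{reference}}}, \qquad
F_1 = \frac{2PR}{P + R}
\]
```

With purely illustrative values (not figures from the paper), P = 0.99 and R = 0.97 give F1 = 2(0.99)(0.97)/1.96 ≈ 0.980.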

An approach of automatic segmentation for Tibetan sentence based on rules and statistics
XU Tao, JIA Yang-ji, YU Hong-zhi. An approach of automatic segmentation for Tibetan sentence based on rules and statistics[J]. Journal of Yunnan University (Natural Sciences), 2012, 0(6): 653-657, 663
Authors: XU Tao  JIA Yang-ji  YU Hong-zhi
Affiliation: Key Lab of China’s National Linguistic Information Technology, Northwest University for Nationalities, Lanzhou 730030, China
Abstract: Sentence segmentation is one of the difficult tasks in Tibetan information processing and a key foundation for Tibetan-Chinese machine translation, text categorization, and related work. To deal with the ambiguous functions of Tibetan punctuation marks, this paper proposes a method for automatic segmentation of Tibetan sentences that combines statistics and rules. Experiments show that the approach works well: the F1-measure exceeds 98%. Within the rules, an empirical method first identifies ambiguous Tibetan sentences as candidate sentences. An analysis of compound sentences based on conjunctive words then merges clauses to form further candidate sentences. Finally, a maximum entropy method segments the further candidate sentences. The empirical method and the compound-sentence analysis thus address the sparse-corpus and clause problems that maximum entropy alone cannot handle.
Keywords: automatic segmentation of Tibetan sentences  analysis of compound sentences  further candidate sentences  maximum entropy
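
The three stages named in the abstract (empirical rules, connective-based clause merging, maximum entropy) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the input is already word-segmented with whitespace, and that the shad (U+0F0D) is the ambiguous delimiter. The function names, word lists, feature templates, and hand-set weights are all hypothetical. A real system would train the maximum entropy weights on annotated data, and the paper may order or combine the stages differently from this single pass.

```python
import math
from typing import Dict, List, Set, Tuple

SHAD = "\u0f0d"  # Tibetan shad; it can close either a clause or a sentence


def split_candidates(text: str, delimiter: str = SHAD) -> List[str]:
    """Cut the text at every delimiter and keep the non-empty pieces."""
    return [piece.strip() for piece in text.split(delimiter) if piece.strip()]


def extract_features(piece: str) -> List[str]:
    """Hypothetical feature templates: the last word and a coarse length flag."""
    words = piece.split()
    return [f"last={words[-1]}", f"long={len(words) > 3}"]


def maxent_boundary_prob(features: List[str],
                         weights: Dict[Tuple[str, str], float]) -> float:
    """P(boundary | features) under a two-label conditional maxent model."""
    labels = ("B", "N")  # B = sentence boundary, N = no boundary
    scores = {y: sum(weights.get((f, y), 0.0) for f in features) for y in labels}
    z = sum(math.exp(s) for s in scores.values())  # normalising constant
    return math.exp(scores["B"]) / z


def segment_sentences(text: str,
                      final_words: Set[str],
                      connectives: Set[str],
                      weights: Dict[Tuple[str, str], float],
                      threshold: float = 0.5,
                      delimiter: str = SHAD) -> List[str]:
    """Single-pass simplification of the rule + merge + maxent pipeline."""
    sentences: List[str] = []
    buffer: List[str] = []  # clauses collected for the current candidate
    for piece in split_candidates(text, delimiter):
        buffer.append(piece)
        last_word = piece.split()[-1]
        if last_word in connectives:
            # Clause merging: a clause ending in a connective continues.
            continue
        if last_word in final_words:
            # Empirical rule: a known sentence-final word is a sure boundary.
            is_boundary = True
        else:
            # Statistical stage: let the maxent model decide the ambiguous case.
            prob = maxent_boundary_prob(extract_features(piece), weights)
            is_boundary = prob >= threshold
        if is_boundary:
            sentences.append((delimiter + " ").join(buffer))
            buffer = []
    if buffer:  # trailing material with no confirmed boundary
        sentences.append((delimiter + " ").join(buffer))
    return sentences
```

Calling `segment_sentences(text, final_words, connectives, weights)` with placeholder word lists returns a list of sentence strings in which merged clauses keep their internal shad. In this sketch the rules run first, so the statistical model only sees delimiters that the rules could not resolve, which is one way to read the abstract's claim that the rules offload cases the maximum entropy model handles poorly.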
This article is indexed in CNKI and other databases.