
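A Pinyin-index-based approximate matching algorithm for Chinese
Cite this article: CAO Jiang, WU Xiaojun, XIA Yunqing, ZHENG Fang. Pinyin-indexed method for approximate matching in Chinese[J]. Journal of Tsinghua University (Science and Technology), 2009(Z1).
Authors: CAO Jiang, WU Xiaojun, XIA Yunqing, ZHENG Fang
Affiliations: Department of Computer Science and Technology, Tsinghua University; Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology
Funding: National Natural Science Foundation of China (60703051)
Abstract: Mainstream commercial search engines rely mainly on exact keyword matching. To improve retrieval when the user's input contains errors, an indexed approximate matching algorithm for Chinese is proposed. The algorithm uses three measures of similarity between Chinese strings (character edit distance, Pinyin edit distance, and Pinyin improved edit distance) to expand the user's query. This converts one approximate match into several exact matches, whose results are then ranked by their similarity to the query string. In experiments, the method was applied to a corpus of web page text. With the Pinyin improved edit distance as the measure, the method achieved 60.42% precision and 50.41% recall with only a small increase in time and space complexity.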

Keywords: document information processing; Pinyin index; approximate matching; query expansion
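
The query-expansion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the Pinyin improved edit distance, so plain Levenshtein distance over Pinyin syllables stands in for it. The names `TOY_PINYIN`, `to_pinyin`, `expand_query`, and the `max_dist` threshold are hypothetical, and the Pinyin table is a hand-written toy rather than a real dictionary.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy Pinyin table (tones ignored); a real system would need
# a complete character-to-Pinyin dictionary.
TOY_PINYIN: Dict[str, str] = {
    "清": "qing", "华": "hua", "青": "qing", "花": "hua",
    "大": "da", "学": "xue", "北": "bei", "京": "jing",
}


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Plain Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def to_pinyin(text: str) -> List[str]:
    """Map each character to its toy Pinyin; unknown characters pass through."""
    return [TOY_PINYIN.get(ch, ch) for ch in text]


def expand_query(query: str, indexed_terms: List[str],
                 max_dist: int = 1) -> List[Tuple[str, int]]:
    """Return indexed terms within max_dist of the query at the Pinyin
    syllable level, nearest first. Each returned term can then be looked
    up with an ordinary exact match."""
    q = to_pinyin(query)
    scored = [(term, edit_distance(q, to_pinyin(term))) for term in indexed_terms]
    kept = [(term, dist) for term, dist in scored if dist <= max_dist]
    return sorted(kept, key=lambda pair: pair[1])
```

For example, the mistyped query `青花大学` is two character edits away from `清华大学` but zero Pinyin-syllable edits away, so `expand_query("青花大学", ["清华大学", "北京大学", "清华"])` keeps only `清华大学` as an expansion candidate.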

Pinyin-indexed method for approximate matching in Chinese
CAO Jiang, WU Xiaojun, XIA Yunqing, ZHENG Fang. Pinyin-indexed method for approximate matching in Chinese[J]. Journal of Tsinghua University (Science and Technology), 2009(Z1).
Authors: CAO Jiang, WU Xiaojun, XIA Yunqing, ZHENG Fang
Institution: 1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; 2. Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology, China
Abstract: Exact keyword matching is key to popular commercial search engines. A Chinese approximate matching method with an index structure was developed to achieve better retrieval when the input contains errors. Three types of similarity measurement between two Chinese strings were developed, based on the character edit distance, the Pinyin edit distance, and the Pinyin improved edit distance. The similarity measurements were used to expand the user's query so that the approximate matching task can be represented as sever...
Keywords: document information processing; Pinyin index; approximate matching; query expansion
This article is indexed in CNKI and other databases.