
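New word detection based on a background bigram model
Cite this article: WU Yue, YAN Pengju, ZHAI Lufeng. New word detection based on a background bigram model[J]. Journal of Tsinghua University (Science and Technology), 2011, 0(9)
Authors: WU Yue, YAN Pengju, ZHAI Lufeng
Affiliations: School of Mathematical Sciences, Fudan University; Shanda Innovations-Speech
Abstract: This paper proposes a new word detection method based on a background bigram model. Candidate bigrams are first selected by the likelihood ratio between the foreground and background corpora. The bigrams are then filtered and extended according to statistics computed on the foreground corpus, including frequency, rigidity, and conditional probability, in order to determine the boundaries of new words. Words extracted this way both carry the characteristics of new words and form valid words. The method makes full use of existing raw background corpora without requiring segmentation or other annotation, and it does not depend on dictionaries, segmentation models, or rules, so it extends well. To improve detection quality, the paper also discusses how to choose the threshold for each statistic and how to remove garbage elements. The method's effectiveness was verified on a corpus of online novels.

Keywords: new word detection; bigram; background model; likelihood ratio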
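
The candidate-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulas, so everything here is an assumption made for illustration: the function names `char_bigrams` and `candidate_bigrams`, the add-one smoothing, the use of a log ratio of smoothed relative frequencies as the "likelihood ratio", and the default thresholds. The rigidity and conditional-probability filters and the extension step are not shown.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Count adjacent character pairs in raw, unsegmented text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def candidate_bigrams(foreground, background, min_freq=2, min_log_ratio=1.0):
    """Rank foreground bigrams by a smoothed foreground/background log ratio.

    Assumption: add-one smoothing over the joint bigram vocabulary stands in
    for whatever estimator the paper actually uses.
    """
    fg = char_bigrams(foreground)
    bg = char_bigrams(background)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab = len(set(fg) | set(bg))

    scored = []
    for bigram, count in fg.items():
        if count < min_freq:  # frequency filter on the foreground corpus
            continue
        p_fg = (count + 1) / (fg_total + vocab)
        p_bg = (bg[bigram] + 1) / (bg_total + vocab)
        log_ratio = math.log(p_fg / p_bg)
        if log_ratio >= min_log_ratio:
            scored.append((bigram, count, log_ratio))
    # Highest ratio first: most characteristic of the foreground corpus.
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Toy data: ASCII letters stand in for Chinese characters.
candidates = candidate_bigrams("zqabczqdczqec", "habchdchechkc")
for bigram, count, log_ratio in candidates:
    print(bigram, count, round(log_ratio, 3))
```

On this toy input the sketch keeps `zq`, which recurs in the foreground and never appears in the background, but it also keeps `cz`, a pair that merely straddles two adjacent units. That kind of boundary-crossing candidate is what the later filtering by rigidity and conditional probability, and the garbage-removal strategy mentioned in the abstract, would need to handle.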

New word detection based on a background bigram model
WU Yue, YAN Pengju, ZHAI Lufeng. New word detection based on a background bigram model[J]. Journal of Tsinghua University (Science and Technology), 2011, 0(9)
Authors: WU Yue, YAN Pengju, ZHAI Lufeng
Affiliation: WU Yue (1), YAN Pengju (2), ZHAI Lufeng (2). (1) School of Mathematical Sciences, Fudan University, Shanghai 200433, China; (2) Shanda Innovations-Speech, Shanghai 201203, China
Abstract: A new word detection method was developed that first extracts bigrams from the target foreground corpus based on their foreground and background likelihood ratio. Then, it filters and extends the bigrams to qualified new words according to statistical metrics including the frequency, rigidity and conditional probability. The method makes sure that the selected words are actually new based on background knowledge, and fixes the word boundary precisely according to the statistical metrics. The method requires no re...
Keywords: new word detection; bigram; background model; likelihood ratio
This article is indexed in CNKI and other databases.