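A Word Clustering Method Based on Mutual Information

Cite as:	YUAN Li-chi. A Word Clustering Method Based on Mutual Information [J]. Systems Engineering, 2008, 26(5).

Author:	YUAN Li-chi

Affiliation:	School of Information Science and Engineering, Central South University, Changsha, Hunan 410083; School of Information Management, Jiangxi University of Finance and Economics, Nanchang, Jiangxi 330013

Funding:	Central South University research and teaching reform project; National Natural Science Foundation of China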

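Abstract:	Class-based statistical language models are an important way to address data sparsity in statistical models. Traditional statistical clustering methods follow a greedy principle and typically use the likelihood function or the perplexity of the corpus as the evaluation criterion. Their main drawbacks are slow clustering, strong sensitivity to initial values, and a tendency to get trapped in local optima. This paper uses mutual information to define a word similarity and, on that basis, defines a similarity between word sets. Based on these similarities, it proposes a bottom-up hierarchical clustering algorithm. The method not only improves clustering quality but also allows different similarity definitions to be chosen for different models, which makes the resulting clusters more useful in practice. Experiments show that the algorithm clearly improves on traditional greedy statistical clustering algorithms in both computational complexity and clustering quality.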
Keywords:	mutual information; word similarity; clustering algorithm; statistical language model
A Word Clustering Method Based on Mutual Information

YUAN Li-chi. A Word Clustering Method Based on Mutual Information [J]. Systems Engineering, 2008, 26(5).

Authors:	YUAN Li-chi

Institution:	1. College of Information Science and Engineering, Central South University, Changsha 410083, China; 2. School of Information Technology, Jiangxi University of Finance & Economics, Nanchang 330013, China

Abstract:	Cluster-based statistical language models are an important method for solving the sparse data problem. Conventional statistical clustering methods are usually based on a greedy principle, and the common criterion for evaluating a clustering algorithm is the likelihood function or the perplexity of the corpus. Conventional clustering algorithms often converge to a local optimum, so a global optimum is not guaranteed, and the initial choices can influence the final results. In order to solve these problems, we first give a definition to w...

Keywords:	Mutual Information; Word Similarity; Clustering Algorithm; Statistical Language Model
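
The abstract outlines three steps: a word similarity derived from mutual information, a similarity between word sets built on it, and bottom-up hierarchical clustering. The record does not give the paper's actual definitions, so the following is only a minimal sketch under stated assumptions: each word is represented by its positive pointwise mutual information with the word that follows it, word similarity is taken to be the cosine of those vectors, and set similarity is taken to be the average pairwise word similarity. The function names (`pmi_vectors`, `word_similarity`, `set_similarity`, `cluster`) are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_vectors(sentences):
    """Map each word to a sparse vector of positive PMI with its right neighbor.

    Assumption: PMI over adjacent word pairs; the paper may use another context.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    vectors = {w: {} for w in unigrams}
    for (w, c), count in bigrams.items():
        p_joint = count / n_bi
        p_indep = (unigrams[w] / n_uni) * (unigrams[c] / n_uni)
        pmi = math.log(p_joint / p_indep)
        if pmi > 0:  # keep only positive associations
            vectors[w][c] = pmi
    return vectors


def word_similarity(u, v):
    """Cosine similarity of two sparse PMI vectors (an assumed definition)."""
    dot = sum(x * v[k] for k, x in u.items() if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def set_similarity(a, b, vectors):
    """Average pairwise word similarity between two word sets (assumed)."""
    total = sum(word_similarity(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster(sentences, k):
    """Bottom-up clustering: merge the two most similar sets until k remain."""
    vectors = pmi_vectors(sentences)
    clusters = [frozenset([w]) for w in sorted(vectors)]
    while len(clusters) > k:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: set_similarity(pair[0], pair[1], vectors),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters
```

Calling `cluster(sentences, k)` on a list of tokenized sentences returns `k` frozensets of words. Because the similarity functions are separate from the merge loop, a different similarity definition can be swapped in for a different model, which is the flexibility the abstract points to. This naive loop recomputes every pairwise set similarity at each merge, so it is meant for illustration and not as a statement about the paper's complexity.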
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
