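Research of Text Similarity Based on HowNet (original title: 基于《知网》的文本相似度研究)
Cite this article: YUAN Xiaofeng. Research of Text Similarity Based on HowNet [J]. Journal of Chengdu University (Natural Science), 2014, 33(3): 251-253.
Author: YUAN Xiaofeng
Affiliation: College of Information Science and Technology, Yancheng Teachers University, Yancheng, Jiangsu 224002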
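Abstract: A common way to compute text similarity is to take the cosine of the angle between texts represented in the vector space model (VSM), but this approach does not account for the semantic similarity between the words in the texts. In addition, computing the cosine requires the VSM vectors to be aligned, which leads to high dimensionality and high computational complexity. HowNet, a widely used Chinese knowledge base, has been studied extensively, and it makes it easy to obtain the similarity between Chinese words. This paper uses HowNet to compute the similarity between the words in each text and improves the VSM accordingly, using the TF/IDF values of a small number of feature words as the weights in the improved VSM vector, and then computes the similarity between texts. A comparison of the dimensionality, recall, and precision of the VSM before and after the improvement shows that the improved algorithm markedly reduces computational complexity and raises both recall and precision.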
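
To illustrate the baseline that the abstract describes, the following is a minimal Python sketch of TF-IDF weighting and cosine similarity over sparse term vectors. The function names, the toy documents, and the specific TF-IDF variant (relative term frequency times log(N/df)) are assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document.

    Assumed weighting: (count / doc length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    # Only terms present in both vectors contribute to the dot product;
    # this is the "alignment" of dimensions between the two texts.
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenized.
docs = [
    ["computer", "science", "text"],
    ["pc", "science", "text"],
    ["weather", "rain", "cloud"],
]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))
print(cosine(vecs[0], vecs[2]))
```

In this toy corpus the first two documents are matched only through the literal terms "science" and "text"; "computer" and "pc" occupy separate dimensions and contribute nothing to the score, which is the limitation the abstract points out.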

Keywords: HowNet  semantic similarity  VSM  text similarity

Research of Text Similarity Based on HowNet
YUAN Xiaofeng. Research of Text Similarity Based on HowNet [J]. Journal of Chengdu University (Natural Science), 2014, 33(3): 251-253.
Authors: YUAN Xiaofeng
Institution: College of Information Science and Technology, Yancheng Teachers University
Abstract: A commonly used method for calculating text similarity is to compute the cosine of the angle between texts represented in the vector space model (VSM). However, this method does not consider the semantic similarity among the words in a text. In addition, the VSM vectors must be aligned when the cosine is calculated, which results in high dimensionality and high computational complexity. HowNet is a widely used Chinese knowledge base that makes it easy to calculate the similarity between two Chinese words. In this paper, we improve the VSM: we use the TF*IDF values of a small number of feature words as the weights of the improved VSM vector and then calculate the similarity between texts. Finally, we compare the dimensionality, recall, and precision of the original and improved VSM. The results show that the improved VSM significantly reduces computational complexity and improves recall and precision.
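
The abstract does not give the improved algorithm in detail, so the following Python sketch shows only one plausible way to combine a word-level similarity measure with a small set of weighted feature words. It is a best-match weighted average, not the paper's method. The lookup table `TOY_SIM` is a hypothetical stand-in for HowNet-based word similarity, and the function names, the weights, and the cutoff `k` are all assumptions made for illustration.

```python
def word_sim(a, b, table):
    """Similarity of two words in [0, 1]; identical words score 1.0.

    `table` is a hypothetical stand-in for a HowNet-based measure.
    """
    if a == b:
        return 1.0
    return table.get(frozenset((a, b)), 0.0)


def top_k(weights, k):
    """Keep only the k highest-weighted feature words of a text."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def directed_sim(u, v, table):
    """Weighted average, over words of u, of the best match found in v."""
    total = sum(u.values())
    if total == 0 or not v:
        return 0.0
    best = sum(w * max(word_sim(a, b, table) for b in v) for a, w in u.items())
    return best / total


def text_sim(u, v, table, k=2):
    """Symmetric text similarity over the top-k feature words of each text."""
    u, v = top_k(u, k), top_k(v, k)
    return 0.5 * (directed_sim(u, v, table) + directed_sim(v, u, table))


# Hypothetical word-similarity table and TF*IDF-style weights.
TOY_SIM = {frozenset(("computer", "pc")): 0.9}
doc_a = {"computer": 0.5, "science": 0.3, "the": 0.01}
doc_b = {"pc": 0.6, "science": 0.2, "of": 0.02}
doc_c = {"rain": 0.7, "cloud": 0.4}

print(text_sim(doc_a, doc_b, TOY_SIM))
print(text_sim(doc_a, doc_c, TOY_SIM))
```

Because each text is reduced to its `k` strongest feature words, the comparison runs over a handful of dimensions instead of the full aligned vocabulary, and "computer" and "pc" now contribute to the score through the word-similarity table.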
Keywords: HowNet  semantic similarity  VSM  text similarity
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.