
Computation of document similarity based on metadata and domain concept tree

Cite this article:	ZHANG Pei-yun, CHEN En-hong, XIE Rong-jian, GONG Xiu-wen, HUANG Bo. Computation of document similarity based on metadata and domain concept tree[J]. System Engineering and Electronics, 2014, 36(3): 591-597

Authors:	ZHANG Pei-yun, CHEN En-hong, XIE Rong-jian, GONG Xiu-wen, HUANG Bo

Affiliation:	(1. School of Mathematics and Computer Science, Anhui Normal University, Wuhu 241003, China; 2. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China; 3. School of Management, University of Science and Technology of China, Hefei 230026, China; 4. School of Computer Science & Technology, Nanjing University of Science and Technology, Nanjing 210094, China)

Abstract:	With the rapid development of network and information technology, a large volume of electronic documents has appeared online, and computing the similarity between documents is an important tool in document processing. For large-scale document collections, the vector space model (VSM) is usually used to represent documents, but this approach suffers from high-dimensional document vectors and from the difficulty of measuring semantic similarity. An improved method for computing document similarity is proposed. Representative metadata feature-vector elements are selected from the large feature space in order to reduce the dimension of the vector space. A domain concept tree is constructed and a document similarity algorithm based on this tree is designed; the algorithm handles the synonyms that are widespread among domain concepts, so as to improve the measurement of semantic similarity between documents. The experimental results show that dimensionality reduction and concept similarity computation improve the performance of document similarity computation.
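
Illustration:	The abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a toy concept tree whose nodes carry synonym lists (`CONCEPT_TREE`, hypothetical), treats the concepts in that tree as the reduced feature space (a stand-in for the selected metadata features), and uses plain cosine similarity over concept counts. All names and data are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy domain concept tree: each node lists its synonyms and children.
CONCEPT_TREE = {
    "vehicle": {
        "synonyms": [],
        "children": {
            "car": {"synonyms": ["automobile", "auto"], "children": {}},
            "truck": {"synonyms": ["lorry"], "children": {}},
        },
    }
}


def build_synonym_index(tree):
    """Walk the tree and map every concept name and synonym to its concept."""
    index = {}
    for concept, node in tree.items():
        index[concept] = concept
        for synonym in node["synonyms"]:
            index[synonym] = concept
        index.update(build_synonym_index(node["children"]))
    return index


SYNONYM_INDEX = build_synonym_index(CONCEPT_TREE)


def vectorize(text):
    """Count concepts only; words outside the tree are dropped (assumed reduction)."""
    words = text.lower().split()
    return Counter(SYNONYM_INDEX[w] for w in words if w in SYNONYM_INDEX)


def cosine(va, vb):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def document_similarity(doc_a, doc_b):
    """Similarity of two documents after synonym merging and feature reduction."""
    return cosine(vectorize(doc_a), vectorize(doc_b))
```

In this sketch, `document_similarity("the automobile is fast", "a car is quick")` evaluates to 1.0, because both documents reduce to the single concept `car`; a plain word-level VSM would see no shared content word between them. A fuller version could also weight related but distinct concepts by their distance in the tree, which this sketch deliberately leaves out.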
