Internet product matching algorithm (互联网商品匹配算法)

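Cite this article:	GU Qi, ZHU Can, CAO Jian. Internet product matching algorithm (互联网商品匹配算法)[J]. Journal of Shanghai University (Natural Science), 2016, 22(1): 58-68. DOI: 10.3969/j.issn.1007-2861.2015.04.016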

Authors:	GU Qi (顾颀), ZHU Can (朱灿), CAO Jian (曹健)

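Affiliations:	1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240; 2. School of Computer Science and Technology, Nantong University, Nantong, Jiangsu 226019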

Funding:	National Natural Science Foundation of China (61272438, 61472253, 61300167); Science and Technology Commission of Shanghai Municipality (15411952502, 14511107702)

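Abstract:	Entity resolution is the process of identifying different descriptions of the same entity. It aims to safeguard data quality and is a key technique in data cleaning, data integration, and data mining. As e-commerce continues to develop and mature, the diversity of products and the flexible ways in which consumers buy them make accurate identification and matching of online products a problem that urgently needs solving in the big data era. Unlike traditional entity resolution, which mainly targets structured data, web data is unstructured, heterogeneous, and massive. To address this, a synthesized similarity method (SSM) is designed to compute the similarity between online product records, and an agglomerative hierarchical clustering framework is introduced to match heterogeneous products from different data sources. In addition, to meet the execution-efficiency requirements of big data environments, SSM is optimized in three respects: a string-similarity cache, a constraint knowledge base, and a blocking strategy. Experimental results on real data sets verify the execution efficiency and effectiveness of SSM.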
Keywords:	big data; unstructured data; product matching; entity resolution
Received:	2015-11-30
Product matching based on Internet and its implementation

GU Qi,ZHU Can,CAO Jian. Product matching based on Internet and its implementation[J]. Journal of Shanghai University(Natural Science), 2016, 22(1): 58-68. DOI: 10.3969/j.issn.1007-2861.2015.04.016

Authors:	GU Qi, ZHU Can, CAO Jian

Affiliation:	1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. School of Computer Science and Technology, Nantong University, Nantong 226019, Jiangsu, China

Abstract:	Entity resolution identifies records from different data sources that refer to the same real-world entity. It is an important prerequisite for data cleaning, data integration, and data mining, and is key to ensuring data quality. With the rapid growth of e-commerce, the diversity of products, and the flexible buying patterns of consumers, product identification and matching has become a pressing research problem in the big data era. While traditional entity resolution approaches focus on structured data, Internet data are neither standardized nor structured. To address this problem, this paper presents a synthesized similarity method (SSM) to calculate the similarity between different products. An agglomerative hierarchical clustering method is used to match products from different sources. The approach is also optimized for execution efficiency in three respects: a global cache, knowledge constraints, and blocking strategies. Finally, a series of experiments is performed on real data sets. The experimental results show that the proposed approach performs better than the other approaches it is compared with.

Keywords:	big data; entity resolution; product matching; unstructured data
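
The pipeline outlined in the abstract (a combined similarity score, a cache for string comparisons, blocking, and agglomerative hierarchical clustering) can be illustrated with a minimal Python sketch. This is not the authors' SSM: the abstract does not give its formula, so the two measures mixed here (difflib's character ratio and token Jaccard overlap), the 0.5 weight, the 0.6 merge threshold, the first-token blocking key, and every function name are assumptions chosen only for illustration. The constraint knowledge base is left out entirely.

```python
from difflib import SequenceMatcher
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def string_similarity(a: str, b: str) -> float:
    """Cached character-level similarity in [0, 1] (a stand-in measure)."""
    return SequenceMatcher(None, a, b).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def combined_similarity(a: str, b: str, weight: float = 0.5) -> float:
    """Weighted mix of both measures; the weight is an arbitrary assumption."""
    # Sort the lower-cased pair so (a, b) and (b, a) share one cache entry.
    x, y = sorted((a.lower(), b.lower()))
    return weight * string_similarity(x, y) + (1 - weight) * token_jaccard(a, b)


def average_linkage(c1: list, c2: list) -> float:
    """Mean pairwise similarity between two clusters of titles."""
    total = sum(combined_similarity(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def cluster_block(titles: list, threshold: float) -> list:
    """Agglomerative clustering: merge the closest pair until none is close enough."""
    clusters = [[t] for t in titles]
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            score = average_linkage(clusters[i], clusters[j])
            if score > best_score:
                best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break
        i, j = best_pair  # combinations() guarantees i < j
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


def match_products(titles: list, threshold: float = 0.6) -> list:
    """Block titles by first token (assumed to be the brand), then cluster each block."""
    blocks = {}
    for title in titles:
        tokens = title.split()
        key = tokens[0].lower() if tokens else ""
        blocks.setdefault(key, []).append(title)
    matched = []
    for block in blocks.values():
        matched.extend(cluster_block(block, threshold))
    return matched
```

Calling `match_products` on a list of product titles returns a list of clusters, each holding the titles judged to describe the same product. Blocking means titles with different first tokens are never compared, which trades some recall for far fewer pairwise comparisons, and the `lru_cache` decorator avoids recomputing the character-level score for a pair that has already been seen.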
This article is indexed in databases including CNKI and Wanfang Data.