Knowledge Enhanced Pre-Training Model for Vision-Language-Navigation Task

Authors:	HUANG Jitao ZENG Guohui HUANG Bo GAO Yongbin LIU Jin SHI Zhicai

Abstract:	Vision-Language Navigation (VLN) is a cross-modal task that combines natural language processing and computer vision. It requires an agent to move autonomously to a destination by following a natural language instruction and using the visual information it observes from its surroundings. To make the best decision at every navigation step, the agent should attend closely to objects, their attributes, and the relationships between them. Most current methods, however, treat all received textual and visual information equally. This paper therefore integrates finer-grained semantic connections between visual and textual information through three pre-training tasks: object prediction, object attribute prediction, and object relationship prediction. With these tasks, the model learns a better fused representation of, and alignment between, the two types of information, which improves the success rate (SR) and generalization. Experiments show that, compared with previous baseline models, SR on the unseen validation set (Val Unseen) increased by 7% and SR weighted by path length (SPL) increased by 7%; on the test set (Test), SR increased by 4% and SPL by 3%.
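The two metrics reported in the abstract can be illustrated with a short sketch. It assumes the commonly used definitions: SR is the fraction of successful episodes, and SPL weights each success by the ratio of the shortest path length to the length of the path the agent actually took. The paper's own evaluation code is not shown here, so the `Episode` class, its field names, and the function names below are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation episode (hypothetical container for illustration)."""
    success: bool         # True if the agent stopped close enough to the goal
    shortest_path: float  # length of the shortest start-to-goal path (assumed > 0)
    agent_path: float     # length of the path the agent actually travelled


def success_rate(episodes: Sequence[Episode]) -> float:
    """SR: fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("at least one episode is required")
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL: success weighted by (shortest path / max(agent path, shortest path))."""
    if not episodes:
        raise ValueError("at least one episode is required")
    total = 0.0
    for e in episodes:
        if e.success:
            # A detour lowers the contribution; a failure contributes zero.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),   # optimal route
    Episode(success=True, shortest_path=10.0, agent_path=20.0),   # success with detour
    Episode(success=False, shortest_path=10.0, agent_path=5.0),   # failure
    Episode(success=True, shortest_path=8.0, agent_path=8.0),     # optimal route
]
print(success_rate(episodes))  # 0.75
print(spl(episodes))           # 0.625
```

In this toy data, three of four episodes succeed, so SR is 0.75, while the detour in the second episode pulls SPL down to 0.625. Under this definition SPL can never exceed SR, which is why the two are usually reported together.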

Keywords:
This article is indexed in CNKI and other databases.