Egocentric action recognition method based on multi-scale temporal interaction
LUO Xiangkui,GAO Chenqiang,CHEN Xinyue,WANG Shengwei.Egocentric action recognition method based on multi-scale temporal interaction[J].Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition),2024,36(3):524-532.
Authors:LUO Xiangkui  GAO Chenqiang  CHEN Xinyue  WANG Shengwei
Institution:School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China; Chongqing Key Laboratory of Signal and Information Processing, Chongqing 400065, P. R. China
Foundation item:Science and Technology Research Project of Chongqing Municipal Education Commission (KJZD-K202100606)
Abstract:For the egocentric action recognition task, most existing methods rely on auxiliary supervision beyond action category labels, such as object bounding boxes and eye-gaze data, to make deep neural networks focus on the regions containing hands and the objects they interact with. This requires extra manual annotation and complicates video feature extraction. To address this problem, a multi-scale temporal interaction module is proposed: 3D temporal convolutions of different scales allow the frame features extracted by a 2D network to interact along the temporal dimension, so that the features of each frame are fused with those of its neighboring frames. Supervised by action category labels alone, this multi-scale temporal interaction encourages the network to attend to the regions of egocentric videos where hands and their interacting objects appear. Experimental results show that the proposed method achieves higher recognition accuracy than existing egocentric action recognition methods.
Keywords:action recognition  egocentric vision  temporal interaction  deep learning
Received:2023-05-25; Revised:2024-02-26
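To make the idea in the abstract concrete, the following is a minimal PyTorch sketch of what such a multi-scale temporal interaction module could look like: 3D convolutions with several temporal kernel sizes are applied to the stacked per-frame features of a 2D backbone and fused residually. The class name, the kernel sizes (3, 5, 7), and the residual-sum fusion are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class MultiScaleTemporalInteraction(nn.Module):
    """Illustrative sketch: fuse each frame's features with those of its
    temporal neighbors via 3D convolutions that slide only along time."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One temporal branch per scale: a (k, 1, 1) kernel spans k frames
        # but leaves the spatial layout of the 2D backbone features intact.
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=(k, 1, 1),
                      padding=(k // 2, 0, 0), bias=False)
            for k in kernel_sizes
        )

    def forward(self, x):
        # x: (batch, channels, time, height, width), i.e. per-frame feature
        # maps from a 2D backbone stacked along the time axis.
        out = x
        for branch in self.branches:
            # Residual fusion (an assumption): add the temporal context
            # gathered at this scale to the original frame features.
            out = out + branch(x)
        return out

# Usage: 2 clips, 256-channel features, 8 frames, 7x7 spatial maps.
feats = torch.randn(2, 256, 8, 7, 7)
module = MultiScaleTemporalInteraction(256)
print(module(feats).shape)  # torch.Size([2, 256, 8, 7, 7])

Because the convolutions act only along the time axis, the temporal length and spatial resolution of the features are preserved, so a sketch like this could sit between a 2D backbone and its classification head without changing either.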