A video violence detection method based on multi-modal feature fusion

Citation: MA Jingyuan, LIU Kun, FU Huiyuan. A video violence detection method based on multi-modal feature fusion[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2021, 33(5): 861-867.
Authors: MA Jingyuan  LIU Kun  FU Huiyuan
Institution: BUPT Sensing Technology Research Institute (Jiangsu) Co., Ltd., Wuxi 214115, P. R. China; Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, P. R. China
Funding: National Natural Science Foundation of China (61872047); BUPT-Transsion "Visual Perception and Computing" Joint Laboratory project

Abstract: Violence detection is a common task in the intelligent analysis of video content, with important applications in Internet video content review, film and television analysis, video surveillance, and other fields. For the task of detecting violence in video, this paper proposes a method that combines a relation network with an attention mechanism to fuse the video's multi-modal features. The approach first uses deep learning techniques to extract several modal features from the video, such as audio, optical-flow, and RGB-frame features; it then combines the different modal features and uses a relation network to model the relationships among the modalities. Next, a multi-head attention module based on a deep neural network is designed to learn multiple attention weights that focus on different aspects of the video, so as to generate highly discriminative video-level features. The proposed method can fuse multiple modalities in the video and improves the accuracy of violence detection. Experimental results from training and validation on a public dataset show that the proposed multi-modal feature fusion method has clear advantages over both a single-modal baseline and an existing multi-modal fusion method, improving detection accuracy by 4.89% and 1.66%, respectively.

Keywords: attention mechanism  relation network  multimodal fusion  violence detection  video content analysis
Received: 2021-08-21
Revised: 2021-09-15
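The paper's exact architecture is not given on this page, but the pipeline the abstract describes (pairwise relation modelling across modalities, then multi-head attention pooling into a video-level feature) can be sketched as follows. All dimensions, layer sizes, and function names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16   # feature dimension per modality (illustrative)
H = 4    # number of attention heads (illustrative)

def mlp(x, W1, b1, W2, b2):
    # two-layer perceptron with ReLU, used as the shared relation function g(.)
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

def relation_fuse(modal_feats, params):
    """Relation-network-style fusion: embed every ordered pair of
    modality features with a shared MLP and sum the pair embeddings."""
    out = np.zeros(D)
    for i in range(len(modal_feats)):
        for j in range(len(modal_feats)):
            if i == j:
                continue
            pair = np.concatenate([modal_feats[i], modal_feats[j]])
            out += mlp(pair, *params)
    return out

def multi_head_attention_pool(segments, Wq):
    """Multi-head attention pooling: each head scores every segment,
    softmax-normalises the scores, and returns its weighted average;
    the head outputs are concatenated into one video-level feature."""
    heads = []
    for h in range(H):
        scores = segments @ Wq[h]                # one score per segment
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                 # softmax over segments
        heads.append(weights @ segments)         # weighted average, shape (D,)
    return np.concatenate(heads)                 # shape (H*D,)

# toy video: 8 segments, each carrying 3 modality vectors
# (standing in for audio, optical-flow, and RGB-frame features)
params = (rng.normal(size=(2 * D, D)) * 0.1, np.zeros(D),
          rng.normal(size=(D, D)) * 0.1, np.zeros(D))
segments = np.stack([
    relation_fuse([rng.normal(size=D) for _ in range(3)], params)
    for _ in range(8)
])
video_feat = multi_head_attention_pool(segments, rng.normal(size=(H, D)))
print(video_feat.shape)  # (64,) = H heads x D dims
```

In a real model the MLP and attention weights would be trained end-to-end with a classifier head; the sketch only shows how the two fusion stages compose.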
This article is indexed in Wanfang Data and other databases.