Abstract: Visual emotion analysis aims to model the emotional responses that visual stimuli evoke in humans, and has attracted growing attention in multimedia fields such as image-sharing platforms and social networking. Traditional image emotion analysis focuses on single-label emotion classification, which ignores the complexity of the emotions an image can express and its latent emotion distribution, and fails to capture the correlation between the different emotions conveyed by the same image. To address these problems, ViT and ResNet networks are used to extract multi-scale emotional features that fuse global and local information, and a label distribution learning method is adopted for image emotion prediction. The approach achieves significant results on the publicly available Flickr_LDL and Twitter_LDL datasets, demonstrating its effectiveness.
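To make the label distribution learning setup concrete, the sketch below shows the core idea in plain Python: two feature vectors (stand-ins for a ViT global feature and a ResNet local feature) are fused by concatenation, mapped through a linear layer to logits over eight emotion categories, and turned into a distribution with softmax; training would minimize the KL divergence between this prediction and an annotated emotion distribution. All names, dimensions, and the toy target distribution are illustrative assumptions, not the paper's actual architecture.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax: convert raw scores to a probability distribution.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(target, pred, eps=1e-12):
    # KL(target || pred): a standard training objective in label distribution learning.
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, pred))

def predict_distribution(global_feat, local_feat, weights, bias):
    # Fuse the global and local feature views by concatenation,
    # then apply a linear map to emotion logits and normalize.
    fused = global_feat + local_feat
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

random.seed(0)
NUM_EMOTIONS = 8   # e.g. Mikels' eight emotion categories, as used by Flickr_LDL/Twitter_LDL
DIM = 4            # toy feature dimension for each branch (illustrative only)

global_feat = [random.gauss(0, 1) for _ in range(DIM)]  # stand-in for a ViT global feature
local_feat = [random.gauss(0, 1) for _ in range(DIM)]   # stand-in for a ResNet local feature
weights = [[random.gauss(0, 0.1) for _ in range(2 * DIM)] for _ in range(NUM_EMOTIONS)]
bias = [0.0] * NUM_EMOTIONS

pred = predict_distribution(global_feat, local_feat, weights, bias)
target = [0.4, 0.2, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05]  # hypothetical annotated distribution
loss = kl_divergence(target, pred)
print(f"predicted distribution sums to {sum(pred):.6f}, KL loss = {loss:.4f}")
```

In a real model the linear map would be a learned head on top of the fused deep features, and the KL loss would be minimized over the annotated distributions rather than evaluated once.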