Hierarchical Attention-Based Multimodal Fusion Network for Video Emotion Recognition

Open Access

25 September 2021

journal article
research article
Published by Hindawi Limited in Computational Intelligence and Neuroscience

Vol. 2021, 1-11
https://doi.org/10.1155/2021/5585041

Abstract

Context, such as scenes and objects, plays an important role in video emotion recognition, and incorporating context information can further improve recognition accuracy. Although previous research has considered context information, it often ignores the fact that different images may carry different emotional cues. To address the differences in emotional content across modalities and across images, this paper proposes a hierarchical attention-based multimodal fusion network for video emotion recognition, which consists of a multimodal feature extraction module and a multimodal feature fusion module. The multimodal feature extraction module has three subnetworks that extract features from facial, scene, and global images. Each subnetwork consists of two branches: the first extracts the features of its modality, and the second generates an emotion score for each image. The features and emotion scores of all images in a modality are aggregated to generate the emotion feature of that modality. The multimodal feature fusion module takes the multimodal features as input and generates an emotion score for each modality. Finally, the features and emotion scores of the modalities are aggregated to produce the final emotion representation of the video. Experimental results show that the proposed method is effective on the emotion recognition dataset.
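The two-level aggregation described in the abstract can be illustrated with a minimal sketch. The abstract says only that features and emotion scores are "aggregated"; the softmax normalisation and weighted sum used here are assumptions, as are the function names `softmax` and `attention_pool` and all numeric values. In the proposed network the scores come from learned branches, whereas this sketch takes them as given inputs.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalise raw scores into weights that sum to 1 (assumed normalisation)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features: Sequence[Sequence[float]],
                   scores: Sequence[float]) -> List[float]:
    """Score-weighted sum of feature vectors (hypothetical aggregation rule)."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


# Level 1: pool per-image features within each modality.
# Toy 2-dimensional features and made-up per-image emotion scores.
face_feature = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
scene_feature = attention_pool([[0.2, 0.8], [0.6, 0.4]], [1.5, -0.5])
global_feature = attention_pool([[0.9, 0.1]], [0.3])

# Level 2: pool the three modality features using per-modality scores.
modality_scores = [2.0, 0.5, 0.5]  # made-up values for face, scene, global
video_feature = attention_pool(
    [face_feature, scene_feature, global_feature], modality_scores
)
```

With equal scores the pooling reduces to a plain average, so the hierarchy only departs from mean pooling when some images or modalities are scored as more emotionally informative than others.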

Funding Information

He’nan Educational Committee (21A520006, 182102310919)
