Hierarchical Attention-Based Multimodal Fusion Network for Video Emotion Recognition
Open Access
- 25 September 2021
- journal article
- research article
- Published by Hindawi Limited in Computational Intelligence and Neuroscience
- Vol. 2021, 1-11
- https://doi.org/10.1155/2021/5585041
Abstract
Context, such as scenes and objects, plays an important role in video emotion recognition, and incorporating context information can further improve recognition accuracy. Although previous research has considered context information, it has often ignored the fact that different images may carry different emotional clues. To address the differences in emotional content between modalities and between images, this paper proposes a hierarchical attention-based multimodal fusion network for video emotion recognition, which consists of a multimodal feature extraction module and a multimodal feature fusion module. The feature extraction module has three subnetworks that extract features from facial, scene, and global images. Each subnetwork consists of two branches: the first extracts the features of its modality, and the second generates an emotion score for each image. The features and emotion scores of all images in a modality are aggregated to generate the emotion feature of that modality. The fusion module takes the multimodal features as input and generates an emotion score for each modality. Finally, the features and emotion scores of the modalities are aggregated to produce the final emotion representation of the video. Experimental results show that the proposed method is effective on the emotion recognition dataset.
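The two-level aggregation described in the abstract can be illustrated with a minimal sketch. The abstract does not say how features and scores are aggregated, so this example assumes a softmax-weighted sum at both levels. The function names (`softmax`, `attention_pool`, `fuse_video`), the toy feature vectors, and the hand-supplied scores are all hypothetical; in the paper the scores are produced by learned network branches, which are not modeled here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, scores):
    """Weighted sum of equal-length feature vectors.

    Assumption: the weights are softmax(scores); the abstract only
    says that features and scores are "aggregated".
    """
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

def fuse_video(modalities, modality_scores):
    """Hierarchical pooling: images -> modality feature -> video feature.

    modalities: dict mapping a modality name to (image_features, image_scores).
    modality_scores: dict mapping a modality name to one score (hypothetical
    stand-in for the output of the fusion module).
    """
    names = sorted(modalities)
    # Level 1: pool the images of each modality into one feature vector.
    pooled = [attention_pool(*modalities[n]) for n in names]
    # Level 2: pool the modality features into one video representation.
    return attention_pool(pooled, [modality_scores[n] for n in names])

# Toy input: 2-dimensional features for facial, scene, and global images.
video = {
    "face": ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    "scene": ([[2.0, 2.0]], [0.5]),
    "global": ([[0.0, 4.0]], [1.0]),
}
rep = fuse_video(video, {"face": 0.0, "scene": 0.0, "global": 0.0})
print(rep)
```

With equal scores, each level reduces to a plain average; raising one image's or one modality's score shifts the final representation toward that input, which is the behavior the hierarchical attention is meant to provide.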
Funding Information
- He’nan Educational Committee (21A520006, 182102310919)