Feature Fusion for Multimodal Emotion Recognition Based on Deep Canonical Correlation Analysis

14 September 2021

journal article
research article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Signal Processing Letters

Vol. 28 (ISSN 1070-9908), pp. 1898-1902
https://doi.org/10.1109/lsp.2021.3112314

Abstract

Fusing multimodal features is an important problem in video emotion recognition. With the development of deep learning, directly fusing the feature matrices of each modality at the feature level through neural networks has become the mainstream approach. However, unlike unimodal problems, multimodal analysis requires finding the correlations between modalities as much as discovering effective unimodal features. To address the deficiency in uncovering the intrinsic relationships between modalities, this letter proposes a novel modularized multimodal emotion recognition model based on deep canonical correlation analysis (MERDCCA). In MERDCCA, four utterances are gathered into a group, and each utterance provides text, audio, and visual information as multimodal input. Gated recurrent unit layers extract the unimodal features. A deep canonical correlation analysis module built on an encoder-decoder network extracts cross-modal correlations by maximizing the correlation between modalities. Experiments on two public datasets show that MERDCCA achieves better results.
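
As an illustration of the quantity that canonical correlation analysis maximizes, the following minimal sketch computes the sum of canonical correlations between two feature views with NumPy. It is not the MERDCCA implementation: the function names, the ridge term `reg`, and the use of plain linear CCA on fixed features (rather than on learned encoder outputs) are assumptions made for this example. In a deep variant, the two views would be network outputs and this sum would serve as the training objective.

```python
import numpy as np


def inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T


def total_canonical_correlation(x, y, reg=1e-4):
    """Sum of canonical correlations between two views.

    x: array of shape (n_samples, dx), e.g. features of one modality
    y: array of shape (n_samples, dy), e.g. features of another modality
    reg: small ridge term (an assumption of this sketch) that keeps the
         covariance matrices invertible
    """
    n = x.shape[0]
    # Center each view.
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    # Regularized within-view covariances and the cross-covariance.
    sxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    syy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    sxy = xc.T @ yc / (n - 1)
    # The singular values of the whitened cross-covariance are the
    # canonical correlations.
    t = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)
    return np.linalg.svd(t, compute_uv=False).sum()
```

Called on two views that are linearly related, the function returns a value close to the feature dimension; called on statistically independent views, it returns a much smaller value.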

Funding Information

Basic Research Strengthening Program of China (2020-JCJQ-ZD-015-00-02)
National Natural Science Foundation for Distinguished Young Scholars (62025602)
National Natural Science Foundation of China (U1803263, 61871470, 11931015, 61502391)

Cited by 17 articles