Feature Fusion for Multimodal Emotion Recognition Based on Deep Canonical Correlation Analysis
- 14 September 2021
- research article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Signal Processing Letters
- Vol. 28 (ISSN 1070-9908), pp. 1898-1902
- https://doi.org/10.1109/lsp.2021.3112314
Abstract
Fusion of multimodal features is a central problem in video emotion recognition. With the development of deep learning, directly fusing the feature matrices of each modality through neural networks at the feature level has become the mainstream approach. However, unlike unimodal tasks, multimodal analysis requires discovering the correlations between different modalities as much as extracting effective unimodal features. To address this deficiency in uncovering the intrinsic relationships among modalities, a novel modularized multimodal emotion recognition model based on deep canonical correlation analysis (MERDCCA) is proposed in this letter. In MERDCCA, every four utterances are gathered into a new group, and each utterance contains text, audio, and visual information as multimodal input. Gated recurrent unit layers extract the unimodal features. A deep canonical correlation analysis module based on an encoder-decoder network is designed to extract cross-modal correlations by maximizing the relevance between modalities. Experiments on two public datasets show that MERDCCA achieves better results.
Funding Information
- Basic Research Strengthening Program of China (2020-JCJQ-ZD-015-00-02)
- National Natural Science Foundation for Distinguished Young Scholars (62025602)
- National Natural Science Foundation of China (U1803263, 61871470, 11931015, 61502391)
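The abstract's key idea, maximizing the correlation between modalities in the spirit of canonical correlation analysis, can be sketched numerically. The snippet below is a minimal, hedged illustration (not the authors' implementation): it computes the sum of canonical correlations between two feature views, the classical quantity that DCCA-style objectives maximize; the function name and the small regularizer `reg` are our own choices.

```python
import numpy as np

def total_canonical_correlation(X, Y, reg=1e-4):
    """Sum of canonical correlations between two views X (n, d1) and
    Y (n, d2) -- the quantity a DCCA-style loss maximizes.
    `reg` is a small ridge term for numerical stability (an assumption,
    not taken from the paper)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center each view
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    # Singular values of T are the canonical correlations
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False).sum()
```

Two views generated from a shared latent signal score much higher than an unrelated pair, which is exactly why maximizing this quantity pulls the learned unimodal representations toward their common emotional content.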