Multimodal Speech Emotion Recognition Using Audio and Text
- 1 December 2018
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 112-118
- https://doi.org/10.1109/slt.2018.8639583
Abstract
Speech emotion recognition is a challenging task, and most well-performing classifiers to date have relied on audio features alone. In this paper, we propose a novel deep dual recurrent encoder model that uses text data and audio signals simultaneously to obtain a better understanding of speech data. Because emotional dialogue is composed of both sound and spoken content, our model encodes the information from the audio and text sequences with dual recurrent neural networks (RNNs) and then combines the information from these sources to predict the emotion class. This architecture analyzes speech data from the signal level up to the language level, and it thus exploits the information within the data more comprehensively than models that focus on audio features alone. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods in assigning data to one of four emotion categories (i.e., angry, happy, sad, and neutral) when applied to the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
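The dual recurrent encoder described above can be illustrated with a minimal NumPy sketch: one RNN encodes the audio frame sequence, another encodes the word-embedding sequence, and the two final hidden states are concatenated and fed to a softmax over the four emotion classes. This is not the authors' implementation; all dimensions and weights here are hypothetical placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_encode(x, W_h, W_x, b):
    """Run a simple (Elman) RNN over a sequence x of shape (T, d_in)
    and return the final hidden state of shape (d_h,)."""
    h = np.zeros(W_h.shape[0])
    for t in range(x.shape[0]):
        h = np.tanh(W_h @ h + W_x @ x[t] + b)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical dimensions: low-level audio features per frame, word-embedding
# size for text tokens, shared hidden size, and the four emotion classes
# (angry, happy, sad, neutral) used in the paper.
d_audio, d_text, d_h, n_classes = 13, 50, 32, 4

# Randomly initialized placeholder weights for the two encoders and the
# classification layer over the concatenated encodings.
W_ah, W_ax, b_a = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_audio)), np.zeros(d_h)
W_th, W_tx, b_t = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_text)), np.zeros(d_h)
W_out = rng.normal(size=(n_classes, 2 * d_h))

audio_seq = rng.normal(size=(100, d_audio))  # e.g., 100 audio frames
text_seq = rng.normal(size=(12, d_text))     # e.g., 12 token embeddings

# Encode each modality independently, then fuse by concatenation and classify.
h_audio = rnn_encode(audio_seq, W_ah, W_ax, b_a)
h_text = rnn_encode(text_seq, W_th, W_tx, b_t)
probs = softmax(W_out @ np.concatenate([h_audio, h_text]))
print(probs.shape)  # (4,): a distribution over the four emotion classes
```

The key design point the sketch captures is late fusion: each modality keeps its own recurrent encoder, so signal-level (audio) and language-level (text) information are summarized separately before being combined for the final prediction.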
This publication has 19 references indexed in Scilit:
- Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network. Published by Institute of Electrical and Electronics Engineers (IEEE), 2017
- Audio-based multimedia event detection using deep recurrent neural networks. Published by Institute of Electrical and Electronics Engineers (IEEE), 2016
- Effective Approaches to Attention-based Neural Machine Translation. Published by Association for Computational Linguistics (ACL), 2015
- GloVe: Global Vectors for Word Representation. Published by Association for Computational Linguistics (ACL), 2014
- Recent developments in openSMILE, the Munich open-source multimedia feature extractor. Published by Association for Computing Machinery (ACM), 2013
- Speech emotion recognition using Support Vector Machines. Published by Institute of Electrical and Electronics Engineers (IEEE), 2013
- Emotion recognition using a hierarchical binary decision tree approach. Speech Communication, 2011
- IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008
- Connectionist temporal classification. Published by Association for Computing Machinery (ACM), 2006
- NLTK. Published by Association for Computational Linguistics (ACL), 2004