Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network

23 March 2017

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

Abstract

This paper presents a method for speech emotion recognition using spectrograms and a deep convolutional neural network (CNN). Spectrograms generated from the speech signals are input to the deep CNN. The proposed model, consisting of three convolutional layers and three fully connected layers, extracts discriminative features from the spectrogram images and outputs predictions for seven emotions. In this study, we trained the proposed model on spectrograms obtained from the Berlin emotions dataset. Furthermore, we investigated the effectiveness of transfer learning for emotion recognition using a pre-trained AlexNet model. Preliminary results indicate that the proposed approach, based on a freshly trained model, performs better than the fine-tuned model and is capable of predicting emotions accurately and efficiently.
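
The abstract states that spectrograms generated from the speech signals are the input to the CNN, but it does not give the analysis parameters. As a rough illustration only, the following sketch computes a log-magnitude spectrogram with NumPy. The function name log_spectrogram, the frame length (256 samples), the hop size (128 samples), the Hann window, the decibel scaling, and the 16 kHz synthetic test tone are all assumptions chosen for the example, not values taken from the paper.

```python
import numpy as np


def log_spectrogram(signal, frame_len=256, hop=128, eps=1e-10):
    """Return a log-magnitude spectrogram of shape (n_freq_bins, n_frames).

    frame_len, hop and eps are illustrative defaults (assumptions),
    not parameters reported in the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")

    n_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)

    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )

    # Magnitude of the one-sided FFT of each frame: (n_frames, frame_len // 2 + 1).
    magnitude = np.abs(np.fft.rfft(frames, axis=1))

    # Convert to decibels; eps avoids log(0). Transpose so rows are frequency bins.
    return (20.0 * np.log10(magnitude + eps)).T


# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 440.0 * t)
spec = log_spectrogram(tone)
```

In a pipeline like the one described, an array such as spec would typically be rescaled and saved or resized as an image before being passed to the network; those steps depend on details the abstract does not provide.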

Cited by 211 articles