
Speech Emotion Recognition for Affective Human Robot Interaction

Abstract
We evaluate the performance of a speech emotion recognition method for affective human-robot interaction. In the proposed method, emotion is classified into six classes: angry, bored, happy, neutral, sad, and surprised. After applying noise reduction and speech detection, we obtain a feature vector for an utterance from statistics of phonetic and prosodic information. The phonetic information includes log energy, shimmer, formant frequencies, and Teager energy; the prosodic information includes pitch, jitter, and rate of speech. A pattern classifier based on Gaussian-kernel support vector machines then decides the emotion class of the utterance. To simulate a human-robot interaction situation, we record speech commands and dialogs uttered 2 m away from a microphone. Experimental results show that the proposed method achieves a classification accuracy of 58.6%, while human listeners achieve 60.4%, when the reference labels are given by the speakers' intention. With reference labels given by the listeners' majority decision, the proposed method achieves a classification accuracy of 51.2%.
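
As a rough illustration of the classification stage only (not the authors' implementation), the sketch below trains a Gaussian (RBF) kernel SVM on utterance-level feature statistics to predict the six emotion classes; the feature matrix, labels, and dimensions are hypothetical placeholders.

```python
# Illustrative sketch: a Gaussian (RBF) kernel SVM over utterance-level
# feature statistics, as described in the abstract. Feature extraction is
# assumed to be done elsewhere; X and y below are random placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

EMOTIONS = ["angry", "bored", "happy", "neutral", "sad", "surprised"]

# Hypothetical data: one row of utterance-level statistics per utterance.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 40))            # 120 utterances, 40-dim feature vectors
y = rng.integers(0, len(EMOTIONS), 120)   # reference emotion labels

# Standardize features, then classify with a Gaussian-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X, y)

# Predict the emotion class of a new utterance's feature vector.
print(EMOTIONS[clf.predict(X[:1])[0]])
```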

Presented By:
Kwang-Dong Jang and Oh-Wook Kwon
Department of Control and Instrumentation Engineering, Chungbuk National University, Korea
{kdjang,owkwon}@chungbuk.ac.kr

1. Introduction
A human conveys emotion as well as linguistic information via speech signals. The emotion in speech makes verbal communication natural, emphasizes a speaker's intention, and reveals the speaker's psychological state. Recently there have been many research activities on affective human-robot interaction with humanoid robots that recognize the emotion expressed in facial images and speech. In particular, speech emotion recognition requires less hardware and computational complexity than facial emotion recognition. A speech emotion recognizer can be used in an interactive intelligent robot that responds appropriately to a user's command according to the user's emotional state. It can also be embedded in a music player that suggests a music list suited to the user's emotional state.

Emotion can be recognized by using acoustic information and/or linguistic information. Emotion recognition from linguistic information is done by spotting exclamatory words in input utterances and thus cannot be used when there are no exclamatory words. Acoustic information extracted from speech signals is more flexible for emotion recognition than linguistic information because it does not require a speech recognition system to spot exclamatory words and can be extended to any other language. Among the many features suggested for speech emotion recognition, we select the following acoustic information: pitch, energy, formants, tempo, duration, jitter, shimmer, mel-frequency cepstral coefficients (MFCC), linear predictive coding (LPC) coefficients, and Teager energy. A pattern classifier based on support vector machines (SVM) classifies the emotion by using the feature vector obtained from statistics of the acoustic information. We compare the performance of automatic emotion recognition when the reference labels are given by speakers and by human listeners.

This paper is organized as follows: Section 2 explains the base features extracted from speech and the pattern classifier. Section 3 describes the experimental results when the reference labels are supplied by human listeners and speakers. Section 4 concludes the paper.
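
To make the feature set more concrete, the sketch below computes a few of the listed acoustic measures (log energy, pitch, MFCCs, Teager energy, and a crude jitter estimate) with librosa and reduces them to utterance-level statistics. It is an illustrative approximation under stated assumptions, not the paper's feature extractor; the file name and parameter values are placeholders.

```python
# Illustrative feature-extraction sketch (not the paper's implementation):
# frame-wise acoustic measures reduced to utterance-level mean/std statistics.
import numpy as np
import librosa

def utterance_features(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)

    # Log energy from frame-wise RMS.
    rms = librosa.feature.rms(y=y)[0]
    log_energy = np.log(rms + 1e-10)

    # Pitch (F0) track; unvoiced frames come back as NaN and are dropped.
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]

    # Crude jitter estimate: relative frame-to-frame F0 variation.
    jitter = np.mean(np.abs(np.diff(f0))) / np.mean(f0) if f0.size > 1 else 0.0

    # MFCCs and the Teager energy operator x[n]^2 - x[n-1]*x[n+1].
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    teager = y[1:-1] ** 2 - y[:-2] * y[2:]

    stats = lambda v: [np.mean(v), np.std(v)]
    return np.hstack([stats(log_energy),
                      stats(f0) if f0.size else [0.0, 0.0],
                      [jitter],
                      stats(teager),
                      mfcc.mean(axis=1), mfcc.std(axis=1)])

# Usage (hypothetical file name):
# vec = utterance_features("command_2m.wav")
```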

For the full report, please see http://eurasipProceedings/Ext/SPECOM2006/papers/077.pdf