Exploring Speech Emotion Recognition Based on XLSR-53 Model and UMAP
Tags: Affective computing, Feature extraction, Machine learning, Natural language processing, and Speech emotion recognition
Abstract:
Speech Emotion Recognition (SER) is a widely studied technique for identifying emotions from speech. This paper explores the use of the XLSR-53 model combined with UMAP for SER across four datasets (ESD, EMOVO, EMODB, and emoUERJ). The study aimed to assess the model's effectiveness in recognizing emotions across different spoken languages and linguistic variations, examining its ability to extract features from speech signals. The XLSR-53 model, known for its robust generalization capabilities, was applied alongside UMAP for dimensionality reduction. This combination allowed us to evaluate the model's proficiency in distinguishing emotions such as happiness, anger, sadness, and surprise across a variety of languages, including English, Chinese, Italian, German, and Brazilian Portuguese. The datasets offered a broad range of speakers and emotional expressions, enabling comprehensive testing and comparison of the model. Through careful analysis, we demonstrated how the model generalizes effectively without additional pre-training or fine-tuning, solidifying its potential as a powerful tool for speech emotion recognition, particularly in multilingual and multicultural contexts.
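As a minimal illustration of the pipeline described in the abstract, the sketch below loads the publicly available pre-trained XLSR-53 checkpoint, mean-pools its hidden states into one feature vector per utterance, and projects those vectors to two dimensions with UMAP. The checkpoint name, the mean-pooling step, the placeholder audio, and the UMAP settings are assumptions made for illustration, not the paper's exact configuration.

```python
import numpy as np
import torch
import umap
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint: the publicly released XLSR-53 weights, used as-is (no fine-tuning).
MODEL_NAME = "facebook/wav2vec2-large-xlsr-53"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME)
model.eval()

def utterance_embedding(waveform: np.ndarray, sampling_rate: int = 16000) -> np.ndarray:
    """Return one fixed-size vector per utterance by mean-pooling XLSR-53 hidden states."""
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).last_hidden_state  # shape: (1, frames, 1024)
    return hidden_states.mean(dim=1).squeeze(0).numpy()

# Placeholder audio: in practice these would be 16 kHz mono utterances
# from ESD, EMOVO, EMODB, or emoUERJ, each labeled with an emotion.
waveforms = [np.random.randn(16000).astype(np.float32) for _ in range(32)]
features = np.stack([utterance_embedding(w) for w in waveforms])

# Reduce the pooled features to 2-D with UMAP so emotion clusters can be inspected visually.
reducer = umap.UMAP(n_components=2, random_state=42)
embedding_2d = reducer.fit_transform(features)
print(embedding_2d.shape)  # (32, 2)
```

In this kind of setup, the emotion labels of the utterances would then be overlaid on the 2-D projection (or fed to a simple classifier on the pooled features) to judge how well the pre-trained representation separates emotions across languages.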