Title: Adaptive Multimodal Fusion for Interpretable and Efficient Conversational Emotion Recognition

Conference: ACIIDS2026

Tags: Adaptive fusion, Class imbalance, Conversational AI, Interpretable AI and Multimodal emotion recognition

Abstract: Conversational emotion recognition requires integrating text, speech, and visual cues in real time, often under noisy and resource-constrained conditions. We propose an adaptive multimodal fusion method that learns how different emotions rely on different modalities while remaining efficient and interpretable. Three pretrained experts for language, speech, and vision are partially fine-tuned on conversational data and projected into a shared representation space, where cross-modal attention enables mutual feature refinement. An emotion-aware gating mechanism dynamically assigns modality weights per input, revealing a learned emotion–modality affinity matrix. Seven parallel classifiers then operate on the gated representations to capture emotion-specific decision boundaries and improve minority-class recognition. We evaluate our adaptive multimodal fusion on MELD, where our approach surpasses state-of-the-art accuracy with lower model size and inference cost while exposing interpretable modality dominance patterns (e.g., happiness is text-dominant, disgust face-dominant), providing both performance gains and insight into multimodal emotion expression.
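The abstract outlines the pipeline (shared projection, cross-modal attention, emotion-aware gating, seven parallel classifiers) but this page carries no implementation details. Below is a minimal PyTorch sketch of how such a design could fit together; the feature dimensions, the shared attention block, the linear gating network, and the one-vs-rest heads are all illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of the fusion design described in the abstract.
# All dimensions, layer choices, and the gating formulation are assumptions.
import torch
import torch.nn as nn

NUM_EMOTIONS = 7   # MELD's seven emotion classes
MODALITIES = 3     # text, speech, vision
D = 256            # assumed shared representation size


class AdaptiveMultimodalFusion(nn.Module):
    def __init__(self, text_dim=768, speech_dim=512, vision_dim=512, d=D):
        super().__init__()
        # Project each pretrained expert's output into a shared space
        # (the expert encoders themselves are assumed to run upstream).
        self.proj = nn.ModuleList([
            nn.Linear(text_dim, d),
            nn.Linear(speech_dim, d),
            nn.Linear(vision_dim, d),
        ])
        # Cross-modal attention for mutual feature refinement: each
        # modality token attends over all three (one shared block here).
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        # Emotion-aware gating: per-input modality weights, one row per
        # emotion, which together expose an emotion-modality affinity matrix.
        self.gate = nn.Linear(MODALITIES * d, NUM_EMOTIONS * MODALITIES)
        # Seven parallel classifiers, one logit per emotion.
        self.heads = nn.ModuleList(
            [nn.Linear(d, 1) for _ in range(NUM_EMOTIONS)]
        )

    def forward(self, text, speech, vision):
        # Stack projected features: (batch, modality, d).
        h = torch.stack(
            [p(x) for p, x in zip(self.proj, (text, speech, vision))], dim=1
        )
        # Cross-modal refinement via self-attention over the three modalities.
        h, _ = self.attn(h, h, h)
        # Per-input, per-emotion modality weights (softmax over modalities).
        gates = self.gate(h.flatten(1)).view(-1, NUM_EMOTIONS, MODALITIES)
        gates = gates.softmax(dim=-1)                   # (batch, 7, 3)
        # Emotion-specific fused representations: weighted modality sums.
        fused = torch.einsum("bem,bmd->bed", gates, h)  # (batch, 7, d)
        # Each emotion's dedicated head scores its own fused representation.
        logits = torch.cat(
            [head(fused[:, e]) for e, head in enumerate(self.heads)], dim=1
        )
        return logits, gates


if __name__ == "__main__":
    model = AdaptiveMultimodalFusion()
    t, s, v = torch.randn(2, 768), torch.randn(2, 512), torch.randn(2, 512)
    logits, gates = model(t, s, v)
    print(logits.shape, gates.shape)  # torch.Size([2, 7]) torch.Size([2, 7, 3])
```

Averaging `gates` over a dataset would yield a 7x3 table of emotion-modality weights, one plausible reading of the learned affinity matrix and modality dominance patterns the abstract refers to (e.g., a high text weight in the happiness row).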
