Cross-Modal Attention Network for Audio-Visual Event Localization

EasyChair Preprint 11812, 10 pages. Date: January 19, 2024

Abstract

Audio-visual event localization has been an active research topic in computational scene analysis and machine perception; it aims to predict which temporal segment of an input video contains an audio-visual event and which category the event belongs to. Learning an effective fusion of multi-modal features is key to audio-visual event localization. In this paper, we propose a novel cross-modal attention network based on the self-attention mechanism to extract effective audio and visual features for audio-visual event localization. Specifically, we propose a Dynamic Fusion with Intra- and Inter-modality Attention (DFIIA) module, which alternately exchanges dynamic information within and across the audio and visual modalities. Furthermore, audio-guided visual attention is used to focus the model on event-relevant visual regions. We validate the proposed method on the Audio-Visual Event (AVE) dataset. Extensive experimental results demonstrate the effectiveness of our method, which outperforms state-of-the-art approaches in the supervised AVE setting.

Keyphrases: Audio-Visual Event Localization, Dynamic attention, Intra- and Inter-modality attention, cross-modal
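The abstract does not include implementation details, so the following PyTorch sketch only illustrates the two ideas it names: audio-guided visual attention over spatial regions, and an alternating intra-/inter-modality attention step in the spirit of DFIIA. All module names, feature dimensions, and the exact attention formulation here are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioGuidedVisualAttention(nn.Module):
    """Hypothetical sketch: the audio feature of a segment acts as a query
    that weights spatial regions of the corresponding visual feature map."""

    def __init__(self, audio_dim=128, visual_dim=512, hidden_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, audio, visual):
        # audio:  (batch, audio_dim)            one feature vector per segment
        # visual: (batch, regions, visual_dim)  spatial grid flattened to regions
        q = self.audio_proj(audio).unsqueeze(1)      # (batch, 1, hidden)
        k = self.visual_proj(visual)                 # (batch, regions, hidden)
        att = self.score(torch.tanh(q + k))          # (batch, regions, 1)
        att = F.softmax(att, dim=1)
        # Attention-weighted sum over regions -> event-relevant visual feature.
        return (att * visual).sum(dim=1)             # (batch, visual_dim)


class CrossModalAttentionBlock(nn.Module):
    """Hypothetical sketch of one intra-/inter-modality step: each modality
    first attends over its own temporal segments (intra-modality), then
    attends to the other modality (inter-modality)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.intra_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_av = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_va = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, v):
        # a, v: (batch, segments, dim) audio / visual segment features
        a = a + self.intra_a(a, a, a)[0]   # intra-modality (self) attention
        v = v + self.intra_v(v, v, v)[0]
        a = a + self.inter_av(a, v, v)[0]  # inter-modality: audio queries visual
        v = v + self.inter_va(v, a, a)[0]  # inter-modality: visual queries audio
        return a, v
```

Stacking several such blocks and alternating the intra- and inter-modality passes is one plausible way to realize the "dynamic information exchange between and across modalities" described above; per-segment event predictions could then be made from the fused audio and visual features.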