Cross-Modal Attention Network for Audio-Visual Event Localization

EasyChair Preprint 11812, 10 pages. Date: January 19, 2024

Abstract

Audio-visual event localization has been an active research topic in computational scene analysis and machine perception; it aims to predict which temporal segment of an input video contains an audio-visual event and which category the event belongs to. Learning an effective fusion of multi-modal features is key to audio-visual event localization. In this paper, we propose a novel cross-modal attention network based on the self-attention mechanism to extract effective audio and visual features for audio-visual event localization. Specifically, we propose a Dynamic Fusion with Intra- and Inter-modality Attention (DFIIA) module, which alternately exchanges dynamic information within and across the audio and visual modalities. Furthermore, audio-guided visual attention is used to focus the model on event-relevant visual regions. We validate the proposed method on the Audio-Visual Event (AVE) dataset. Extensive experimental results demonstrate the effectiveness of our method, which outperforms state-of-the-art approaches in the supervised AVE setting.

Keyphrases: Audio-Visual Event Localization, Dynamic attention, Intra- and Inter-modality attention, cross-modal
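The abstract does not include implementation details, so the following PyTorch sketch only illustrates the two ideas it names: audio-guided visual attention over spatial regions, and an alternating intra-/inter-modality attention step in the spirit of DFIIA. All module names, feature dimensions, and the exact attention formulation here are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioGuidedVisualAttention(nn.Module):
    """Hypothetical sketch: the audio feature of a segment acts as a query
    that weights spatial regions of the corresponding visual feature map."""

    def __init__(self, audio_dim=128, visual_dim=512, hidden_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, audio, visual):
        # audio:  (batch, audio_dim)            one feature vector per segment
        # visual: (batch, regions, visual_dim)  spatial grid flattened to regions
        q = self.audio_proj(audio).unsqueeze(1)      # (batch, 1, hidden)
        k = self.visual_proj(visual)                 # (batch, regions, hidden)
        att = self.score(torch.tanh(q + k))          # (batch, regions, 1)
        att = F.softmax(att, dim=1)
        # Attention-weighted sum over regions -> event-relevant visual feature.
        return (att * visual).sum(dim=1)             # (batch, visual_dim)


class CrossModalAttentionBlock(nn.Module):
    """Hypothetical sketch of one intra-/inter-modality step: each modality
    first attends over its own temporal segments (intra-modality), then
    attends to the other modality (inter-modality)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.intra_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_av = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_va = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, v):
        # a, v: (batch, segments, dim) audio / visual segment features
        a = a + self.intra_a(a, a, a)[0]   # intra-modality (self) attention
        v = v + self.intra_v(v, v, v)[0]
        a = a + self.inter_av(a, v, v)[0]  # inter-modality: audio queries visual
        v = v + self.inter_va(v, a, a)[0]  # inter-modality: visual queries audio
        return a, v
```

Stacking several such blocks and alternating the intra- and inter-modality passes is one plausible way to realize the "dynamic information exchange between and across modalities" described above; per-segment event predictions could then be made from the fused audio and visual features.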