
MFFCN: Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement

EasyChair Preprint no. 4850

6 pages · Date: January 2, 2021


The purpose of speech enhancement is to extract the target speech signal from a mixture of sounds generated by several sources. Speech enhancement can potentially benefit from visual information about the target speaker, such as lip movements and facial expressions, because the visual aspect of speech is essentially unaffected by the acoustic environment. To fuse audio and visual information, an audio-visual fusion strategy is proposed that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to more powerful representations that increase intelligibility in noisy conditions. The proposed model fuses audio and visual features layer by layer and feeds these fused features to each corresponding decoding layer. Experimental results show a relative improvement of 6% to 24% on the test sets over the audio-only modality, depending on the audio noise level. Moreover, PESQ increases significantly from 1.21 to 2.06 in our -15 dB SNR experiment.
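The layer-by-layer fusion with decoder skip connections described above can be sketched as follows. This is a minimal, hypothetical illustration in NumPy, not the paper's implementation: the linear-plus-ReLU stand-in replaces the convolution blocks, all layer counts and feature dimensions are invented, and the learned alignment module is approximated by simple concatenation and projection.

```python
import numpy as np

def layer(x, w):
    # Stand-in for a convolution block: linear map + ReLU (hypothetical)
    return np.maximum(x @ w, 0.0)

rng = np.random.default_rng(0)
T, D = 8, 16                      # time frames, feature dim (invented)
a = rng.normal(size=(T, D))       # audio features
v = rng.normal(size=(T, D))       # visual features (assumed time-aligned)

# Encoder: fuse audio and visual features at every layer, keep each
# fused map as a skip input for the matching decoder layer.
fused_skips = []
for _ in range(3):                # three encoder layers (invented depth)
    a = layer(a, rng.normal(size=(D, D)) * 0.1)
    v = layer(v, rng.normal(size=(D, D)) * 0.1)
    f = np.concatenate([a, v], axis=1)          # fuse the two modalities
    f = layer(f, rng.normal(size=(2 * D, D)) * 0.1)  # project back to D
    fused_skips.append(f)

# Decoder: each decoding layer receives the corresponding fused features.
x = fused_skips[-1]
for f in reversed(fused_skips):
    x = layer(np.concatenate([x, f], axis=1), rng.normal(size=(2 * D, D)) * 0.1)

print(x.shape)  # (8, 16): enhanced feature map, one vector per time frame
```

The design point this sketch illustrates is that fusion happens at every encoder depth rather than once at the input, so each decoder layer sees audio-visual evidence at its own level of abstraction.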

Keyphrases: audio-visual, multi-layer feature fusion convolution network (MFFCN), speech enhancement

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
  @booklet{EasyChair:4850,
  author = {Xinmeng Xu and Dongxiang Xu and Jie Jia and Yang Wang and Binbin Chen},
  title = {MFFCN: Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement},
  howpublished = {EasyChair Preprint no. 4850},
  year = {EasyChair, 2021}}