Attentive Fusion Enhanced Audio-Visual Encoding for Transformer Based Robust Speech Recognition

  • king
  • 20210506

Document pages: 6 pages

Abstract: Audio-visual information fusion improves speech recognition in complex acoustic scenarios, e.g., noisy environments. An effective fusion strategy must handle both audio-visual alignment and modality reliability. Unlike previous end-to-end approaches, where audio-visual fusion is performed after each modality has been encoded, in this paper we propose to integrate an attentive fusion block into the encoding process. It is shown that the proposed audio-visual fusion method in the encoder module can enrich audio-visual representations, as the relevance between the two modalities is leveraged. In line with the transformer-based architecture, we implement the embedded fusion block using multi-head attention based audio-visual fusion with one-way or two-way interactions. The proposed method can sufficiently combine the two streams and weaken the over-reliance on the audio modality. Experiments on the LRS3-TED dataset demonstrate that the proposed method can increase the recognition rate by 0.55%, 4.51% and 4.61% on average under the clean, seen and unseen noise conditions, respectively, compared to the state-of-the-art approach.
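
The abstract does not give the internals of the fusion block, so the following is only a minimal sketch of what multi-head attention based fusion with one-way or two-way interactions could look like. Everything in it is an assumption made for illustration: the function names (`cross_attention`, `init_params`, `fuse`), the shared feature size, the residual connections, the random weights, and the choice that the one-way variant lets the audio stream query the visual stream. It is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, d_model):
    # Hypothetical random projection matrices; a real model would learn these.
    scale = 1.0 / np.sqrt(d_model)
    names = ("wq", "wk", "wv", "wo")
    return {n: rng.normal(0.0, scale, (d_model, d_model)) for n in names}


def cross_attention(query_seq, context_seq, params, num_heads):
    """Multi-head attention in which one modality queries the other.

    query_seq:   (t_q, d_model) frames of the modality being enriched
    context_seq: (t_c, d_model) frames of the other modality
    Returns the attended features (t_q, d_model) and the attention
    weights (num_heads, t_q, t_c).
    """
    t_q, d_model = query_seq.shape
    t_c = context_seq.shape[0]
    d_head = d_model // num_heads

    def split(x, t):
        # (t, d_model) -> (num_heads, t, d_head)
        return x.reshape(t, num_heads, d_head).transpose(1, 0, 2)

    q = split(query_seq @ params["wq"], t_q)
    k = split(context_seq @ params["wk"], t_c)
    v = split(context_seq @ params["wv"], t_c)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, t_q, d_head)
    merged = heads.transpose(1, 0, 2).reshape(t_q, d_model)
    return merged @ params["wo"], weights


def fuse(audio, visual, params_a, params_v=None, num_heads=4):
    """One-way fusion if params_v is None, two-way fusion otherwise.

    One-way (assumed direction): audio frames attend to visual frames.
    Two-way: visual frames additionally attend to audio frames.
    Residual connections are an assumption of this sketch.
    """
    attended_a, _ = cross_attention(audio, visual, params_a, num_heads)
    audio_out = audio + attended_a
    if params_v is None:
        return audio_out, visual
    attended_v, _ = cross_attention(visual, audio, params_v, num_heads)
    return audio_out, visual + attended_v
```

In this sketch the two streams may have different frame rates (for example 40 audio frames against 10 video frames), because each attention weight matrix has one row per query frame and one column per context frame. The attention itself provides a soft alignment between the modalities, which is the property the abstract attributes to fusing inside the encoder rather than after it.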
