
Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention

  • 20210507


Document pages: 10 pages

Abstract: The major challenge in the audio-visual event localization task lies in how to fuse information from multiple modalities effectively. Recent works have shown that attention mechanisms are beneficial to the fusion process. In this paper, we propose a novel joint attention mechanism with multimodal fusion methods for audio-visual event localization. In particular, we present a concise yet valid architecture that effectively learns representations from multiple modalities in a joint manner. First, visual features are combined with auditory features and turned into joint representations. Next, the joint representations are used to attend to the visual features and the auditory features, respectively. This joint co-attention produces new visual and auditory features, so that each modality benefits from the other. Notably, the joint co-attention unit is recursive, meaning that it can be applied multiple times to obtain progressively better joint representations. Extensive experiments on the public AVE dataset show that the proposed method achieves significantly better results than the state-of-the-art methods.
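
The recursion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the modalities are combined or how attention is computed, so the concatenation followed by a linear projection, the scaled dot-product attention, the residual update, and all names (`joint_co_attention`, `attend`, `w_joint`, `steps`) and sizes here are assumptions chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(query, feats):
    # Scaled dot-product attention (an assumed choice): each row of
    # `query` forms a weighted average over the rows of `feats`.
    scores = query @ feats.T / np.sqrt(feats.shape[-1])
    return softmax(scores, axis=-1) @ feats


def joint_co_attention(visual, audio, w_joint, steps=2):
    # visual, audio: (T, d) per-segment features; w_joint: (2d, d).
    for _ in range(steps):
        # 1. Combine both modalities into a joint representation
        #    (assumed: concatenate, project, tanh).
        joint = np.tanh(np.concatenate([visual, audio], axis=-1) @ w_joint)
        # 2. Let the joint representation attend to each modality and
        #    produce new features (assumed: residual update).
        visual = visual + attend(joint, visual)
        audio = audio + attend(joint, audio)
        # 3. The new features feed the next pass, which is the recursion.
    return visual, audio


rng = np.random.default_rng(0)
T, d = 10, 16  # illustrative sizes, not taken from the paper
visual = rng.standard_normal((T, d))
audio = rng.standard_normal((T, d))
w_joint = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)

new_visual, new_audio = joint_co_attention(visual, audio, w_joint, steps=3)
print(new_visual.shape, new_audio.shape)
```

In this sketch the projection `w_joint` is shared across passes, so increasing `steps` adds no parameters; whether the paper shares weights in this way is not stated in the abstract.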
