Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing

  • king
  • 20210506

Document pages: 19 pages

Abstract: In this paper, we introduce a new problem, named audio-visual video parsing, which aims to parse a video into temporal event segments and label them as audible, visible, or both. Such a problem is essential for a complete understanding of the scene depicted in a video. To facilitate exploration, we collect a Look, Listen, and Parse (LLP) dataset to investigate audio-visual video parsing in a weakly-supervised manner. This task can be naturally formulated as a Multimodal Multiple Instance Learning (MMIL) problem. Concretely, we propose a novel hybrid attention network to explore unimodal and cross-modal temporal contexts simultaneously. We develop an attentive MMIL pooling method to adaptively explore useful audio and visual content across different temporal extents and modalities. Furthermore, we discover and mitigate modality bias and noisy-label issues with an individual-guided learning mechanism and a label smoothing technique, respectively. Experimental results show that the challenging audio-visual video parsing task can be achieved even with only video-level weak labels. Our proposed framework can effectively leverage unimodal and cross-modal temporal contexts and alleviate the modality bias and noisy-label problems.
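
The abstract does not give the formulas behind attentive MMIL pooling or label smoothing, so the following is only a minimal sketch of the general ideas, not the authors' implementation. It assumes a generic attention-weighted multiple-instance pooling (a softmax over per-instance scores) and the standard label smoothing rule for binary targets. The function names, the `epsilon` value, and the toy numbers are hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_mil_pool(instance_probs, attention_scores):
    """Pool instance-level event probabilities into one video-level value.

    instance_probs: one probability per (segment, modality) instance.
    attention_scores: unnormalized scores of the same length; in a real
    model these would be learned, here they are supplied by hand.
    """
    if len(instance_probs) != len(attention_scores):
        raise ValueError("probabilities and scores must have equal length")
    weights = softmax(attention_scores)
    return sum(w * p for w, p in zip(weights, instance_probs))


def smooth_labels(labels, epsilon=0.1):
    """Standard label smoothing for binary targets: pull 0/1 toward 0.5."""
    return [y * (1.0 - epsilon) + 0.5 * epsilon for y in labels]


# Toy example (assumed values): three audio segments followed by three
# visual segments, all for a single event category.
audio_probs = [0.9, 0.8, 0.1]
visual_probs = [0.2, 0.1, 0.1]
scores = [2.0, 1.5, -1.0, 0.0, -0.5, -1.0]

video_prob = attentive_mil_pool(audio_probs + visual_probs, scores)
soft_target = smooth_labels([1.0])[0]
print(f"video-level probability: {video_prob:.3f}, smoothed target: {soft_target:.2f}")
```

With only a video-level label available, a training loss would compare `video_prob` against `soft_target`. The attention weights determine which segments and which modality contribute most to the video-level prediction.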
