Telling Left from Right: Learning Spatial Correspondence of Sight and Sound

Document pages: 14 pages

Abstract: Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the sensory streams. We propose a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream. Our approach is simple yet effective. We train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams. To train and evaluate our method, we introduce a large-scale video dataset, YouTube-ASMR-300K, with spatial audio comprising over 900 hours of footage. We demonstrate that understanding spatial correspondence enables models to perform better on three audio-visual tasks, achieving quantitative gains over supervised and self-supervised baselines that do not leverage spatial audio cues. We also show how to extend our self-supervised approach to 360-degree videos with ambisonic audio.
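
The channel-flip pretext task described in the abstract can be illustrated with a minimal sketch of how training examples might be generated. This is not the authors' code: the function names `maybe_flip_channels` and `make_batch`, the `(left, right)` tuple representation, and the flip probability of 0.5 are all assumptions made for illustration. The video frames are left untouched; only the stereo audio is optionally swapped, and the binary label records whether a swap occurred.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical representation: a stereo clip is a (left, right) pair of sample lists.
Stereo = Tuple[List[float], List[float]]


def maybe_flip_channels(
    audio: Stereo, rng: random.Random, p_flip: float = 0.5
) -> Tuple[Stereo, int]:
    """Swap left/right channels with probability p_flip.

    Returns the (possibly swapped) audio and a label:
    1 if the channels were flipped, 0 otherwise.
    The value p_flip = 0.5 is an assumption, not taken from the abstract.
    """
    left, right = audio
    if rng.random() < p_flip:
        return (right, left), 1
    return (left, right), 0


def make_batch(
    clips: Sequence[Stereo], rng: random.Random, p_flip: float = 0.5
) -> Tuple[List[Stereo], List[int]]:
    """Build audio inputs and flip labels for a batch of clips.

    The paired video frames (not modeled here) would be passed to the
    network unchanged alongside each audio clip.
    """
    audios: List[Stereo] = []
    labels: List[int] = []
    for clip in clips:
        audio, label = maybe_flip_channels(clip, rng, p_flip)
        audios.append(audio)
        labels.append(label)
    return audios, labels


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy clip: the sound source is louder on the left channel.
    clip: Stereo = ([0.9, 0.8, 0.7], [0.1, 0.2, 0.1])
    audios, labels = make_batch([clip] * 4, rng)
    for audio, label in zip(audios, labels):
        print(label, audio)
```

A classifier trained on such pairs would receive the unmodified frames together with the possibly swapped audio and predict the label, which it can only do by relating where a source appears on screen to which channel is louder.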
