
Generating Visually Aligned Sound from Videos

  • king
  • 20210507


Document pages: 11 pages

Abstract: We focus on the task of generating sound from natural videos, where the sound should be both temporally and content-wise aligned with the visual signals. This task is extremely challenging because some sounds generated outside the camera's view cannot be inferred from the video content. The model may be forced to learn an incorrect mapping between visual content and these irrelevant sounds. To address this challenge, we propose a framework named REGNET. In this framework, we first extract appearance and motion features from video frames to better distinguish the object that emits sound from complex background information. We then introduce an innovative audio forwarding regularizer that directly takes the real sound as input and outputs bottlenecked sound features. Using both visual and bottlenecked sound features for sound prediction during training provides stronger supervision for the sound prediction. The audio forwarding regularizer can control the irrelevant sound component and thus prevent the model from learning an incorrect mapping between video frames and sound emitted by objects that are off screen. During testing, the audio forwarding regularizer is removed to ensure that REGNET produces purely aligned sound from visual features alone. Extensive evaluations based on Amazon Mechanical Turk demonstrate that our method significantly improves both temporal and content-wise alignment. Remarkably, our generated sound can fool humans with a 68.12% success rate. Code and pre-trained models are publicly available at this https URL
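
The training/testing asymmetry described in the abstract can be illustrated with a toy sketch. This is not the REGNET implementation: the feature sizes, the random weights standing in for trained networks, the function names, and the choice to model "removing" the regularizer by zeroing its bottleneck output are all assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T time steps, appearance/motion/audio feature widths,
# a deliberately narrow bottleneck, and a hidden width.
T, D_APP, D_MOT, D_AUD, D_BOTTLENECK, D_HID = 8, 10, 6, 12, 2, 10

# Random weights stand in for trained networks (assumption of this sketch).
W_vis = rng.normal(size=(D_APP + D_MOT, D_HID))
W_enc = rng.normal(size=(D_AUD, D_BOTTLENECK))
W_dec = rng.normal(size=(D_HID + D_BOTTLENECK, D_AUD))


def audio_forwarding_regularizer(real_sound):
    """Squeeze the real sound through a narrow bottleneck."""
    return np.tanh(real_sound @ W_enc)


def predict_sound(appearance, motion, real_sound=None):
    """Predict sound features from visual features.

    Training: pass real_sound so the bottlenecked features are also used.
    Testing: omit real_sound; the bottleneck is zeroed (an assumption here),
    so the prediction depends on visual features only.
    """
    visual = np.concatenate([appearance, motion], axis=1)
    hidden = np.tanh(visual @ W_vis)
    if real_sound is not None:
        bottleneck = audio_forwarding_regularizer(real_sound)
    else:
        bottleneck = np.zeros((visual.shape[0], D_BOTTLENECK))
    return np.concatenate([hidden, bottleneck], axis=1) @ W_dec


appearance = rng.normal(size=(T, D_APP))
motion = rng.normal(size=(T, D_MOT))
real_sound = rng.normal(size=(T, D_AUD))

train_pred = predict_sound(appearance, motion, real_sound)  # training mode
test_pred = predict_sound(appearance, motion)               # testing mode
```

Because the bottleneck is much narrower than the audio features, it can carry only a small amount of information about the real sound, which is the intuition behind letting it absorb the irrelevant component during training while the visual path learns the aligned component.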
