Deep Variational Generative Models for Audio-visual Speech Separation

  • 20210506

Document pages: 5 pages

Abstract: In this paper, we are interested in audio-visual speech separation given a single-channel audio recording as well as visual information (lip movements) associated with each speaker. We propose an unsupervised technique based on audio-visual generative modeling of clean speech. More specifically, during training, a latent variable generative model is learned from clean speech spectrograms using a variational auto-encoder (VAE). To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech (instead of clean speech) as well as the visual data. The visual modality also serves as a prior for latent variables, through a visual network. At test time, the learned generative model (both for speaker-independent and speaker-dependent scenarios) is combined with an unsupervised non-negative matrix factorization (NMF) variance model for background noise. All the latent variables and noise parameters are then estimated by a Monte Carlo expectation-maximization algorithm. Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches as well as a supervised deep learning-based technique.
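
To make the NMF variance model for background noise mentioned in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the noise power spectrogram V (frequency bins by time frames) is approximated by a low-rank non-negative product W H, fitted with the standard multiplicative updates for the Itakura-Saito divergence. The function names `nmf_is` and `is_divergence`, the rank, the iteration count, and the random toy data are all illustrative assumptions. In the paper these noise parameters are estimated jointly with the VAE latent variables by Monte Carlo EM, which this sketch does not reproduce.

```python
import numpy as np

def is_divergence(V, V_hat):
    """Itakura-Saito divergence summed over all time-frequency bins."""
    R = V / V_hat
    return float(np.sum(R - np.log(R) - 1.0))

def nmf_is(V, rank, n_iter=100, eps=1e-10, seed=0):
    """Fit V ~= W @ H with multiplicative updates for the IS divergence.

    V    : (F, N) strictly positive power spectrogram (assumed input)
    rank : number of NMF components (hypothetical choice)
    """
    rng = np.random.default_rng(seed)
    F, N = V.shape
    # Random non-negative initialisation of spectral patterns and activations.
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, N)) + eps
    for _ in range(n_iter):
        # Update the activations H with W held fixed.
        V_hat = W @ H + eps
        H *= (W.T @ (V / V_hat**2)) / (W.T @ (1.0 / V_hat))
        # Update the spectral patterns W with H held fixed.
        V_hat = W @ H + eps
        W *= ((V / V_hat**2) @ H.T) / ((1.0 / V_hat) @ H.T)
    return W, H

# Toy stand-in for a noise power spectrogram |STFT|^2 (random, not real audio).
rng = np.random.default_rng(1)
V = rng.random((64, 40)) + 1e-3
W, H = nmf_is(V, rank=4)
noise_variance = W @ H  # modelled noise variance per time-frequency bin
```

The product `W @ H` plays the role of the per-bin noise variance. A small rank keeps the noise model simple enough to be learned from the mixture itself, without noise training data, which is what makes this part of the pipeline unsupervised.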
