
A Spectral Energy Distance for Parallel Speech Synthesis

  • 20210506


Document pages: 19 pages

Abstract: Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical consistency guarantees. The distance can be calculated from minibatches without bias, and does not involve adversarial learning, yielding a stable and consistent method for training implicit generative models. Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently-proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently-proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators.
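
To make the idea of a minibatch energy distance on magnitude spectrograms concrete, here is a minimal NumPy sketch. It is not taken from the abstract: the window settings, the particular distance (L1 on magnitudes plus L2 on log-magnitudes), and the function names `magnitude_spectrogram`, `spectral_distance`, and `energy_distance_loss` are all assumptions chosen for illustration. It relies only on the standard form of a generalized energy distance, 2 E d(x, y) - E d(x, x') - E d(y, y'), and drops the E d(x, x') term because that term does not depend on the model.

```python
import numpy as np

def magnitude_spectrogram(wave, frame_len=256, hop=64):
    """Magnitude STFT of a 1-D waveform with a Hann window (illustrative settings)."""
    n_frames = 1 + (len(wave) - frame_len) // hop
    # Build an index matrix of shape (n_frames, frame_len) to slice overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, axis=-1))

def spectral_distance(a, b, eps=1e-6):
    """Hypothetical distance between two waveforms, computed on their spectrograms."""
    sa, sb = magnitude_spectrogram(a), magnitude_spectrogram(b)
    l1 = np.abs(sa - sb).sum()
    l2_log = np.sqrt(((np.log(sa + eps) - np.log(sb + eps)) ** 2).sum())
    return l1 + l2_log

def energy_distance_loss(real, fake_a, fake_b):
    """Minibatch estimate of 2 E d(x, y) - E d(y, y').

    real, fake_a, fake_b have shape (batch, samples). fake_a[i] and fake_b[i]
    are assumed to be two independent model samples for the same conditioning
    as real[i].
    """
    attract = np.mean([spectral_distance(x, y) for x, y in zip(real, fake_a)])
    repel = np.mean([spectral_distance(y, y2) for y, y2 in zip(fake_a, fake_b)])
    return 2.0 * attract - repel

# Usage with random stand-ins for real and generated audio.
rng = np.random.default_rng(0)
real = rng.standard_normal((4, 1024))
fake_a = rng.standard_normal((4, 1024))
fake_b = rng.standard_normal((4, 1024))
loss = energy_distance_loss(real, fake_a, fake_b)
```

The first term pulls generated samples toward the real audio, while the second rewards diversity between independent model samples, and no discriminator network is involved. In an actual training setup the same computation would be written in an autodiff framework so that gradients can flow back to the generator.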
