
End-to-End Adversarial Text-to-Speech


Document pages: 23 pages

Abstract: Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme based on token length prediction. It learns to produce high-fidelity audio through a combination of adversarial feedback and prediction losses constraining the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic time warping in the spectrogram-based prediction loss. The resulting model achieves a mean opinion score exceeding 4 on a 5-point scale, which is comparable to the state-of-the-art models relying on multi-stage training and additional supervision.
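
To illustrate the idea behind the soft dynamic time warping term mentioned in the abstract, the following is a minimal sketch of the generic soft-DTW recursion, in which the hard minimum over alignment paths is replaced by a smooth minimum. It is not the paper's implementation: the function names `soft_min` and `soft_dtw`, the squared-Euclidean frame cost, and the temperature parameter `gamma` are assumptions made for this example, and the exact loss used in the paper may differ.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    smallest = min(finite)
    total = sum(math.exp(-(v - smallest) / gamma) for v in finite)
    return smallest - gamma * math.log(total)


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW cost between two sequences of feature frames.

    x and y are lists of equal-dimension frames (for example, spectrogram
    frames). The frame cost is the squared Euclidean distance, which is an
    assumption of this sketch.
    """
    n, m = len(x), len(y)
    # acc[i][j] holds the soft-minimal cost of aligning x[:i] with y[:j].
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum((a - b) ** 2 for a, b in zip(x[i - 1], y[j - 1]))
            acc[i][j] = cost + soft_min(
                (acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1]), gamma
            )
    return acc[n][m]


# Two toy one-dimensional "spectrograms" that differ only in timing.
generated = [[0.0], [0.0], [1.0], [2.0]]
target = [[0.0], [1.0], [2.0], [2.0]]
framewise_error = sum((a[0] - b[0]) ** 2 for a, b in zip(generated, target))
warped_error = soft_dtw(generated, target, gamma=0.01)
```

In this toy case the frame-by-frame squared error penalises the timing difference, while the soft-DTW cost stays close to zero because the two sequences can be aligned by warping. Because every step is built from smooth operations, the same recursion can in principle be differentiated with respect to the generated frames.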
