Expressive TTS Training with Frame and Style Reconstruction Loss

  • king
  • 20210507

Document pages: 13 pages

Abstract: We propose a novel training strategy for a Tacotron-based text-to-speech (TTS) system to improve the expressiveness of speech. One of the key challenges in prosody modeling is the lack of reference, which makes explicit modeling difficult. The proposed technique doesn't require prosody annotations from training data. It doesn't attempt to model prosody explicitly either, but rather encodes the association between input text and its prosody styles using a Tacotron-based TTS framework. Our proposed idea marks a departure from the style token paradigm, where prosody is explicitly modeled by a bank of prosody embeddings. The proposed training strategy adopts a combination of two objective functions: 1) a frame-level reconstruction loss, which is calculated between the synthesized and target spectral features; 2) an utterance-level style reconstruction loss, which is calculated between the deep style features of synthesized and target speech. The proposed style reconstruction loss is formulated as a perceptual loss to ensure that utterance-level speech style is taken into consideration during training. Experiments show that the proposed training strategy achieves remarkable performance and outperforms a state-of-the-art baseline in both naturalness and expressiveness. To the best of our knowledge, this is the first study to incorporate utterance-level perceptual quality as a loss function into Tacotron training for improved expressiveness.
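
The combination of the two objectives described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the use of mean squared error as the distance for both terms, the weighted sum with a `style_weight` parameter, and the `mean_pool` function, which is only a hypothetical stand-in for a pretrained network that extracts deep style features. The function names are likewise invented.

```python
from typing import Callable, List

Frames = List[List[float]]  # one utterance: a list of spectral frames


def frame_loss(synth: Frames, target: Frames) -> float:
    """Frame-level reconstruction loss.

    Assumption: mean squared error over every bin of every frame.
    """
    if not synth or len(synth) != len(target):
        raise ValueError("utterances must be non-empty and equally long")
    total, count = 0.0, 0
    for s_frame, t_frame in zip(synth, target):
        if len(s_frame) != len(t_frame):
            raise ValueError("frame dimensions differ")
        for s, t in zip(s_frame, t_frame):
            total += (s - t) ** 2
            count += 1
    return total / count


def mean_pool(frames: Frames) -> List[float]:
    """Hypothetical style extractor: averages frames over time.

    A real system would use deep features from a pretrained network here.
    """
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def style_loss(
    synth: Frames,
    target: Frames,
    style_extractor: Callable[[Frames], List[float]],
) -> float:
    """Utterance-level style loss between two style feature vectors.

    Assumption: mean squared error between the extracted vectors.
    """
    s_vec = style_extractor(synth)
    t_vec = style_extractor(target)
    return sum((a - b) ** 2 for a, b in zip(s_vec, t_vec)) / len(s_vec)


def total_loss(
    synth: Frames,
    target: Frames,
    style_extractor: Callable[[Frames], List[float]] = mean_pool,
    style_weight: float = 1.0,
) -> float:
    """Weighted sum of the frame-level and utterance-level terms (assumed form)."""
    return frame_loss(synth, target) + style_weight * style_loss(
        synth, target, style_extractor
    )


if __name__ == "__main__":
    synthesized = [[1.0, 2.0], [3.0, 4.0]]
    reference = [[1.0, 2.0], [3.0, 6.0]]
    print(total_loss(synthesized, reference))
```

The frame-level term compares the utterances bin by bin, while the style term compares a single vector per utterance, so the second term reacts to utterance-wide differences that per-frame averaging can dilute. In this sketch the style extractor is passed in as a function, so a learned feature extractor could replace `mean_pool` without changing the rest of the code.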
