
Neural text-to-speech with a modeling-by-generation excitation vocoder

  • king
  • 20210506

Document pages: 5 pages

Abstract: This paper proposes a modeling-by-generation (MbG) excitation vocoder for a neural text-to-speech (TTS) system. Recently proposed neural excitation vocoders can realize qualified waveform generation by combining a vocal tract filter with a WaveNet-based glottal excitation generator. However, when these vocoders are used in a TTS system, the quality of synthesized speech is often degraded owing to a mismatch between training and synthesis steps. Specifically, the vocoder is separately trained from an acoustic model front-end. Therefore, estimation errors of the acoustic model are inevitably boosted throughout the synthesis process of the vocoder back-end. To address this problem, we propose to incorporate an MbG structure into the vocoder's training process. In the proposed method, the excitation signal is extracted by the acoustic model's generated spectral parameters, and the neural vocoder is then optimized not only to learn the target excitation's distribution but also to compensate for the estimation errors occurring from the acoustic model. Furthermore, as the generated spectral parameters are shared in the training and synthesis steps, their mismatch conditions can be reduced effectively. The experimental results verify that the proposed system provides high-quality synthetic speech by achieving a mean opinion score of 4.57 within the TTS framework.
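
The training/synthesis mismatch described in the abstract can be illustrated with a toy linear-prediction example. This is only a sketch under strong assumptions, not the paper's implementation: the vocal tract filter is reduced to a fixed second-order all-pole filter, the neural excitation generator is idealized as reproducing its training target exactly, and the names `a_true`, `a_gen`, `inverse_filter`, and `synthesis_filter` are hypothetical. `a_true` stands in for spectral parameters analyzed from natural speech, and `a_gen` for the acoustic model's (slightly wrong) generated parameters.

```python
import random

def inverse_filter(speech, a):
    """Extract excitation: e[n] = s[n] + sum_k a[k] * s[n-k-1]."""
    exc = []
    for n in range(len(speech)):
        acc = speech[n]
        for k in range(len(a)):
            if n - k - 1 >= 0:
                acc += a[k] * speech[n - k - 1]
        exc.append(acc)
    return exc

def synthesis_filter(exc, a):
    """All-pole synthesis: s[n] = e[n] - sum_k a[k] * s[n-k-1]."""
    out = []
    for n in range(len(exc)):
        acc = exc[n]
        for k in range(len(a)):
            if n - k - 1 >= 0:
                acc -= a[k] * out[n - k - 1]
        out.append(acc)
    return out

def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)

random.seed(0)

# Hypothetical filter coefficients (both stable): "natural" vs. "generated".
a_true = [-0.9, 0.20]   # assumed parameters analyzed from natural speech
a_gen = [-0.8, 0.15]    # assumed acoustic-model output with estimation error

# Toy "speech": white noise shaped by the true filter.
noise = [random.gauss(0.0, 1.0) for _ in range(2000)]
speech = synthesis_filter(noise, a_true)

# Conventional setup: excitation target extracted with the true parameters,
# but synthesis has to use the generated parameters.
exc_conv = inverse_filter(speech, a_true)
recon_conv = synthesis_filter(exc_conv, a_gen)

# MbG-style setup: excitation target extracted with the generated parameters,
# and the same generated parameters are used again at synthesis.
exc_mbg = inverse_filter(speech, a_gen)
recon_mbg = synthesis_filter(exc_mbg, a_gen)

err_conv = mse(recon_conv, speech)
err_mbg = mse(recon_mbg, speech)
print("conventional reconstruction error:", err_conv)
print("MbG-style reconstruction error:   ", err_mbg)
```

In this toy setting, the MbG-style target absorbs the acoustic model's error into the excitation signal, so filtering it with the same generated parameters recovers the speech up to rounding, whereas the conventional pairing leaves a residual error. A real system would use frame-wise parameters and a learned excitation model, so the cancellation would be approximate rather than exact.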
