Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions

Document pages: 5 pages

Abstract: Recent advancements in deep learning have led to human-level performance in single-speaker speech synthesis. However, there are still limitations in terms of speech quality when generalizing those systems to multiple-speaker models, especially for unseen speakers and unseen recording qualities. For instance, conventional neural vocoders are adjusted to the training speaker and have poor generalization capabilities to unseen speakers. In this work, we propose a variant of WaveRNN, referred to as speaker conditional WaveRNN (SC-WaveRNN). We target the development of an efficient universal vocoder, even for unseen speakers and recording conditions. In contrast to standard WaveRNN, SC-WaveRNN exploits additional information given in the form of speaker embeddings. Using publicly available data for training, SC-WaveRNN achieves significantly better performance than the baseline WaveRNN on both subjective and objective metrics. In MOS, SC-WaveRNN achieves an improvement of about 23% for seen speaker and seen recording condition and up to 95% for unseen speaker and unseen condition. Finally, we extend our work by implementing a multi-speaker text-to-speech (TTS) synthesis similar to zero-shot speaker adaptation. In terms of performance, our system has been preferred over the baseline TTS system by 60% over 15.5% and by 60.9% over 32.6%, for seen and unseen speakers, respectively.
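
The abstract says only that SC-WaveRNN receives speaker embeddings as additional information; it does not describe how they enter the network. As a rough illustration of one common way to condition a vocoder on a speaker, the sketch below tiles a single utterance-level embedding across time and appends it to each acoustic frame. The function name condition_on_speaker, the choice of concatenation, and the sizes (80-dimensional acoustic frames, a 256-dimensional embedding) are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def condition_on_speaker(frames: np.ndarray, speaker_embedding: np.ndarray) -> np.ndarray:
    """Append one utterance-level speaker embedding to every acoustic frame.

    Hypothetical helper: concatenation is an assumed conditioning scheme,
    not necessarily the one used by SC-WaveRNN.
    """
    if frames.ndim != 2 or speaker_embedding.ndim != 1:
        raise ValueError("expected frames of shape (T, F) and an embedding of shape (E,)")
    n_frames = frames.shape[0]
    # Repeat the same embedding for every time step: (E,) -> (T, E).
    tiled = np.broadcast_to(speaker_embedding, (n_frames, speaker_embedding.shape[0]))
    # Each conditioned frame now carries acoustic features plus speaker identity.
    return np.concatenate([frames, tiled], axis=1)


# Assumed sizes, for illustration only.
frames = np.zeros((100, 80), dtype=np.float32)   # 100 frames of 80 acoustic features
embedding = np.ones(256, dtype=np.float32)       # one embedding for the whole utterance
conditioned = condition_on_speaker(frames, embedding)
print(conditioned.shape)  # (100, 336)
```

Because the embedding is computed per utterance rather than learned per training speaker, a scheme of this kind can in principle be applied to speakers that were not in the training set, which is the setting the abstract targets.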