Audiovisual Speech Synthesis using Tacotron2

  • king
  • 20210506

Document pages: 18 pages

Abstract: Audiovisual speech synthesis is the problem of synthesizing a talking face while maximizing the coherency of the acoustic and visual speech. In this paper, we propose and compare two audiovisual speech synthesis systems for 3D face models. The first system is the AVTacotron2, which is an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a sequence of phonemes representing the sentence to synthesize into a sequence of acoustic features and the corresponding controllers of a face model. The output acoustic features are used to condition a WaveRNN to reconstruct the speech waveform, and the output facial controllers are used to generate the corresponding video of the talking face. The second audiovisual speech synthesis system is modular, where acoustic speech is synthesized from text using the traditional Tacotron2. The reconstructed acoustic speech signal is then used to drive the facial controls of the face model using an independently trained audio-to-facial-animation neural network. We further condition both the end-to-end and modular approaches on emotion embeddings that encode the required prosody to generate emotional audiovisual speech. We analyze the performance of the two systems and compare them to the ground truth videos using subjective evaluation tests. The end-to-end and modular systems are able to synthesize close to human-like audiovisual speech with mean opinion scores (MOS) of 4.1 and 3.9, respectively, compared to a MOS of 4.1 for the ground truth generated from professionally recorded videos. While the end-to-end system gives a better overall quality, the modular approach is more flexible and the quality of acoustic speech and visual speech synthesis is almost independent of each other.
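The difference between the two systems described in the abstract is mainly one of data flow, which the following Python sketch tries to make concrete. It is not the authors' implementation: every function is a hypothetical stub that returns zero-filled placeholders, and the constants (80 acoustic features per frame, 50 face controllers, 5 frames per phoneme, 256 samples per frame) are assumptions chosen only so that the shapes line up. The use of a vocoder stage in the modular path is likewise an assumption, since the abstract only names WaveRNN for the end-to-end system.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# All sizes below are illustrative assumptions, not values from the paper.
N_ACOUSTIC = 80          # acoustic features per frame
N_CONTROLLERS = 50       # face-model controllers per frame
FRAMES_PER_PHONEME = 5   # fixed duration used only by these stubs
SAMPLES_PER_FRAME = 256  # waveform samples produced per acoustic frame

Frames = List[List[float]]


@dataclass
class AVOutput:
    waveform: List[float]
    face_controllers: Frames


def _zeros(n_frames: int, width: int) -> Frames:
    return [[0.0] * width for _ in range(n_frames)]


def av_tacotron2_stub(phonemes: Sequence[str],
                      emotion: Optional[Sequence[float]] = None
                      ) -> Tuple[Frames, Frames]:
    """Hypothetical stand-in: one model emits both streams, frame-aligned."""
    n = len(phonemes) * FRAMES_PER_PHONEME
    return _zeros(n, N_ACOUSTIC), _zeros(n, N_CONTROLLERS)


def tacotron2_stub(phonemes: Sequence[str],
                   emotion: Optional[Sequence[float]] = None) -> Frames:
    """Hypothetical stand-in: acoustic features only."""
    return _zeros(len(phonemes) * FRAMES_PER_PHONEME, N_ACOUSTIC)


def vocoder_stub(acoustic: Frames) -> List[float]:
    """Hypothetical stand-in for a neural vocoder such as WaveRNN."""
    return [0.0] * (len(acoustic) * SAMPLES_PER_FRAME)


def audio_to_face_stub(waveform: List[float],
                       emotion: Optional[Sequence[float]] = None) -> Frames:
    """Hypothetical stand-in for the audio-to-facial-animation network."""
    return _zeros(len(waveform) // SAMPLES_PER_FRAME, N_CONTROLLERS)


def synthesize_end_to_end(phonemes, emotion=None) -> AVOutput:
    # Text -> (acoustic features, face controllers) in a single model.
    acoustic, face = av_tacotron2_stub(phonemes, emotion)
    return AVOutput(vocoder_stub(acoustic), face)


def synthesize_modular(phonemes, emotion=None) -> AVOutput:
    # Text -> acoustic features -> waveform -> face controllers.
    waveform = vocoder_stub(tacotron2_stub(phonemes, emotion))
    return AVOutput(waveform, audio_to_face_stub(waveform, emotion))
```

In the modular path the face controllers depend only on the waveform, so either stage could be swapped out without retraining the other, which is the flexibility the abstract refers to.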
