Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

Abstract: Recent advances in deep learning methods have elevated synthetic speech quality to human level, and the field is now moving towards addressing prosodic variation in synthetic speech. Despite successes in this effort, state-of-the-art systems fall short of faithfully reproducing local prosodic events that give rise to, e.g., word-level emphasis and phrasal structure. This type of prosodic variation often reflects long-distance semantic relationships that are not accessible to end-to-end systems with a single sentence as their synthesis domain. One possible solution is to condition the synthesized speech on explicit prosodic labels, potentially generated using longer portions of text. In this work we evaluate whether augmenting the textual input with such prosodic labels, capturing word-level prominence and phrasal boundary strength, can result in more accurate realization of sentence prosody. We use an automatic wavelet-based technique to extract such labels from speech material, and use them as an input to a Tacotron-like synthesis system alongside textual information. The results of objective evaluation of synthesized speech show that using the prosodic labels significantly improves the output in terms of faithfulness of f0 and energy contours, in comparison with state-of-the-art implementations.
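
As an illustration of what augmenting the textual input with prosodic labels could look like, the following minimal sketch quantizes continuous per-word scores into discrete levels and interleaves them with the words as tags. The tag format (`<pN>`, `<bN>`), the function names, the thresholds, and the example scores are all assumptions made for illustration; the abstract does not specify the label format or how labels are fed to the synthesis system.

```python
from typing import List, Sequence


def quantize(values: Sequence[float], thresholds: Sequence[float]) -> List[int]:
    """Map continuous scores to discrete levels.

    The level of a value is the number of thresholds it reaches.
    The thresholds are hypothetical, not taken from the paper.
    """
    return [sum(v >= t for t in thresholds) for v in values]


def augment_with_prosody(
    words: Sequence[str],
    prominence: Sequence[int],
    boundary: Sequence[int],
) -> List[str]:
    """Interleave words with discrete prominence and boundary tags.

    Assumed format: each word is preceded by <pN> (its prominence level)
    and followed by <bN> (strength of the boundary after the word).
    """
    if not (len(words) == len(prominence) == len(boundary)):
        raise ValueError("words, prominence and boundary must have equal length")
    tokens: List[str] = []
    for word, p, b in zip(words, prominence, boundary):
        tokens.extend([f"<p{p}>", word, f"<b{b}>"])
    return tokens


# Made-up example scores for a three-word utterance.
words = ["she", "said", "no"]
prominence_levels = quantize([0.1, 0.6, 1.8], thresholds=[0.5, 1.5])
boundary_levels = quantize([0.0, 0.2, 2.0], thresholds=[0.5, 1.5])
augmented = augment_with_prosody(words, prominence_levels, boundary_levels)
print(" ".join(augmented))
```

In a sketch like this, the tags would be added to the input symbol inventory of the sequence-to-sequence model, so that the encoder sees prominence and boundary information next to the words they describe.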

