Notic My Speech -- Blending Speech Patterns With Multimedia

Abstract: Speech as a natural signal is composed of three parts: visemes (the visual part of speech), phonemes (the spoken part of speech), and language (the imposed structure). However, video as a medium for the delivery of speech and as a multimedia construct has mostly ignored the cognitive aspects of speech delivery. For example, video applications such as transcoding and compression have so far ignored how speech is delivered and heard. To close the gap between speech understanding and multimedia video applications, in this paper we present initial experiments that model the perception of visual speech and demonstrate its use in video compression. Meanwhile, in the visual speech recognition domain, existing studies have mostly modeled the task as a classification problem while ignoring the correlations between views, phonemes, visemes, and speech perception. This results in solutions that are further away from how human perception works. To bridge this gap, we propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding. We conduct experiments on three public visual speech recognition datasets. The experimental results show that our proposed method outperforms existing work by 4.99 in terms of the viseme error rate. Moreover, we show that there is a strong correlation between our model's understanding of multi-view speech and human perception. This characteristic benefits downstream applications such as video compression and streaming, where a significant number of less important frames can be compressed or eliminated while maximally preserving human speech understanding with a good user experience.

