
Multimodal Integration for Large-Vocabulary Audio-Visual Speech Recognition

  • king
  • 20210505


Document pages: 5 pages

Abstract: For many small- and medium-vocabulary tasks, audio-visual speech recognition can significantly improve the recognition rates compared to audio-only systems. However, there is still an ongoing debate regarding the best combination strategy for multi-modal information, which should allow for the translation of these gains to large-vocabulary recognition. While an integration at the level of state-posterior probabilities, using dynamic stream weighting, is almost universally helpful for small-vocabulary systems, in large-vocabulary speech recognition, the recognition accuracy remains difficult to improve. In the following, we specifically consider the large-vocabulary task of the LRS2 database, and we investigate a broad range of integration strategies, comparing early integration and end-to-end learning with many versions of hybrid recognition and dynamic stream weighting. One aspect, which is shown to provide much benefit here, is the use of dynamic stream reliability indicators, which allow for hybrid architectures to strongly profit from the inclusion of visual information whenever the audio channel is distorted even slightly.
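
The abstract refers to integration at the level of state-posterior probabilities with dynamic stream weighting, but it gives no formula. As a rough illustration only, the sketch below uses a commonly seen log-linear formulation, in which each frame's fused log-posterior is a weighted sum of the audio and video log-posteriors and is then renormalized. The function name `fuse_log_posteriors`, the argument layout, and the choice of a single audio weight per frame are assumptions made for this example, not details taken from the paper. The weights are treated as given inputs; how they would be estimated from reliability indicators is not modeled here.

```python
import math


def fuse_log_posteriors(audio_logp, video_logp, audio_weight):
    """Fuse per-frame state log-posteriors from two streams (hypothetical helper).

    audio_logp, video_logp: list of frames, each a list of log-posteriors
        over the same set of states.
    audio_weight: one weight in [0, 1] per frame; the video stream
        receives 1 - weight (an assumption of this sketch).
    Returns a list of frames of renormalized fused log-posteriors.
    """
    if not (len(audio_logp) == len(video_logp) == len(audio_weight)):
        raise ValueError("all inputs need the same number of frames")

    fused = []
    for a, v, w in zip(audio_logp, video_logp, audio_weight):
        if len(a) != len(v):
            raise ValueError("streams must share the same state set")
        if not 0.0 <= w <= 1.0:
            raise ValueError("stream weight must lie in [0, 1]")
        # Log-linear combination of the two streams for this frame.
        scores = [w * la + (1.0 - w) * lv for la, lv in zip(a, v)]
        # Renormalize with a numerically stable log-sum-exp.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        fused.append([s - log_z for s in scores])
    return fused


# Toy data: two frames, two states. The audio weight is lowered in the
# second frame, as it might be if the audio were judged less reliable.
audio = [[math.log(0.9), math.log(0.1)], [math.log(0.5), math.log(0.5)]]
video = [[math.log(0.2), math.log(0.8)], [math.log(0.1), math.log(0.9)]]
result = fuse_log_posteriors(audio, video, [1.0, 0.2])
```

With a weight of 1.0 the first frame reduces to the audio posterior, while the second frame leans toward the video stream. Making the weight vary per frame is what makes the weighting dynamic in this sketch.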
