Weakly Supervised Construction of ASR Systems with Massive Video Data

  • king
  • 20210506

Document pages: 5 pages

Abstract: Building Automatic Speech Recognition (ASR) systems from scratch is challenging, mostly because annotating a large amount of audio data with transcripts is time-consuming and expensive. Although several unsupervised pre-training models have been proposed, applying such models directly might still be sub-optimal if more labeled training data could be obtained at low cost. In this paper, we present a weakly supervised framework for constructing ASR systems with massive video data. As videos often contain human speech aligned with subtitles, we consider videos an important knowledge source, and propose an effective approach based on Optical Character Recognition (OCR) to extract high-quality audio aligned with transcripts from videos. After weakly supervised pre-training, the underlying ASR model can be fine-tuned to fit any domain-specific target training dataset. Extensive experiments show that our framework can easily produce state-of-the-art results on six public datasets for Mandarin speech recognition.
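As a rough illustration of the idea of pairing audio with subtitle text, the sketch below groups per-frame OCR results into timed segments that could serve as weak transcripts. It is not the paper's pipeline: the `Segment` class, the `group_subtitle_frames` function, the input format (timestamp plus recognized subtitle text per sampled frame), and the `min_duration` filter are all assumptions made for this example, and the OCR step itself is assumed to have been run already.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # subtitle text used as a weak transcript


def group_subtitle_frames(
    frames: Iterable[Tuple[float, Optional[str]]],
    frame_step: float,
    min_duration: float = 0.5,
) -> List[Segment]:
    """Merge consecutive frames showing the same subtitle into one segment.

    `frames` is a hypothetical input: (timestamp, OCR text) pairs for frames
    sampled every `frame_step` seconds, with None or "" when no subtitle
    was detected.
    """
    segments: List[Segment] = []
    current: Optional[Segment] = None
    for timestamp, text in frames:
        text = (text or "").strip()
        if current is not None and text == current.text:
            # Same subtitle is still on screen: extend the open segment.
            current.end = timestamp + frame_step
            continue
        if current is not None:
            # Subtitle changed or disappeared: close the open segment.
            segments.append(current)
            current = None
        if text:
            current = Segment(timestamp, timestamp + frame_step, text)
    if current is not None:
        segments.append(current)
    # Drop very short segments, which are more likely to be OCR noise.
    return [s for s in segments if s.end - s.start >= min_duration]
```

Each returned segment gives a time span that could be cut from the video's audio track and paired with its text. A real system would also need to handle OCR errors, subtitles that do not match the speech, and timing offsets, none of which this sketch addresses.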