Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech

  • king
  • 20210506

Document pages: 5 pages

Abstract: In this work, we explore a multimodal semi-supervised learning approach for punctuation prediction by learning representations from large amounts of unlabelled audio and text data. Conventional approaches in speech processing typically use forced alignment to encode per-frame acoustic features into word-level features, and then perform multimodal fusion of the resulting acoustic and lexical representations. As an alternative, we explore attention-based multimodal fusion and compare its performance with forced-alignment-based fusion. Experiments conducted on the Fisher corpus show that our proposed approach achieves ~6-9 and ~3-4 absolute improvement (F1 score) over the baseline BLSTM model on reference transcripts and ASR outputs, respectively. We further improve the model's robustness to ASR errors by performing data augmentation with N-best lists, which achieves up to an additional ~2-6 improvement on ASR outputs. We also demonstrate the effectiveness of the semi-supervised learning approach through an ablation study on various sizes of the corpus. When trained on 1 hour of speech and text data, the proposed model achieved ~9-18 absolute improvement over the baseline model.
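
To make the contrast between the two fusion strategies concrete, the following is a minimal sketch and not the authors' implementation. It assumes word spans are given as (start, end) frame indices from some aligner, that word queries have already been projected to the acoustic feature dimension, and that fusion is a plain concatenation. The function names `align_pool`, `attention_pool`, and `fuse` are hypothetical.

```python
import math

def align_pool(frames, word_spans):
    """Forced-alignment style: average the frames inside each word's span.

    word_spans holds (start, end) frame indices per word, end exclusive.
    (Assumption: mean pooling; the abstract does not state the pooling used.)
    """
    pooled = []
    for start, end in word_spans:
        span = frames[start:end]
        dim = len(span[0])
        pooled.append([sum(f[d] for f in span) / len(span) for d in range(dim)])
    return pooled

def attention_pool(frames, word_queries):
    """Attention style: each word attends over all frames, no alignment needed.

    Uses scaled dot-product attention with a softmax over frames.
    (Assumption: queries share the acoustic feature dimension.)
    """
    dim = len(frames[0])
    pooled = []
    for q in word_queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(dim)
                  for f in frames]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        pooled.append([sum(w * f[d] for w, f in zip(weights, frames))
                       for d in range(dim)])
    return pooled

def fuse(lexical, acoustic):
    """Concatenate each word's lexical vector with its acoustic vector."""
    return [lex + ac for lex, ac in zip(lexical, acoustic)]

# Toy data: four 2-dimensional frames covering two words.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
word_spans = [(0, 2), (2, 4)]
word_queries = [[1.0, 0.0], [0.0, 1.0]]

aligned = align_pool(frames, word_spans)
attended = attention_pool(frames, word_queries)
fused = fuse(word_queries, attended)
```

In this sketch the alignment variant needs word boundaries from an external aligner, while the attention variant learns (here, computes) a soft weighting over every frame for each word, which is the trade-off the abstract compares.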
