Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition

  • king
  • 20210506

Document pages: 5 pages

Abstract: The Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR), and requires significantly less training time than RNN-based models. The original Transformer, with its encoder-decoder architecture, is only suitable for offline ASR. It relies on an attention mechanism to learn alignments, and encodes input audio bidirectionally. The high computation cost of Transformer decoding also limits its use in production streaming systems. To make the Transformer suitable for streaming ASR, we explore the Transducer framework as a streamable way to learn alignments. For audio encoding, we apply a unidirectional Transformer with interleaved convolution layers. The interleaved convolution layers are used for modeling future context, which is important to performance. To reduce computation cost, we gradually downsample the acoustic input, also with the interleaved convolution layers. Moreover, we limit the length of history context in self-attention to maintain a constant computation cost for each decoding step. We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on the LibriSpeech dataset (3.6% WER on test-clean) without external language models. The performance is comparable to previously published streamable Transformer Transducer and strong hybrid streaming ASR systems, and is achieved with a smaller look-ahead window (140 ms), fewer parameters, and a lower frame rate.
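
The idea of limiting history context in self-attention can be illustrated with a small attention mask. The following is a minimal sketch, not the authors' implementation: the function names, the window size `left_context=2`, and the omission of learned query/key/value projections are all assumptions made for illustration, since the abstract does not state these details. It only shows unidirectional attention with a bounded history window; the future context that the paper models with convolution layers is not represented here.

```python
import numpy as np


def limited_history_mask(num_frames: int, left_context: int) -> np.ndarray:
    """Boolean mask: frame t may attend to frames t-left_context .. t only.

    `left_context` is a hypothetical parameter name; the abstract gives no value.
    """
    idx = np.arange(num_frames)
    offset = idx[:, None] - idx[None, :]  # query index minus key index
    return (offset >= 0) & (offset <= left_context)


def masked_self_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head dot-product self-attention with masking.

    Simplification (assumption): x is used directly as query, key, and value.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)      # block disallowed positions
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


mask = limited_history_mask(num_frames=6, left_context=2)
rng = np.random.default_rng(0)
frames = rng.standard_normal((6, 4))
output = masked_self_attention(frames, mask)
```

Each row of `mask` has at most `left_context + 1` allowed positions, so the work per output frame stays bounded however long the utterance grows, which is the property the abstract describes as constant computation cost per decoding step.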
