Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering

  • 20210506

Document pages: 5 pages

Abstract: We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass that are used to re-score candidate segments obtained from the first pass. Our baseline is an acoustic model (AM) with BiLSTM layers, trained by minimizing the CTC loss. We replace the BiLSTM layers with self-attention layers. Results on internal evaluation sets show that self-attention networks yield better accuracy while requiring fewer parameters. We add an auto-regressive decoder network on top of the self-attention layers and jointly minimize the CTC loss on the encoder and the cross-entropy loss on the decoder. This design yields further improvements over the baseline. We retrain all the models above in a multi-task learning (MTL) setting, where one branch of a shared network is trained as an AM, while the second branch classifies the whole sequence as true trigger or not. Results demonstrate that networks with self-attention layers yield a $\sim$60% relative reduction in false reject rates for a given false-alarm rate, while requiring 10% fewer parameters. When trained in the MTL setup, self-attention networks yield further accuracy improvements. On-device measurements show a 70% relative reduction in inference time. Additionally, the proposed network architectures are $\sim$5X faster to train.
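
As a sketch, the joint training described in the abstract can be written in the form commonly used for hybrid CTC/attention models. The interpolation weight $\lambda$ and all symbols below are assumptions for illustration. The abstract does not say how the two losses are weighted.

$$
\mathcal{L}_{\text{joint}} = \lambda \, \mathcal{L}_{\text{CTC}}(\mathbf{y} \mid \mathbf{h}) + (1 - \lambda) \, \mathcal{L}_{\text{CE}}(\mathbf{y} \mid \mathbf{h}), \qquad 0 \le \lambda \le 1
$$

Here $\mathbf{h}$ stands for the output of the self-attention encoder and $\mathbf{y}$ for the target label sequence. $\mathcal{L}_{\text{CTC}}$ is computed on the encoder output, and $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss of the auto-regressive decoder.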
