
Self-attention encoding and pooling for speaker recognition


Document pages: 5

Abstract: The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory, and energy consumption. These limitations motivate researchers to design more efficient deep models. On the other hand, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capabilities and strong performance on a variety of Natural Language Processing (NLP) applications. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. SAEP is a stack of identical blocks that relies solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification. We have evaluated this approach on both the VoxCeleb1 & 2 datasets. The proposed architecture outperforms the baseline x-vector and shows performance competitive with other convolution-based benchmarks, with a significant reduction in model size. It employs 94%, 95%, and 73% fewer parameters than ResNet-34, ResNet-50, and x-vector, respectively. This indicates that the proposed fully attention-based architecture is more efficient in extracting time-invariant features from speaker utterances.
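The core idea of the abstract, self-attention over a variable-length frame sequence followed by pooling into a fixed-size speaker embedding, can be sketched as follows. This is a minimal illustrative sketch, not the paper's SAEP implementation: it uses a single attention head with identity Q/K/V projections, omits the position-wise feed-forward sub-layer and block stacking, and substitutes plain mean pooling; all function names are illustrative.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matmul(A, B):
    # Plain list-of-lists matrix product
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def self_attention(X):
    # Scaled dot-product self-attention with identity projections
    # (illustrative assumption): scores = X X^T / sqrt(d),
    # output = softmax(scores) X, keeping one row per input frame.
    d = len(X[0])
    Xt = [list(col) for col in zip(*X)]            # transpose of X
    scores = matmul(X, Xt)
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = [softmax(row) for row in scores]
    return matmul(weights, X)

def mean_pool(H):
    # Collapse the time axis so utterances of any length
    # yield an embedding of the same dimension.
    T = len(H)
    return [sum(col) / T for col in zip(*H)]

# Variable-length input: 3 frames of 2-dim spectral features
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emb = mean_pool(self_attention(X))
print(len(emb))  # fixed embedding dimension (2), independent of frame count
```

Because pooling averages over the time axis, the embedding dimension depends only on the feature dimension, which is what lets the model handle non-fixed-length utterances.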
