Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations

  • king
  • 20210507

Document pages: 5 pages

Abstract: In this study, we propose global context guided channel and time-frequency transformations to model the long-range, non-local time-frequency dependencies and channel variances in speaker representations. We use the global context information to enhance important channels and recalibrate salient time-frequency locations by computing the similarity between the global context and local features. The proposed modules, together with a popular ResNet-based model, are evaluated on the VoxCeleb1 dataset, a large-scale speaker verification corpus collected in the wild. This lightweight block can be easily incorporated into a CNN model with little additional computational cost, and it improves speaker verification performance by a large margin over both the baseline ResNet-LDE model and the Squeeze&Excitation block. Detailed ablation studies are also performed to analyze various factors that may impact the performance of the proposed modules. We find that by employing the proposed L2-tf-GTFC transformation block, the Equal Error Rate decreases from 4.56% to 3.07%, a relative 32.68% reduction, and the DCF score improves by a relative 27.28%. The results indicate that our proposed global context guided transformation modules can efficiently improve the learned speaker representations by achieving time-frequency and channel-wise feature recalibration.
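
The recalibration idea in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation. It assumes the global context is a plain average over all time-frequency positions, that the time-frequency gate is a sigmoid of the cosine similarity between each L2-normalized local feature vector and the context, and that the channel gate is a sigmoid of a per-channel affine map. The function names and the `scale` and `bias` parameters are hypothetical stand-ins for learned weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def global_context(feats):
    """Average each channel over all (time, frequency) positions.

    feats is a nested list indexed as feats[channel][time][freq].
    Assumption: plain average pooling stands in for the paper's context.
    """
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feats]


def channel_gates(context, scale, bias):
    """Hypothetical channel gate: sigmoid of a per-channel affine map."""
    return [sigmoid(s * g + b) for g, s, b in zip(context, scale, bias)]


def tf_gates(feats, context):
    """Gate per (time, freq): sigmoid of cosine similarity to the context."""
    n_t, n_f = len(feats[0]), len(feats[0][0])
    ctx = l2_normalize(context)
    gates = []
    for t in range(n_t):
        row = []
        for f in range(n_f):
            local = l2_normalize([ch[t][f] for ch in feats])
            row.append(sigmoid(sum(a * b for a, b in zip(local, ctx))))
        gates.append(row)
    return gates


def recalibrate(feats, scale, bias):
    """Apply both gates; the output has the same shape as feats."""
    ctx = global_context(feats)
    c_gate = channel_gates(ctx, scale, bias)
    t_gate = tf_gates(feats, ctx)
    return [
        [[ch[t][f] * c_gate[c] * t_gate[t][f] for f in range(len(ch[0]))]
         for t in range(len(ch))]
        for c, ch in enumerate(feats)
    ]


# Toy input: 2 channels, 2 time steps, 2 frequency bins.
feats = [
    [[1.0, -1.0], [1.0, 1.0]],
    [[1.0, -1.0], [1.0, 1.0]],
]
gates = tf_gates(feats, global_context(feats))
out = recalibrate(feats, scale=[1.0, 1.0], bias=[0.0, 0.0])
```

In this toy input, the position whose local vector points the same way as the global context receives a larger gate than the position pointing the opposite way, which is the sense in which similarity to the context marks a location as salient.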
