
Multi-task Regularization Based on Infrequent Classes for Audio Captioning

  • king
  • 20210505


Document pages: 5 pages

Abstract: Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions: some words are very frequent but acoustically non-informative, i.e. the function words (e.g. "a", "the"), and other words are infrequent but informative, i.e. the content words (e.g. adjectives, nouns). In this paper we propose two methods to mitigate this class imbalance problem. First, in an autoencoder setting for audio captioning, we weight each word's contribution to the training loss inversely proportionally to its number of occurrences in the whole dataset. Second, in addition to the multi-class, word-level audio captioning task, we define a multi-label side task based on clip-level content word detection by training a separate decoder. We use the loss from the second task to regularize the jointly trained encoder for the audio captioning task. We evaluate our method using Clotho, a recently published, wide-scale audio captioning dataset, and our results show a 37% relative improvement in the SPIDEr metric over the baseline method.
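
The two ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says the weights are "inversely proportional" to word counts, so the normalisation used here (rarest word gets weight 1.0), the function names, the toy captions, and the `side_weight` mixing factor are all assumptions made for illustration.

```python
import math
from collections import Counter


def inverse_frequency_weights(captions):
    """Return a weight per word that is inversely proportional to its count.

    Assumption: weights are scaled so the rarest word has weight 1.0.
    """
    counts = Counter(word for caption in captions for word in caption)
    min_count = min(counts.values())
    return {word: min_count / n for word, n in counts.items()}


def weighted_caption_loss(target_words, predicted_probs, weights):
    """Weighted negative log-likelihood, averaged over the words of one caption.

    predicted_probs[i] maps each vocabulary word to its probability at step i.
    """
    total = 0.0
    for word, probs in zip(target_words, predicted_probs):
        total += -weights[word] * math.log(probs[word])
    return total / len(target_words)


def content_word_loss(clip_labels, clip_probs):
    """Multi-label binary cross-entropy over a content-word vocabulary.

    clip_labels is the set of content words present in the clip's captions;
    clip_probs maps every content word to its predicted presence probability.
    """
    total = 0.0
    for word, p in clip_probs.items():
        y = 1.0 if word in clip_labels else 0.0
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(clip_probs)


def joint_loss(caption_loss, side_loss, side_weight=0.5):
    """Combine both task losses; side_weight is a hypothetical hyperparameter."""
    return caption_loss + side_weight * side_loss


# Toy data: "a" is a frequent function word, the rest are content words.
captions = [["a", "dog", "barks"], ["a", "bird", "sings"], ["a", "dog", "runs"]]
weights = inverse_frequency_weights(captions)

uniform = {"a": 0.5, "dog": 0.5}
caption_loss = weighted_caption_loss(["a", "dog"], [uniform, uniform], weights)
side_loss = content_word_loss({"dog"}, {"dog": 0.5, "bird": 0.5})
total_loss = joint_loss(caption_loss, side_loss)
```

In this toy setup the function word "a" ends up with a smaller weight than any content word, so errors on it contribute less to the captioning loss, while the side loss penalises the encoder for missing clip-level content words.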
