
Multi-modal Feature Fusion with Feature Attention for VATEX Captioning Challenge 2020


Document pages: 4 pages

Abstract: This report describes our model for the VATEX Captioning Challenge 2020. First, to gather information from multiple domains, we extract motion, appearance, semantic and audio features. Then we design a feature attention module to attend to different features when decoding. We apply two types of decoders, top-down and X-LAN, and ensemble these models to get the final result. The proposed method outperforms the official baseline by a significant margin. We achieve 76.0 CIDEr and 50.0 CIDEr on the English and Chinese private test sets, respectively. We rank 2nd on both the English and Chinese private test leaderboards.
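
The abstract does not specify how the feature attention module is built. As a rough illustration only, the sketch below shows one common way to attend over several modality features at a decoding step: additive attention conditioned on the decoder hidden state. The function name `feature_attention`, the parameters `W_h`, `W_f` and `v`, the dimensions, and the assumption that all four features share one dimension are hypothetical and not taken from the report.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def feature_attention(hidden, features, W_h, W_f, v):
    """Additive attention over modality features (illustrative assumption).

    hidden:   decoder hidden state, length d_h
    features: one vector per modality, each of length d_f
    W_h, W_f: projections into a shared attention space of size d_a
    v:        scoring vector of length d_a
    """
    h_proj = matvec(W_h, hidden)
    scores = []
    for f in features:
        f_proj = matvec(W_f, f)
        act = [math.tanh(a + b) for a, b in zip(f_proj, h_proj)]
        scores.append(sum(vi * ai for vi, ai in zip(v, act)))
    weights = softmax(scores)  # one weight per modality
    dim = len(features[0])
    # Weighted sum of the modality features.
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights

# Toy usage with random values; real features would come from pretrained extractors.
random.seed(0)
d_h, d_f, d_a = 8, 16, 12
modalities = ["motion", "appearance", "semantic", "audio"]

def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

features = [rand_vec(d_f) for _ in modalities]
hidden = rand_vec(d_h)
W_h = [rand_vec(d_h) for _ in range(d_a)]
W_f = [rand_vec(d_f) for _ in range(d_a)]
v = rand_vec(d_a)

fused, weights = feature_attention(hidden, features, W_h, W_f, v)
```

Here `weights` holds one non-negative value per modality that sums to 1, and `fused` is the vector a decoder could consume at that step. Recomputing the weights at every decoding step would let the model lean on, say, audio for one word and appearance for another.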
