Investigation on Transformer-based Multi-modal Fusion for Audio-Visual Scene-Aware Dialog

Please use this identifier to cite or link to this item: https://oar.a-star.edu.sg/communities-collections/articles/18210

Title:

Investigation on Transformer-based Multi-modal Fusion for Audio-Visual Scene-Aware Dialog

Journal Title:

DSTC10 workshop at AAAI-22

Publication URL:

https://dstc10.dstc.community/aaai-22-workshop

Authors:

Xin Huang, Hui Li Tan, Mei Chee Leong, Ying Sun, Liyuan Li, Ridong Jiang, Jung-jae Kim

Keywords:

dialogue system, Visual Question Answering, Multimodal

Publication Date:

01 March 2022

Citation:

Huang, X., Tan, H. L., Leong, M. C., Sun, Y., Li, L., Jiang, R., Kim, J. J. Investigation on Transformer-based Multi-modal Fusion for Audio-Visual Scene-Aware Dialog. DSTC10 workshop at AAAI-22. 2022

Abstract:

In this report, we present our submissions to the DSTC10 Audio Visual Scene-Aware Dialog (AVSD) challenge. We investigated variants of an encoder-decoder model, including variants with multi-modal cross-attention and variants with different fusion strategies for aggregating the multi-modal inputs (audio, visual, text, and object). Our submissions achieved competitive results in both tasks of the AVSD challenge. For the first task (video Q&A dialog), our submissions achieved BLEU4, METEOR, ROUGE, and CIDEr scores of 37.2%, 24.3%, 53.0%, and 91.2%, respectively, and a human rating of 3.57. For the second task (reasoning for Q&A), our submissions achieved IoU-1 and IoU-2 of 48.5% and 51.0%, respectively. Our submissions, under Team Anonymous, ranked first on human rating and third on automatic evaluation for the first task, and second for the second task.

License type:

Publisher Copyright

Funding Info:

This research is supported by core funding from: AME Programmatic Funding
Grant Reference no.: A18A2b0046

URI:

https://oar.a-star.edu.sg/communities-collections/articles/18210

ISBN:

W16-29

Collections:

Institute for Infocomm Research

Files uploaded:

Manuscripts in This Item:

File	Size	Format
avsd-no-copyright.pdf	252.26 KB	PDF