Investigation on Transformer-based Multi-modal Fusion for Audio-Visual Scene-Aware Dialog

Title:
Investigation on Transformer-based Multi-modal Fusion for Audio-Visual Scene-Aware Dialog
Journal Title:
DSTC10 workshop at AAAI-22
DOI:
Publication Date:
01 March 2022
Citation:
Huang, X., Tan, H. L., Leong, M. C., Sun, Y., Li, L., Jiang, R., Kim, J. J. Investigation on Transformer-based Multi-modal Fusion for Audio-Visual Scene-Aware Dialog. DSTC10 workshop at AAAI-22. 2022
Abstract:
In this report, we present our submissions to the DSTC10 Audio-Visual Scene-Aware Dialog (AVSD) challenge. We investigated variants of an encoder-decoder model, including those with multi-modal cross-attention and those with various fusion strategies to aggregate the multi-modal inputs (audio, visual, text, object). Our submissions achieved competitive results in the two tasks of the AVSD challenge. For the first task (video Q&A dialog), our submissions achieved BLEU4, METEOR, ROUGE, CIDEr, and human rating of 37.2%, 24.3%, 53.0%, 91.2%, and 3.57, respectively. For the second task (reasoning for Q&A), our submissions achieved IoU-1 and IoU-2 of 48.5% and 51.0%, respectively. Our submissions, under Team Anonymous, achieved the top rank for the human rating, the third rank for the automatic evaluation of the first task, and the second rank for the second task.
License type:
Publisher Copyright
Funding Info:
This research is supported by core funding from: AME Programmatic Funding
Grant Reference no.: A18A2b0046
Description:
ISBN:
W16-29