Huang, X., Tan, H. L., Leong, M. C., Sun, Y., Li, L., Jiang, R., Kim, J. J. Investigation on Transformer-based Multi-modal Fusion for Audio-Visual Scene-Aware Dialog. DSTC10 workshop at AAAI-22. 2022
Abstract:
In this report, we present our submissions to the DSTC10 Audio Visual Scene-Aware Dialog (AVSD) challenge. We investigated variants of an encoder-decoder model, including those with multi-modal cross-attention and those with various fusion strategies for aggregating the multi-modal inputs (audio, visual, text, object). Our submissions achieved competitive results on both tasks of the AVSD challenge. For the first task (video Q&A dialog), our submissions achieved BLEU4, METEOR, ROUGE, and CIDEr scores of 37.2%, 24.3%, 53.0%, and 91.2%, respectively, and a human rating of 3.57. For the second task (reasoning for Q&A), our submissions achieved IoU-1 and IoU-2 of 48.5% and 51.0%, respectively. Our submissions, under Team Anonymous, ranked first in human rating and third in automatic evaluation for the first task, and second for the second task.
License type:
Publisher Copyright
Funding Info:
This research is supported by core funding from: AME Programmatic Funding
Grant Reference no.: A18A2b0046