Multimodal and Joint Learning Generation Models for SIMMC 2.0

Journal Title:
The Tenth Dialog System Technology Challenge (AAAI-22)
Publication Date:
28 February 2022
@article{simmc2-team3,
  title   = {Joint Generation and Bi-Encoder for Situated Interactive MultiModal Conversations},
  author  = {Thanh-Tung Nguyen and Wei Shi and Ridong Jiang and Jung Jae Kim},
  journal = {AAAI 2022 DSTC10 Workshop},
  year    = {2022}
}
There is growing interest in building AI assistants that can communicate with humans in a multimodal conversational setting. Most existing task-oriented dialog datasets do not ground the dialog in the user's multimodal context. The Situated Interactive MultiModal Conversations datasets (SIMMC 1.0 and 2.0) were introduced recently to enable training machines that consider not only the dialog history but also the scene content. The datasets provide fully annotated dialogues in which the user and the agent see the same scene elements and the agent can act to update the scene. This year's SIMMC 2.0: Situated Interactive Multimodal Conversational AI challenge, held as the third track of the Tenth Dialog System Technology Challenge, spurred the development of many methods on the SIMMC 2.0 dataset. In this paper, we present our approaches to five subtasks: disambiguation classification, multimodal coreference resolution prediction, dialog state tracking, response generation, and response retrieval. For subtasks 1, 3, and 4, we combine all the outputs into a single string and adapt a BART-based encoder-decoder framework to make the predictions. For subtask 2, we propose a Bi-encoder and Poly-encoder model to match the visual objects with the dialogue turns. For subtask 5, we apply a BART-based framework to identify the most relevant responses among the given candidates. Our models placed second in the response retrieval subtask and performed competitively in the others. Our models' performance on both the public dev-test and the private test-std datasets shows the robustness of our approaches.
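The bi-encoder matching used for the coreference subtask can be sketched as two independent encoders whose outputs are compared in a shared embedding space via dot product. The sketch below is a minimal illustration with random linear projections standing in for the paper's transformer encoders; all dimensions and variable names are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(features, weights):
    # Toy stand-in for a learned encoder: project raw features
    # into the shared embedding space.
    return features @ weights

# Hypothetical sizes: 8-dim raw features, 4-dim shared embedding.
W_ctx = rng.normal(size=(8, 4))  # dialogue-side encoder weights
W_obj = rng.normal(size=(8, 4))  # object-side encoder weights

dialogue_ctx = rng.normal(size=(8,))     # one dialogue turn's features
scene_objects = rng.normal(size=(5, 8))  # five candidate visual objects

ctx_emb = encode(dialogue_ctx, W_ctx)
obj_embs = encode(scene_objects, W_obj)

# Bi-encoder scoring: each side is encoded independently, then
# compared by dot product, so object embeddings can be precomputed.
scores = obj_embs @ ctx_emb
best_object = int(np.argmax(scores))
```

In a real bi-encoder the two projections are transformer encoders trained so that referenced objects score higher than distractors; the independence of the two sides is what makes candidate embeddings cacheable.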
License type:
Publisher Copyright
Funding Info:
This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project #A18A2b0046)