This paper presents our work on the fourth track of the Ninth Dialog System Technology Challenge (DSTC9), the Situated Interactive MultiModal Conversations (SIMMC) challenge, which aims at building virtual assistants that can handle multimodal inputs and perform multimodal actions. For this challenge, we propose an end-to-end encoder-decoder model based on BART that generates the outputs of the action prediction, response generation, and dialogue state tracking tasks as a single string, and a second model based on Bi-encoders for the response retrieval task. Our models placed first in the action prediction and response retrieval tasks and second in the response generation and dialogue state tracking tasks, earning first place in the challenge's overall ranking. In particular, our Bi-encoder models for the response retrieval task significantly outperformed the other entries in the challenge's official evaluation. Furthermore, our models perform similarly on the two test datasets (the devtest dataset, whose ground truth is publicly released, and the test-std dataset, whose ground truth is withheld), which demonstrates the robustness of our models.
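To make the Bi-encoder retrieval approach concrete, the sketch below shows the core scoring scheme: the dialogue context and each candidate response are encoded independently, and candidates are ranked by dot-product similarity. This is a minimal illustration under stated assumptions; the `embed` function here is a hypothetical toy bag-of-words hashing encoder standing in for the Transformer encoders used in the actual model, and the function names are ours, not the paper's.

```python
import numpy as np

def embed(text, dim=32):
    """Toy stand-in 'encoder': deterministic bag-of-words hashing into a
    fixed-size unit vector. In a real Bi-encoder, each side would be a
    pretrained Transformer encoder producing a dense embedding."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        # Seed a generator per token so the same token always maps to the
        # same random direction within one process.
        rng = np.random.default_rng(abs(hash(tok)) % (2**32))
        vec += rng.standard_normal(dim)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def rank_candidates(context, candidates):
    """Bi-encoder retrieval: encode the context and every candidate response
    independently, then rank candidates by dot-product similarity. Because
    the two sides never attend to each other, candidate embeddings can be
    precomputed, making retrieval over large pools fast."""
    ctx_vec = embed(context)
    scores = [float(ctx_vec @ embed(c)) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order]
```

A candidate identical to the context scores a cosine similarity of 1.0, so it is always ranked first; in practice the encoders are trained so that the gold response, rather than a verbatim copy, scores highest.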
This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project #A18A2b0046).