Task-Oriented Multi-Modal Question Answering For Collaborative Applications

Please use this identifier to cite or link to this item: https://oar.a-star.edu.sg/communities-collections/articles/16927

Title:

Task-Oriented Multi-Modal Question Answering For Collaborative Applications

Journal Title:

2020 IEEE International Conference on Image Processing (ICIP)

DOI:

10.1109/ICIP40778.2020.9190659

Publication URL:

https://doi.org/10.1109/ICIP40778.2020.9190659

Authors:

Hui Li Tan, Mei Chee Leong, Qianli Xu, Liyuan Li, Fen Fang, Yi Cheng, Nicolas Gauthier, Ying Sun, Joo Hwee Lim

Keywords:

question answering, multi-modal grounding, human-robot collaboration, hybrid system, corpora

Publication Date:

30 September 2020

Citation:

H. L. Tan et al., "Task-Oriented Multi-Modal Question Answering For Collaborative Applications," 2020 IEEE International Conference on Image Processing (ICIP), 2020, pp. 1426-1430, doi: 10.1109/ICIP40778.2020.9190659.

Abstract:

Cobots that work in human workspaces and adapt to humans need to understand and respond to human inquiries and instructions. In this paper, we propose a new question answering (QA) task and dataset for human-robot collaboration on task-oriented operations, i.e., task-oriented collaborative QA (TC-QA). Unlike conventional video QA, which answers questions about what happened in video clips constrained by scripts and subtitles, TC-QA aims to share common ground for task-oriented operations through question answering. We propose an open-ended (OE) answer format comprising a text reply, an image with the related objects annotated, and a video with the operation duration, to guide operation execution. Designed for grounding, the TC-QA dataset comprises query videos and questions that seek acknowledgement, correction, attention to task-related objects, and information on objects or operations. Because real-world tasks are flexible and training samples are limited, we propose and evaluate a baseline method based on a hybrid approach. The hybrid approach combines deep learning methods for object detection, hand detection, and gesture recognition with symbolic reasoning that grounds the question in the observation to provide the answer. Our experiments show that the hybrid method is effective for the TC-QA task.
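
To make the answer format and the hybrid idea in the abstract more concrete, the following is a minimal Python sketch. It is not the authors' implementation: every name here (`Detection`, `Answer`, `pointed_object`, `answer_what_is_this`) is hypothetical, the detections are assumed to come from some upstream object and hand detector, and the nearest-object rule is only a stand-in for the symbolic reasoning step the paper describes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    """Hypothetical output of an upstream detector (object or hand)."""
    label: str
    box: Box


@dataclass
class Answer:
    """Hypothetical container mirroring the three-part answer in the abstract."""
    text: str                                               # text reply
    objects: List[Detection] = field(default_factory=list)  # objects to annotate on the image
    segment: Optional[Tuple[float, float]] = None           # (start, end) seconds of the operation


def center(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def pointed_object(hand: Detection, objects: List[Detection]) -> Optional[Detection]:
    """Assumed rule: the referred object is the one whose centre is nearest the hand."""
    if not objects:
        return None
    hx, hy = center(hand.box)

    def squared_distance(obj: Detection) -> float:
        ox, oy = center(obj.box)
        return (ox - hx) ** 2 + (oy - hy) ** 2

    return min(objects, key=squared_distance)


def answer_what_is_this(hand: Detection, objects: List[Detection]) -> Answer:
    """Ground a 'What is this?' question on the detections and build an answer."""
    target = pointed_object(hand, objects)
    if target is None:
        return Answer(text="I cannot see any object.")
    return Answer(text=f"This is a {target.label}.", objects=[target])


hand = Detection("hand", (100, 100, 120, 120))
scene = [
    Detection("screwdriver", (110, 110, 150, 130)),
    Detection("wrench", (400, 400, 450, 450)),
]
reply = answer_what_is_this(hand, scene)
print(reply.text)
```

In this sketch the learned components are reduced to plain `Detection` records so that the rule-based grounding step can be read on its own; a fuller system would also fill `segment` from the detected operation duration in the video.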

License type:

Publisher Copyright

Funding Info:

This research is supported by the Agency for Science, Technology and Research - AME Programmatic Funding Scheme
Grant Reference No.: A18A2b0046

This research is supported by the National Research Foundation, Singapore - NRF-ISF Joint Call
Grant Reference No.: NRF2015-NRF-ISF001-2541

Description:

© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

URI:

https://oar.a-star.edu.sg/communities-collections/articles/16927

ISSN:

2381-8549
1522-4880

ISBN:

978-1-7281-6395-6
978-1-7281-6394-9
978-1-7281-6396-3

Collections:

Institute for Infocomm Research

Files uploaded:

Manuscripts in This Item:

File	Size	Format
20200522-icip-cameraready-task-oriented-multi-modal-question-answering.pdf	277.24 KB	PDF