This work demonstrates the feasibility and benefits of using pointing gestures, a naturally generated additional input modality, to improve the multi-modal comprehension accuracy of human instructions to robotic agents for collaborative tasks. We present M2Gestic, a system that combines neural-based text parsing with a novel knowledge-graph traversal mechanism over a multi-modal input of vision, natural language text and pointing. Via multiple studies related to a benchmark tabletop manipulation task, we show that (a) M2Gestic can achieve close-to-human performance in reasoning over unambiguous verbal instructions, and (b) incorporating pointing input (even with its inherent location uncertainty) in M2Gestic results in a significant (∼30%) accuracy improvement when verbal instructions are ambiguous.
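For intuition only, the sketch below (not the authors' implementation; the names Candidate, resolve_referent and POINT_SIGMA are hypothetical) illustrates the kind of fusion the abstract describes: a text-derived grounding score for each candidate object is optionally weighted by a Gaussian likelihood centred on the pointed-at tabletop location, so that a pointing gesture can disambiguate an otherwise ambiguous verbal reference despite its location uncertainty.

```python
# Minimal conceptual sketch (assumed, not the M2Gestic implementation) of fusing
# a text-grounding score with an uncertain pointing cue to resolve a referent.
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    position: tuple       # (x, y) tabletop coordinates in metres (assumed frame)
    text_score: float     # grounding score from text parsing / knowledge-graph match, in [0, 1]

POINT_SIGMA = 0.10  # assumed std. dev. of the pointing ray's tabletop intersection (m)

def pointing_likelihood(candidate_xy, pointed_xy, sigma=POINT_SIGMA):
    """Isotropic Gaussian likelihood of a candidate given the pointed-at location."""
    dx = candidate_xy[0] - pointed_xy[0]
    dy = candidate_xy[1] - pointed_xy[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

def resolve_referent(candidates, pointed_xy=None):
    """Pick the candidate with the best text score, weighted by pointing if available."""
    def score(c):
        if pointed_xy is None:          # unambiguous verbal instruction: text alone suffices
            return c.text_score
        return c.text_score * pointing_likelihood(c.position, pointed_xy)
    return max(candidates, key=score)

# Example: "the red block" is ambiguous (two red blocks); pointing breaks the tie.
blocks = [Candidate("red_block_left", (0.10, 0.30), 0.9),
          Candidate("red_block_right", (0.55, 0.30), 0.9)]
print(resolve_referent(blocks, pointed_xy=(0.50, 0.28)).name)  # -> red_block_right
```

In this toy setup the two red blocks tie on the text score alone, and the pointing term resolves the reference even though the pointed location only approximates the true object position.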
License type:
http://creativecommons.org/licenses/by-nc-nd/4.0/
Funding Info:
The National Research Foundation, Singapore under its International Research Centres in Singapore Funding Initiative, NRF Investigatorship (NRF-NRFI05-2019-0007) and Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project A18A2b0046)