Johari, K., Tong, C. T. Z., Subbaraju, V., Kim, J.-J., & Tan, U.-X. (2021). Gaze Assisted Visual Grounding. Lecture Notes in Computer Science, 191–202. doi:10.1007/978-3-030-90525-5_17
There has been an increasing demand for visual grounding in various human-robot interaction applications. However, grounding accuracy is often limited by the size of the dataset that can be collected, and collecting a large dataset is itself often a challenge. Hence, this paper proposes using the natural, implicit input modality of human gaze to assist and improve the visual grounding accuracy of human instructions to robotic agents. Mechanical gear objects are used to demonstrate the capability. We used a transformer-based text classifier and a small corpus to develop a baseline phrase grounding model, and we evaluate this phrase grounding system with and without gaze input to demonstrate the improvement. Gaze information (obtained from a Microsoft HoloLens 2) improves the accuracy from 26% to 65%, leading to more efficient human-robot collaboration and making the approach applicable to hands-free scenarios. The approach is data-efficient, as it requires only a small training dataset to ground natural language referring expressions.
This research/project is supported by the A*STAR - AME Programmatic
Grant Reference no.: A18A2b0046
This is a post-peer-review, pre-copyedit version of an article published in Social Robotics. The final authenticated version is available online at: http://dx.doi.org/10.1007/978-3-030-90525-5_17