Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios

Title:
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
Journal Title:
Advances in Neural Information Processing Systems
Publication Date:
08 December 2024
Citation:
Jaiswal, S., Roy, D., Fernando, B., & Tan, C. (2024). Learning to reason iteratively and parallelly for complex visual reasoning scenarios. In Advances in Neural Information Processing Systems, 37. https://proceedings.neurips.cc/paper_files/paper/2024/hash/f9668d223e713943634dce9c66e8f2c1-Abstract-Conference.html
Abstract:
Complex visual reasoning and question answering (VQA) is a challenging task that requires compositional multi-step processing and higher-level reasoning capabilities beyond the immediate recognition and localization of objects and events. Here, we introduce a fully neural Iterative and Parallel Reasoning Mechanism (IPRM) that combines two distinct forms of computation -- iterative and parallel -- to better address complex VQA scenarios. Specifically, IPRM's "iterative" computation facilitates compositional step-by-step reasoning for scenarios wherein individual operations need to be computed, stored, and recalled dynamically (e.g., when computing the query "determine the color of the pen to the left of the child in a red t-shirt sitting at the white table"). Meanwhile, its "parallel" computation allows for the simultaneous exploration of different reasoning paths and supports more robust and efficient execution of operations that are mutually independent (e.g., counting individual colors for the query "determine the maximum occurring color amongst all t-shirts"). We design IPRM as a lightweight and fully differentiable neural module that can be conveniently applied to both transformer and non-transformer vision-language backbones. It notably outperforms prior task-specific methods and transformer-based attention modules across various image and video VQA benchmarks testing distinct complex reasoning capabilities such as compositional spatiotemporal reasoning (AGQA), situational reasoning (STAR), multi-hop reasoning generalization (CLEVR-Humans), and causal event linking (CLEVRER-Humans). Further, IPRM's internal computations can be visualized across reasoning steps, aiding interpretability and diagnosis of its errors.
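As a rough illustration of the abstract's central contrast (and not the paper's actual IPRM architecture), the toy sketch below shows the two computation styles side by side: an iterative loop that maintains a memory of earlier results for dependent operations, and a parallel pass that evaluates mutually independent operations at once. All function names, the memory structure, and the example scene are hypothetical.

```python
# Toy sketch contrasting iterative vs. parallel computation for
# multi-step reasoning. This is NOT the paper's IPRM module; it is a
# plain-Python illustration of the two styles the abstract describes.

def iterative_reason(steps, state):
    """Apply dependent operations one at a time; each step may read
    the memory of earlier results (compositional, step-by-step)."""
    memory = []
    for op in steps:
        state = op(state, memory)   # each step can recall prior results...
        memory.append(state)        # ...and stores its own result
    return state

def parallel_reason(ops, state):
    """Evaluate mutually independent operations in a single pass and
    aggregate the results (here, an argmax over counts, echoing the
    color-counting example from the abstract)."""
    results = {name: op(state) for name, op in ops.items()}
    return max(results, key=results.get)

# Hypothetical scene: t-shirt colors observed in an image.
scene = ["red", "blue", "red", "green", "red", "blue"]

# Parallel: count every color independently, then take the maximum.
count_ops = {c: (lambda s, c=c: s.count(c)) for c in set(scene)}
print(parallel_reason(count_ops, scene))  # -> red

# Iterative: filter first, then count what remains (order matters).
steps = [
    lambda s, m: [x for x in s if x != "green"],  # step 1: filter
    lambda s, m: len(s),                          # step 2: count result
]
print(iterative_reason(steps, scene))  # -> 5
```

The split mirrors the abstract's examples: the left-of/sitting-at query needs ordered, dependent steps (iterative), while per-color counting has no inter-step dependency, so its operations can run simultaneously (parallel).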
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the National Research Foundation - NRF Fellowship
Grant Reference no. : NRF-NRFF14-2022-0001

This research / project is supported by the Agency for Science, Technology and Research (A*STAR), Science and Engineering Research Council - Central Research Fund (CRF)
Grant Reference no. :

This research / project is supported by the Agency for Science, Technology and Research (A*STAR) - Centre for Frontier AI Research (CFAR)
Grant Reference no. :

Files uploaded:

2024-neurips-iprm.pdf (12.46 MB, PDF)