Situated Reasoning in Real-World Videos (STAR) is a new benchmark for evaluating situated reasoning ability through situation abstraction and logic-grounded question answering on real-world videos. In this paper, we present our submission to the STAR challenge, which achieves the top-1 result in situated question answering. We investigated two approaches to utilizing a vision-language pre-trained model, a classification method and a generation method, and we show that the generation method outperforms the classification method on all question types in the challenge. We also compared these methods with other baselines, including another vision-language pre-trained model, and discuss why different vision-language pre-trained models show a significant performance gap on the STAR challenge.
License type:
Publisher Copyright
Funding Info:
This research/project is supported by the A*STAR AME Programmatic Funding scheme.
Grant Reference no.: A18A2b0046