Situated Reasoning in Real-World Videos (STAR) is a new benchmark for evaluating situated reasoning ability through situation abstraction and logic-grounded question answering on real-world videos. In this paper, we present our submission to the STAR challenge, which achieves the top-1 result in situated question answering. We investigated two approaches to utilizing a vision-language pre-trained model, a classification method and a generation method, and we show that the generation method outperforms the classification method on all question types in the challenge. We also compared these methods with other baselines, including another vision-language pre-trained model, and discuss why different vision-language pre-trained models show a significant performance gap on the STAR challenge.
License type:
Publisher Copyright
Funding Info:
This research/project is supported by the A*STAR AME Programmatic Funding scheme.
Grant Reference no.: A18A2b0046