Comparing Classification and Generation Approaches to Situated Reasoning with Vision-language Pre-trained Models

Journal Title:
European Conference on Computer Vision - Machine Visual Common Sense (ECCV-MVCS) workshop (2022)
Publication Date:
23 October 2022
Citation:
0
Abstract:
Situated Reasoning in Real-World Videos (STAR) is a new benchmark for evaluating situated reasoning ability through situation abstraction and logic-grounded question answering on real-world videos. In this paper, we present our submission to the STAR challenge, which achieves the top-1 result for situated question reasoning. We investigated two approaches to utilizing a vision-language pre-trained model: a classification method and a generation method. We show that the generation method outperforms the classification method across all question types of the challenge. We also compared both methods with other baselines, including another vision-language pre-trained model, and discuss why different vision-language pre-trained models show a significant performance gap on the STAR challenge.
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the A*STAR - AME Programmatic Funding
Grant Reference no. : A18A2b0046
ISBN:
ECCV_2022_MVCS_STAR_challenge_final
Files uploaded:
eccv-2022-mvcs-star-challenge-camera-ready.pdf (PDF, 220.62 KB)