Enhancing few-shot KB-VQA with panoramic image captions guided by Large Language Models

Title:
Enhancing few-shot KB-VQA with panoramic image captions guided by Large Language Models
Journal Title:
Neurocomputing
Publication Date:
14 January 2025
Citation:
Qiang, P., Tan, H., Li, X., Wang, D., Li, R., Sun, X., Zhang, H., & Liang, J. (2025). Enhancing few-shot KB-VQA with panoramic image captions guided by Large Language Models. Neurocomputing, 623, 129373. https://doi.org/10.1016/j.neucom.2025.129373
Abstract:
Current state-of-the-art (SOTA) KB-VQA techniques convert images into captions and use them as prompts, harnessing the reasoning capabilities of large language models (LLMs) to generate answers. However, generic image captions often miss crucial visual details that LLMs need to answer precisely. To address this challenge, we propose an image captioning model that uses a set of visual language models, such as BLIP2, GRiT, and OCR, to extract rich visual information from images. Subsequently, we employ the inferential and summarization capabilities of an LLM to generate panoramic image descriptions enriched with intricate details. Simultaneously, we employ Contextual Constraint Examples and a Constraint Instruction to mitigate the hallucinations that can arise in LLM-generated image captions. Extensive experiments validate the superiority and scalability of our proposed method, which achieves significant improvements over SOTA methods in challenging few-shot settings. For instance, on the challenging OK-VQA dataset, our method outperforms PICa by 6.5%. On the VQAv2 dataset, it surpasses the SOTA approach by 5.4%.
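The caption-generation step described in the abstract can be illustrated with a minimal sketch of prompt assembly. This is not the authors' implementation: the `VisualEvidence` container, the function names, the prompt layout, and the wording of `CONSTRAINT_INSTRUCTION` are all hypothetical, since the abstract does not give them. The sketch assumes that the outputs of the vision tools (a global caption, region captions, OCR strings) are already available as plain text, and it only builds the text that would be sent to an LLM.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VisualEvidence:
    """Text outputs of separate vision tools for one image (hypothetical container)."""
    global_caption: str                                        # e.g. from a BLIP2-style captioner
    region_captions: List[str] = field(default_factory=list)   # e.g. from a GRiT-style dense captioner
    ocr_text: List[str] = field(default_factory=list)          # strings read by an OCR tool


# Assumed wording; the paper's actual Constraint Instruction is not quoted in the abstract.
CONSTRAINT_INSTRUCTION = (
    "Write one detailed caption using only the evidence listed. "
    "Do not mention objects, text, or attributes that are not in the evidence."
)


def format_evidence(ev: VisualEvidence) -> str:
    """Render the tool outputs as labelled lines, skipping empty sources."""
    lines = [f"Global caption: {ev.global_caption}"]
    if ev.region_captions:
        lines.append("Regions: " + "; ".join(ev.region_captions))
    if ev.ocr_text:
        lines.append("Visible text: " + "; ".join(ev.ocr_text))
    return "\n".join(lines)


def build_caption_prompt(
    ev: VisualEvidence,
    examples: List[Tuple[VisualEvidence, str]],
) -> str:
    """Instruction first, then in-context (evidence, caption) examples, then the query image."""
    parts = [CONSTRAINT_INSTRUCTION]
    for ex_ev, ex_caption in examples:
        parts.append(format_evidence(ex_ev) + "\nPanoramic caption: " + ex_caption)
    # The final block ends with an open slot for the LLM to complete.
    parts.append(format_evidence(ev) + "\nPanoramic caption:")
    return "\n\n".join(parts)
```

In this sketch the in-context examples stand in for the Contextual Constraint Examples: each one pairs tool evidence with a caption that stays within that evidence, so the LLM sees the intended grounding behaviour before it completes the final slot.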
License type:
Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Funding Info:
This research / project is supported by the National Natural Science Foundation of China
Grant Reference no.: 62076155, 62176145

This research / project is supported by the Science and Technology Cooperation and Exchange Special Project of ShanXi Province
Grant Reference no.: 202204041101016
ISSN:
0925-2312
Files uploaded:

File: enhancing-few-shot-kb-vqa-with-panoramic-image.pdf
Size: 1.06 MB
Format: PDF