Enhancing few-shot KB-VQA with panoramic image captions guided by Large Language Models

Please use this identifier to cite or link to this item: https://oar.a-star.edu.sg/communities-collections/articles/22064

Title:

Enhancing few-shot KB-VQA with panoramic image captions guided by Large Language Models

Journal Title:

Neurocomputing

DOI:

10.1016/j.neucom.2025.129373

Publication URL:

https://doi.org/10.1016/j.neucom.2025.129373

Authors:

Pengpeng Qiang, Hongye Tan, Xiaoli Li, Dian Wang, Ru Li, Xinyi Sun, Hu Zhang, Jiye Liang

Publication Date:

14 January 2025

Citation:

Qiang, P., Tan, H., Li, X., Wang, D., Li, R., Sun, X., Zhang, H., & Liang, J. (2025). Enhancing few-shot KB-VQA with panoramic image captions guided by Large Language Models. Neurocomputing, 623, 129373. https://doi.org/10.1016/j.neucom.2025.129373

Abstract:

Current state-of-the-art (SOTA) KB-VQA techniques convert images into captions and use them as prompts, harnessing the reasoning capabilities of large language models (LLMs) to generate answers. However, generic image captions often fail to capture crucial visual details that LLMs need to deliver precise responses. To address this challenge, we propose an image captioning model that uses a set of visual language models, such as BLIP2, GRiT, and OCR, to extract rich visual information from images. We then employ the inferential and summarization capabilities of an LLM to generate panoramic image descriptions enriched with intricate details. In addition, we employ Contextual Constraint Examples and a Constraint Instruction to mitigate the hallucination that can arise in LLM-generated image captions. Extensive experiments validate the superiority and scalability of our proposed method, which achieves significant improvements over SOTA methods in challenging few-shot settings. For instance, on the challenging OK-VQA dataset, our method outperforms PICa by 6.5%. On the VQAv2 dataset, our method surpasses the SOTA approach by 5.4%.
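
The caption-generation step described in the abstract can be illustrated with a minimal sketch. It only assembles the text prompt that would ask an LLM to fuse several tool outputs into one detailed caption. It does not call any vision model or LLM. All names here (`VisualEvidence`, `format_evidence`, `build_caption_prompt`) and the wording of `CONSTRAINT_INSTRUCTION` are assumptions made for illustration; the abstract does not give the authors' actual prompts or code.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class VisualEvidence:
    """Text outputs of separate vision tools for one image (hypothetical container)."""
    global_caption: str                                           # e.g. from a captioner such as BLIP2
    region_descriptions: List[str] = field(default_factory=list)  # e.g. from a dense captioner such as GRiT
    ocr_text: List[str] = field(default_factory=list)             # e.g. from an OCR engine


# Hypothetical wording: the paper's real Constraint Instruction is not in the abstract.
CONSTRAINT_INSTRUCTION = (
    "Write one detailed description of the image using only the evidence below. "
    "Do not mention objects, text, or attributes that the evidence does not support."
)


def format_evidence(evidence: VisualEvidence) -> str:
    """Render the tool outputs as labelled lines, skipping empty sources."""
    lines = [f"Global caption: {evidence.global_caption}"]
    if evidence.region_descriptions:
        lines.append("Regions: " + "; ".join(evidence.region_descriptions))
    if evidence.ocr_text:
        lines.append("Visible text: " + "; ".join(evidence.ocr_text))
    return "\n".join(lines)


def build_caption_prompt(
    evidence: VisualEvidence,
    examples: Sequence[Tuple[VisualEvidence, str]] = (),
) -> str:
    """Instruction, then worked (evidence, caption) examples, then the new evidence."""
    parts = [CONSTRAINT_INSTRUCTION, ""]
    for example_evidence, example_caption in examples:
        parts.append(format_evidence(example_evidence))
        parts.append(f"Description: {example_caption}")
        parts.append("")
    parts.append(format_evidence(evidence))
    parts.append("Description:")  # left open for the LLM to complete
    return "\n".join(parts)
```

The resulting string would be sent to whichever LLM is in use, and the returned caption would then replace a generic caption in the few-shot question-answering prompt. Placing the instruction and grounded examples before the new evidence is one plausible way to discourage unsupported details, not necessarily the authors' design.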

License type:

Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

Funding Info:

This research is supported by the National Natural Science Foundation of China.
Grant Reference no.: 62076155, 62176145

This research is supported by the Science and Technology Cooperation and Exchange Special Project of ShanXi Province.
Grant Reference no.: 202204041101016

URI:

https://oar.a-star.edu.sg/communities-collections/articles/22064

ISSN:

0925-2312

Collections:

Institute for Infocomm Research

Manuscripts in This Item:

File	Size	Format	Action
enhancing-few-shot-kb-vqa-with-panoramic-image.pdf	1.06 MB	PDF	Request a copy