MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations

Title:
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations
Conference Title:
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Publication Date:
13 August 2025
Citation:
Z. Zhang, Y. Yu, Y. Chen, X. Yang and S. Y. Yeo, "MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations," 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2025, pp. 29744-29755, doi: 10.1109/CVPR52734.2025.02769.
Abstract:
Despite significant progress in Vision-Language Pre-training (VLP), current approaches predominantly emphasize feature extraction and cross-modal comprehension, with limited attention to generating or transforming visual content. This gap hinders the model's ability to synthesize coherent and novel visual representations from textual prompts, thereby reducing the effectiveness of multi-modal learning. In this work, we propose MedUnifier, a unified VLP framework tailored for medical data. MedUnifier seamlessly integrates text-grounded image generation capabilities with multi-modal learning strategies, including image-text contrastive alignment, image-text matching, and image-grounded text generation. Unlike traditional methods that rely on continuous visual representations, our approach employs visual vector quantization, which not only facilitates a more cohesive learning strategy for cross-modal understanding but also enhances multi-modal generation quality by effectively leveraging discrete representations. Our framework's effectiveness is evidenced by experiments on established benchmarks spanning uni-modal, cross-modal, and multi-modal tasks, where it achieves state-of-the-art performance. MedUnifier also offers a highly adaptable tool for a wide range of language and vision tasks in healthcare, marking an advance toward the development of a generalizable AI model for medical applications.
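As a rough illustration of the "discrete visual representations" idea the abstract describes, the sketch below implements a generic VQ-VAE-style vector-quantization bottleneck in PyTorch: continuous visual features are snapped to their nearest entries in a learnable codebook, with a straight-through gradient estimator. The class name, codebook size, and feature dimensions are illustrative assumptions; this is not the authors' MedUnifier implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Hypothetical VQ-VAE-style bottleneck; not the authors' MedUnifier code."""
    def __init__(self, num_codes=512, code_dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)  # learnable discrete codes
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z):
        # z: (batch, num_tokens, code_dim) continuous visual features
        flat = z.reshape(-1, z.shape[-1])
        # Squared L2 distance from every feature vector to every codebook entry
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
        indices = dists.argmin(dim=1)             # nearest-code index per token
        z_q = self.codebook(indices).view_as(z)   # quantized (discrete) features
        # Codebook and commitment losses, as in VQ-VAE (van den Oord et al., 2017)
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: copy gradients from z_q back to z
        z_q = z + (z_q - z).detach()
        return z_q, indices.view(z.shape[:-1]), loss

# Usage: quantize a batch of 196 visual tokens with 256-dim features
vq = VectorQuantizer()
z_q, ids, vq_loss = vq(torch.randn(8, 196, 256))
print(z_q.shape, ids.shape, vq_loss.item())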
License type:
Publisher Copyright
Funding Info:
This research/project is supported by the Ministry of Education - Academic Research Fund Tier 1
Grant Reference no.: RG25/24 and RS16/23

This research/project is supported by the Ministry of Education - Start-Up Grant
Grant Reference no.: NA
Description:
© 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ISSN:
1063-6919
Files uploaded:

File: zhang-medunifier-unifying-vision-and-language-pre-training-on-medical-data-with-vision-generation-cvpr-2025-paper.pdf
Size: 938.67 KB
Format: PDF
Availability: Request a copy