Self-Training Vision Language BERTs with a Unified Conditional Model

Journal Title:
IEEE Transactions on Circuits and Systems for Video Technology
Publication Date:
10 January 2023
Yang, X., Lv, F., Liu, F., & Lin, G. (2023). Self-Training Vision Language BERTs with a Unified Conditional Model. IEEE Transactions on Circuits and Systems for Video Technology, 1–1.
Natural language BERTs are trained on language corpora in a self-supervised manner. Unlike natural language BERTs, vision language BERTs (VL-BERTs) need paired data to train, which restricts the scale of VL-BERT pretraining. We propose a self-training approach that allows training VL-BERTs from unlabeled image data. The proposed method starts with our unified conditional model, a vision language BERT model that can perform zero-shot conditional generation. Given different conditions, the unified conditional model can generate captions, dense captions, and even questions. We use the labeled image data to train a teacher model and use the trained model to generate pseudo captions on unlabeled image data. We then combine the labeled data and pseudo-labeled data to train a student model. The process is iterated by using the student model as the new teacher. Using the proposed self-training approach and only 300k unlabeled extra images, we obtain competitive or even better performance compared with models of similar size trained with 3 million extra images.
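The teacher-student loop described in the abstract can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the names `self_train` and `toy_train`, the `rounds` parameter, and the string image identifiers are all hypothetical, and the toy "model" is a lookup table standing in for a VL-BERT captioner.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]            # (image_id, caption); hypothetical data format
Captioner = Callable[[str], str]  # maps an image_id to a caption


def self_train(
    labeled: List[Pair],
    unlabeled: List[str],
    train_fn: Callable[[List[Pair]], Captioner],
    rounds: int = 2,
) -> Captioner:
    """Iterative self-training: teacher -> pseudo captions -> student -> new teacher."""
    # Step 1: train the initial teacher on labeled data only.
    teacher = train_fn(labeled)
    for _ in range(rounds):
        # Step 2: the teacher generates pseudo captions for unlabeled images.
        pseudo = [(image_id, teacher(image_id)) for image_id in unlabeled]
        # Step 3: train a student on labeled plus pseudo-labeled data.
        student = train_fn(labeled + pseudo)
        # Step 4: the student becomes the teacher for the next round.
        teacher = student
    return teacher


def toy_train(pairs: List[Pair]) -> Captioner:
    """Toy stand-in for model training (an assumption for illustration only).

    It memorizes the captions it has seen and falls back to the first
    training caption for unseen images.
    """
    table = dict(pairs)
    fallback = pairs[0][1] if pairs else ""

    def captioner(image_id: str) -> str:
        return table.get(image_id, fallback)

    return captioner


labeled = [("img1", "a dog on grass"), ("img2", "a cat on a sofa")]
unlabeled = ["img3", "img4"]
final_model = self_train(labeled, unlabeled, toy_train, rounds=2)
print(final_model("img2"))  # caption memorized from the labeled set
print(final_model("img3"))  # pseudo caption inherited from the teacher
```

In a real setting, `train_fn` would be replaced by pretraining and fine-tuning of the vision language model, and pseudo captions would typically come from conditional generation rather than a lookup.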
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the AI Singapore - AI Singapore Programme
Grant Reference no. : AISG-RP-2018-003

This research / project is supported by the Ministry of Education - AcRF Tier-1
Grant Reference no. : RG95/20

OPPO research grant
© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Files uploaded:

File: tcsvt3235704-amended.pdf
Size: 931.99 KB
Format: PDF