Yang, X., Lv, F., Liu, F., & Lin, G. (2023). Self-Training Vision Language BERTs with a Unified Conditional Model. IEEE Transactions on Circuits and Systems for Video Technology, 1–1. https://doi.org/10.1109/tcsvt.2023.3235704
Abstract:
Natural language BERTs are trained on language corpora in a self-supervised manner. Unlike natural language BERTs, vision language BERTs need paired image-text data to train, which restricts the scale of VL-BERT pretraining. We propose a self-training approach that allows training VL-BERTs from unlabeled image data. The proposed method starts with our unified conditional model, a vision language BERT model that can perform zero-shot conditional generation. Given different conditions, the unified conditional model can generate captions, dense captions, and even questions. We use the labeled image data to train a teacher model and use the trained model to generate pseudo captions for unlabeled images. We then combine the labeled data and the pseudo-labeled data to train a student model. The process is iterated by using the student model as the new teacher. With the proposed self-training approach and only 300K extra unlabeled images, we achieve performance competitive with or better than models of similar size trained with 3 million extra images.
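
The abstract outlines an iterative teacher-student procedure. Below is a minimal sketch of that loop, assuming hypothetical helper callables train_vl_bert and generate_caption (placeholders standing in for the unified conditional model, not the authors' actual implementation):

# Minimal sketch of the self-training loop described in the abstract.
# train_vl_bert and generate_caption are hypothetical placeholders, not the authors' API.

from typing import Callable, List, Tuple

Image = object    # placeholder for an image (or its extracted features)
Caption = str
Model = object    # placeholder for a trained VL-BERT

def self_train(
    labeled_pairs: List[Tuple[Image, Caption]],
    unlabeled_images: List[Image],
    train_vl_bert: Callable[[List[Tuple[Image, Caption]]], Model],
    generate_caption: Callable[[Model, Image], Caption],
    num_rounds: int = 3,
) -> Model:
    """Teacher-student self-training: each round's student becomes the next teacher."""
    # Train the initial teacher on the labeled image-caption pairs only.
    teacher = train_vl_bert(labeled_pairs)

    for _ in range(num_rounds):
        # The teacher generates pseudo captions for the unlabeled images.
        pseudo_pairs = [(img, generate_caption(teacher, img)) for img in unlabeled_images]

        # The student is trained on the labeled data plus the pseudo-labeled data.
        student = train_vl_bert(labeled_pairs + pseudo_pairs)

        # Iterate: the student becomes the new teacher.
        teacher = student

    return teacher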
License type:
Publisher Copyright
Funding Info:
This research/project is supported by AI Singapore under the AI Singapore Programme (Grant Reference No.: AISG-RP-2018-003).
This research/project is supported by the Ministry of Education under AcRF Tier 1 (Grant Reference No.: RG95/20).