Yang, X., Liu, F., & Lin, G. (2023). Effective End-to-End Vision Language Pretraining with Semantic Visual Loss. IEEE Transactions on Multimedia, 1–10. https://doi.org/10.1109/tmm.2023.3237166
Abstract:
Current vision language pretraining models are dominated by methods that use region visual features extracted from object detectors. Despite their good performance, the extract-then-process pipeline significantly restricts inference speed and therefore limits their real-world use cases. However, training vision language models from raw image pixels is difficult, as raw pixels provide much less prior knowledge than region features. In this paper, we systematically study how to leverage auxiliary visual pretraining tasks to help train end-to-end vision language models. We introduce three types of visual losses that enable much faster convergence and better finetuning accuracy. Compared with region feature models, our end-to-end models could achieve similar or better performance on downstream tasks and run more than 10 times faster during inference. Compared with other end-to-end models, our proposed method could achieve similar or better performance when pretrained for only 10% of the pretraining GPU hours.
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the AI Singapore - AI Singapore Programme
Grant Reference no. : AISG-RP-2018-003
This research / project is supported by the Ministry of Education - AcRF Tier-1
Grant Reference no. : RG95/20