Effective End-to-End Vision Language Pretraining with Semantic Visual Loss

Title:
Effective End-to-End Vision Language Pretraining with Semantic Visual Loss
Journal Title:
IEEE Transactions on Multimedia
Publication Date:
18 January 2023
Citation:
Yang, X., Liu, F., & Lin, G. (2023). Effective End-to-End Vision Language Pretraining with Semantic Visual Loss. IEEE Transactions on Multimedia, 1–10. https://doi.org/10.1109/tmm.2023.3237166
Abstract:
Current vision language pretraining models are dominated by methods using region visual features extracted from object detectors. Despite their good performance, the extract-then-process pipeline significantly restricts inference speed and therefore limits their real-world use cases. However, training vision language models from raw image pixels is difficult, as raw image pixels provide much less prior knowledge than region features. In this paper, we systematically study how to leverage auxiliary visual pretraining tasks to help train end-to-end vision language models. We introduce three types of visual losses that enable much faster convergence and better finetuning accuracy. Compared with region feature models, our end-to-end models achieve similar or better performance on downstream tasks and run more than 10 times faster during inference. Compared with other end-to-end models, our proposed method achieves similar or better performance when pretrained for only 10% of the pretraining GPU hours.
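The abstract does not spell out the form of the three visual losses. As a rough illustration of the general idea of pairing a language-side pretraining objective with an auxiliary semantic loss computed on raw-pixel visual features, the minimal PyTorch sketch below assumes a hypothetical SemanticVisualLoss class, a cross-entropy formulation over patch-level semantic labels, toy tensor shapes, and a 0.5 loss weight; none of these names or values come from the paper.

# Hypothetical sketch of an auxiliary semantic visual loss added to a
# vision-language pretraining objective. This is NOT the paper's code;
# class names, shapes, and the 0.5 weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticVisualLoss(nn.Module):
    """Predict a semantic label for each visual patch feature and score
    it with cross-entropy (one plausible form of a visual loss)."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, patch_feats: torch.Tensor,
                patch_labels: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, feat_dim)
        # patch_labels: (batch, num_patches) integer semantic labels
        logits = self.classifier(patch_feats)          # (B, P, C)
        return F.cross_entropy(logits.flatten(0, 1),   # (B*P, C)
                               patch_labels.flatten()) # (B*P,)


if __name__ == "__main__":
    B, P, D, C = 2, 16, 256, 100            # toy batch/patch/feature/class sizes
    visual_loss = SemanticVisualLoss(D, C)
    feats = torch.randn(B, P, D)            # stand-in for encoder patch features
    labels = torch.randint(0, C, (B, P))    # stand-in for semantic supervision
    itm_loss = torch.tensor(0.7)            # stand-in image-text matching loss
    total = itm_loss + 0.5 * visual_loss(feats, labels)  # joint objective
    print(total.item())

In such a setup the auxiliary term supplies the visual prior knowledge that region features would otherwise provide, which is the motivation the abstract gives for faster convergence.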
License type:
Publisher Copyright
Funding Info:
This research / project is supported by the AI Singapore - AI Singapore Programme
Grant Reference no. : AISG-RP-2018-003

This research / project is supported by the Ministry of Education - AcRF Tier-1
Grant Reference no. : RG95/20

This research / project is supported by an OPPO research grant.
Description:
© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ISSN:
1941-0077 (electronic)
1520-9210 (print)
Files uploaded:

File: tmm3237166-amended.pdf
Size: 832.48 KB
Format: PDF
Action: Request a copy