VL-BERT: Pre-training of Generic Visual-Linguistic Representations
UNITER: UNiversal Image-TExt Representation Learning
Can we learn a universal image-text representation for all V+L tasks?
InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Large-Scale Adversarial Training for Vision-and-Language Representation Learning