This paper focuses on a multi-grained (objects, regions, and images) vision-text alignment pre-training task.
2. Model structure
3. Loss functions
3.1 contrastive loss
The similarity between text features and visual features is defined in both directions:
1. vision-to-text similarity
2. text-to-vision similarity
3. ground truth (GT): one-hot labels
4. loss: cross-entropy over the softmax-normalized similarities
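The steps above can be sketched as a minimal NumPy implementation of an in-batch contrastive loss. This is an illustrative sketch, not the paper's code: `tau` (temperature), the variable names, and the assumption of L2-normalized embeddings are my choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(v_feats, t_feats, tau=0.07):
    """In-batch contrastive loss over N aligned (vision, text) pairs.

    v_feats, t_feats: (N, d) L2-normalized embeddings; pair i matches text i,
    so the one-hot ground truth is the identity matrix.
    """
    sim = v_feats @ t_feats.T / tau          # (N, N) similarity matrix
    p_v2t = softmax(sim, axis=1)             # vision-to-text distribution
    p_t2v = softmax(sim.T, axis=1)           # text-to-vision distribution
    idx = np.arange(len(v_feats))
    # cross-entropy against one-hot targets = -log of the diagonal entries
    return -0.5 * (np.log(p_v2t[idx, idx]).mean()
                   + np.log(p_t2v[idx, idx]).mean())
```

The loss is averaged over both directions, matching items 1-2 above; the diagonal indexing is exactly the cross-entropy with one-hot (identity) targets.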
3.2 matching loss
For each visual concept in a mini-batch, we sample an in-batch hard negative text by following p^{v2t}(V): texts whose features are closer to the current visual feature are more likely to be sampled.
We also sample one hard negative visual concept for each text.
We put the pairs as inputs to the fusion module, and then use x_cls, the output [CLS] embedding of the fusion module, to predict the matching probability p^{match}. The loss is the cross-entropy between p^{match} and the ground-truth match/non-match label.
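In standard image-text matching form, the loss can be written as below. This is a reconstruction consistent with the notation above, not copied from the paper:

```latex
\mathcal{L}_{\mathrm{match}}
  = \mathbb{E}_{(V,T)}\,
    \mathrm{H}\!\left(\mathbf{y}^{\mathrm{match}},\,
                      \mathbf{p}^{\mathrm{match}}(V,T)\right)
```

where y^{match} is the 2-dimensional one-hot label (match vs. non-match) and H is the cross-entropy; the expectation runs over the positive pairs and the sampled hard-negative pairs.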