Multimodal Paper Reading: VLMo

Title

CLIP and ALIGN both use a dual-encoder architecture that encodes images and text separately, with the two modalities interacting only through cosine similarity. This works extremely well for retrieval tasks, but such a shallow interaction between images and text is not enough to handle complex VL classification tasks. ViLT, for example, found that CLIP gives relatively low accuracy on the visual reasoning (VR) task. A later series of works adopted a fusion-encoder architecture: images and text are first embedded separately, and a Transformer encoder then performs the cross-modal interaction. This architecture makes up for the drawback of the dual encoder, but it has to jointly encode every possible image-text pair to compute similarity scores for retrieval tasks. The resulting quadratic time complexity leads to much slower inference than dual-encoder models, whose time complexity is linear. So, **is there a way to combine the two architectures**, using the dual-encoder form for retrieval and the fusion-encoder form for classification? To that end, this paper proposes Mixture-of-Modality-Experts (MoME).
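To make the linear versus quadratic argument concrete, here is a minimal sketch that only counts encoder forward passes during retrieval. Everything in it is a toy assumption for illustration: `toy_encode` and `toy_fusion_encode` are hypothetical stand-ins for real networks, and `CallCounter` exists only to count calls. It is not how VLMo, CLIP, or ViLT are implemented.

```python
import math


class CallCounter:
    """Counts how many times a (toy) encoder forward pass is executed."""

    def __init__(self):
        self.calls = 0


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_encode(item, counter):
    """Hypothetical single-modality encoder: one forward pass per item."""
    counter.calls += 1
    return [float(x) for x in item]


def dual_encoder_scores(images, texts, counter):
    """Encode each image and each text once, then compare with cosine similarity."""
    image_vecs = [toy_encode(img, counter) for img in images]
    text_vecs = [toy_encode(txt, counter) for txt in texts]
    return [[cosine_similarity(iv, tv) for tv in text_vecs] for iv in image_vecs]


def toy_fusion_encode(image, text, counter):
    """Hypothetical fusion encoder: one joint forward pass per (image, text) pair.

    The returned score is a placeholder; a real model would run cross-modal
    attention here.
    """
    counter.calls += 1
    return cosine_similarity(image, text)


def fusion_encoder_scores(images, texts, counter):
    """Every image-text pair needs its own joint forward pass."""
    return [[toy_fusion_encode(img, txt, counter) for txt in texts] for img in images]


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]

    dual_counter, fusion_counter = CallCounter(), CallCounter()
    dual_encoder_scores(images, texts, dual_counter)
    fusion_encoder_scores(images, texts, fusion_counter)

    # 3 images + 4 texts = 7 passes vs. 3 * 4 = 12 joint passes
    print(dual_counter.calls, fusion_counter.calls)
```

With N images and M texts, the dual encoder needs N + M forward passes (and its embeddings can be cached), while the fusion encoder needs N × M. The similarity comparisons in the dual-encoder case are still pairwise, but they are cheap dot products rather than Transformer passes.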
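VLMo is trained with image-text contrastive (ITC), image-text matching (ITM), and masked language modeling (MLM) losses, the same as ALBEF. It also proposes a stagewise pre-training method that makes use of large-scale vision-only and text-only corpora: the model is first pre-trained on vision data, then the language experts are pre-trained on text-only data, and finally the model is used for vision-language pre-training.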
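As an illustration of the ITC objective, the following is a rough sketch of a symmetric contrastive loss over a batch similarity matrix, written in plain Python. It assumes a square matrix `sim` where `sim[i][j]` is the similarity between image `i` and text `j`, with matched pairs on the diagonal. The function names and the fixed `temperature=0.07` are assumptions made for this example, not values taken from the paper.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def itc_loss(sim, temperature=0.07):
    """Symmetric image-text contrastive loss for a square similarity matrix.

    sim[i][j] is the similarity of image i and text j; the matched pair for
    image i is assumed to be text i (the diagonal).
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]

    # Image-to-text: each row is a classification over the texts in the batch.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image: each column is a classification over the images.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (i2t + t2i) / 2


if __name__ == "__main__":
    aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    mismatched = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
    # The aligned batch should give a much smaller loss than the mismatched one.
    print(itc_loss(aligned), itc_loss(mismatched))
```

In the full objective, this term would be combined with the ITM and MLM losses. How the three are weighted is not covered in this note, so it is left out of the sketch.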

Overview of the model


Original article: https://blog.csdn.net/qq_41825704/article/details/134206249