Multimodal Paper Reading: BLIP
BLIP: A Skim Read
Title

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Motivation
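1. Model perspective: existing methods such as CLIP and ALBEF adopt either an encoder-based model or an encoder-decoder model. However, encoder-based models are less straightforward to directly transfer to text generation tasks (e.g. image captioning), whereas encoder-decoder models have not been successfully adopted for image-text retrieval tasks. Is there a single unified framework that handles both?
2. Data perspective: state-of-the-art methods (e.g. CLIP, ALBEF) are all pre-trained on image-text pairs collected from the web. Although scaling up the dataset brings performance gains, this paper shows that noisy web text is suboptimal for vision-language learning.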
Contribution
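1. Bootstrapping: train a model on the noisy dataset collected from the web, then use it to derive a cleaner dataset, and ask whether a better model can be trained on that cleaner data.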
2. Unified: captioner and filter.
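The bootstrapping idea above can be sketched as a small data-cleaning loop. This is a minimal illustration, not the paper's implementation: `Pair`, `bootstrap_dataset`, `toy_captioner`, `toy_match_score`, the `truth` lookup table, and the `0.5` threshold are all hypothetical names and values chosen for the example. It assumes a captioner that proposes a synthetic caption for each web image and a filter that scores how well a caption matches an image; in practice both would be learned models rather than the word-overlap stand-ins used here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Pair:
    image_id: str  # stand-in for an actual image
    caption: str
    source: str    # "web" or "synthetic"


def bootstrap_dataset(
    web_pairs: List[Pair],
    captioner: Callable[[str], str],
    match_score: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical cut-off, not from the paper
) -> List[Pair]:
    """Build a cleaner dataset from noisy web pairs."""
    candidates = list(web_pairs)
    # Step 1: the captioner proposes one synthetic caption per web image.
    for pair in web_pairs:
        candidates.append(Pair(pair.image_id, captioner(pair.image_id), "synthetic"))
    # Step 2: the filter keeps only captions it judges to match the image,
    # whether they came from the web or from the captioner.
    return [p for p in candidates if match_score(p.image_id, p.caption) >= threshold]


# Toy stand-ins (assumptions): a lookup table plays the role of "what the
# image really shows", and word overlap plays the role of a learned filter.
truth = {"img1": "a dog on the grass", "img2": "a red car"}


def toy_captioner(image_id: str) -> str:
    return truth[image_id]


def toy_match_score(image_id: str, caption: str) -> float:
    a, b = set(caption.split()), set(truth[image_id].split())
    return len(a & b) / len(a | b)  # Jaccard overlap in [0, 1]


web = [
    Pair("img1", "a dog on the grass", "web"),     # clean web caption
    Pair("img2", "buy cheap tickets now", "web"),  # noisy web caption
]
clean = bootstrap_dataset(web, toy_captioner, toy_match_score)
for p in clean:
    print(p.source, p.image_id, p.caption)
```

In this toy run the noisy web caption for `img2` is dropped and replaced by a synthetic one, while the clean web caption for `img1` survives. A new model would then be trained on `clean` instead of the raw web pairs.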
Model
Original article: https://blog.csdn.net/qq_41825704/article/details/134207687