• Today's paper reading, 2022-11-10


    Multimodal pretraining papers
    ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
    Vision-and-language tasks evaluated:
    visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval, plus one additional experimental setting.

     

    Key technical innovation:
    introducing separate streams for vision and language processing that communicate through co-attentional transformer layers (a minimal sketch of such a layer follows the notes below).
    Why two streams?

     

    Notes:
    Given an image I represented as a set of region features v_1, ..., v_T and a text input w_0, ..., w_T, the model outputs final representations h_{v0}, ..., h_{vT} and h_{w0}, ..., h_{wT}. Notice that exchange between the two streams is restricted to specific layers, and that the text stream has significantly more processing before interacting with visual features, matching the intuition that the chosen visual features are already fairly high-level and require limited context aggregation compared to words in a sentence.
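    To make the two-stream exchange concrete, here is a minimal sketch of a co-attentional layer in PyTorch. It is not the authors' code; the class name, dimensions, and the use of nn.MultiheadAttention are illustrative assumptions. Each stream uses its own hidden states as queries but attends to the other stream's keys and values.

    import torch
    import torch.nn as nn

    class CoAttentionLayer(nn.Module):
        """One co-attentional exchange: queries from one stream, keys/values from the other."""
        def __init__(self, dim=768, num_heads=12):
            super().__init__()
            self.vis_attends_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.txt_attends_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_v = nn.LayerNorm(dim)
            self.norm_w = nn.LayerNorm(dim)

        def forward(self, h_v, h_w):
            # h_v: (B, T_v, dim) image-region states; h_w: (B, T_w, dim) token states
            v_out, _ = self.vis_attends_txt(h_v, h_w, h_w)   # vision queries attend to language
            w_out, _ = self.txt_attends_vis(h_w, h_v, h_v)   # language queries attend to vision
            return self.norm_v(h_v + v_out), self.norm_w(h_w + w_out)

    # usage: exchange information between 36 region states and 20 token states
    layer = CoAttentionLayer()
    h_v, h_w = layer(torch.randn(2, 36, 768), torch.randn(2, 20, 768))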
     

     

    The first work is over.

     

     

     
    VisualBERT: A Simple and Performant Baseline for Vision and Language
    Two visually-grounded language-model objectives are used for pre-training (a hedged sketch of both losses follows the list):
    (1) part of the text is masked and the model learns to predict the masked words based on the remaining text and visual context;
    (2) the model is trained to determine whether the provided text matches the image.
    The authors show that such pre-training on image-caption data is important for VisualBERT to learn transferable text and visual representations.
    They conduct comprehensive experiments on four vision-and-language tasks: VQA, VCR, NLVR2, and region-to-phrase grounding.
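    The sketch below illustrates the two objectives under stated assumptions: a generic joint encoder that takes text then regions, with hypothetical mlm_head and match_head layers; names and shapes are illustrative, not VisualBERT's actual code.

    import torch.nn.functional as F

    def visualbert_style_losses(encoder, token_ids, mlm_labels, region_feats, match_labels):
        # token_ids: (B, T_w) caption tokens with some positions replaced by [MASK]
        # mlm_labels: (B, T_w) original ids at masked positions, -100 elsewhere (ignored)
        # region_feats: (B, T_v, D) detector features; match_labels: (B,) 1 = caption matches image
        hidden = encoder(token_ids, region_feats)               # (B, T_w + T_v, H), text first (assumed)
        text_hidden = hidden[:, : token_ids.size(1)]

        # (1) masked language modeling grounded in the visual context
        vocab_logits = encoder.mlm_head(text_hidden)            # (B, T_w, vocab_size)
        mlm_loss = F.cross_entropy(
            vocab_logits.reshape(-1, vocab_logits.size(-1)),
            mlm_labels.reshape(-1),
            ignore_index=-100,
        )

        # (2) sentence-image matching predicted from the [CLS] position
        match_logits = encoder.match_head(hidden[:, 0])         # (B, 2)
        match_loss = F.cross_entropy(match_logits, match_labels)
        return mlm_loss + match_loss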

     

     

    The second work is over.

    Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

     

    Approach

    Pre-training tasks: MLM (masked language modeling), MOC (masked object classification), and VLM (visual-linguistic matching); a minimal sketch of the MOC loss follows below.
    Fine-tuning on downstream tasks: image-text retrieval and visual commonsense reasoning.
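    As an illustration of the MOC objective, here is a hedged sketch: the model predicts the detector's object category for regions whose input features were masked. The cls_head classifier and the tensor names are illustrative assumptions, not the paper's code.

    import torch.nn.functional as F

    def moc_loss(region_hidden, detector_labels, masked, cls_head):
        # region_hidden: (B, T_v, H) encoder outputs at image-region positions
        # detector_labels: (B, T_v) object-category ids assigned by the off-the-shelf detector
        # masked: (B, T_v) bool tensor, True where the region feature was masked in the input
        logits = cls_head(region_hidden)                        # (B, T_v, num_categories)
        return F.cross_entropy(logits[masked], detector_labels[masked])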

     

    The third work is over.

    LXMERT: Learning Cross-Modality Encoder Representations from Transformers
    It consists of three Transformer encoders: an object-relationship encoder, a language encoder, and a cross-modality encoder.

     

    The model is pre-trained with five diverse representative tasks (a sketch of the RoI-feature regression in task (2) follows the list):
    (1) masked cross-modality language modeling,
    (2) masked object prediction via RoI-feature regression,
    (3) masked object prediction via detected-label classification,
    (4) cross-modality matching,
    (5) image question answering.
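    A hedged sketch of task (2): for masked regions, regress the original RoI feature with an L2/MSE loss. The regress_head layer and tensor names are illustrative assumptions rather than LXMERT's actual implementation.

    import torch.nn.functional as F

    def roi_feature_regression_loss(region_hidden, target_roi_feats, masked, regress_head):
        # region_hidden: (B, T_v, H) cross-modality encoder outputs at region positions
        # target_roi_feats: (B, T_v, D_roi) the original (un-masked) Faster R-CNN features
        # masked: (B, T_v) bool tensor, True for regions whose input features were masked
        pred = regress_head(region_hidden)                      # (B, T_v, D_roi)
        return F.mse_loss(pred[masked], target_roi_feats[masked])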

     

    The fourth work is over.

  • Original post: https://blog.csdn.net/huihuixiaoxue/article/details/127768088