• Daily paper reading: 2022-11-10


    Multimodal pre-training papers
    ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
    Vision-and-language tasks evaluated:
    visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval, plus a special experiment setting (zero-shot caption-based image retrieval)

     

    Key technical innovation:
    introducing separate streams for vision and language processing that communicate through co-attentional transformer layers.
    Why two-stream? The notes below record the authors' intuition.

     

    Notes:
    Given an image I represented as a set of region features v_1, …, v_T and a text input w_0, …, w_T, our model outputs final representations h_{v0}, …, h_{vT} and h_{w0}, …, h_{wT}. Notice that exchange between the two streams is restricted to specific layers, and that the text stream has significantly more processing before interacting with visual features, matching our intuitions that our chosen visual features are already fairly high-level and require limited context-aggregation compared to words in a sentence.
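
    A minimal sketch of what one such co-attentional block could look like, assuming PyTorch. The class name, shapes, and single-sublayer structure are illustrative assumptions; ViLBERT's actual blocks also contain feed-forward sublayers and are placed only at selected depths.

```python
# Illustrative sketch of a co-attentional transformer block (not ViLBERT's
# actual code): each stream computes queries from itself but takes keys and
# values from the other stream, so this layer is where the two streams talk.
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_norm = nn.LayerNorm(dim)
        self.txt_norm = nn.LayerNorm(dim)

    def forward(self, h_v: torch.Tensor, h_w: torch.Tensor):
        # h_v: (batch, T_v, dim) region features; h_w: (batch, T_w, dim) tokens
        v_out, _ = self.vis_attn(query=h_v, key=h_w, value=h_w)
        w_out, _ = self.txt_attn(query=h_w, key=h_v, value=h_v)
        # residual connection + layer norm, as in a standard transformer block
        return self.vis_norm(h_v + v_out), self.txt_norm(h_w + w_out)
```

    Stacking ordinary self-attention blocks in each stream and inserting blocks like this only at chosen depths gives the restricted-exchange pattern described in the notes above.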

    The first work is over.

    VisualBERT: A Simple and Performant Baseline for Vision and Language
    Two visually-grounded language model objectives for pre-training:
    (1) part of the text is masked and the model learns to predict the masked words based on the remaining text and visual context;
    (2) the model is trained to determine whether the provided text matches the image.
    We show that such pre-training on image caption data is important for VisualBERT to learn transferable text and visual representations.
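
    A hedged sketch of these two losses, assuming PyTorch; `model`, its two outputs, and all argument names are assumptions made for illustration, not the authors' interface.

```python
# Sketch of VisualBERT's two pre-training objectives (illustrative only).
import torch.nn.functional as F


def pretraining_loss(model, token_ids, masked_labels, region_feats, match_labels):
    # token_ids: caption with some positions replaced by [MASK]
    # masked_labels: original ids at masked positions, -100 everywhere else
    # match_labels: 1 if the caption belongs to the image, 0 otherwise
    token_logits, cls_logit = model(token_ids, region_feats)

    # (1) masked language modeling, conditioned on the image regions
    mlm_loss = F.cross_entropy(
        token_logits.view(-1, token_logits.size(-1)),
        masked_labels.view(-1),
        ignore_index=-100,  # only masked positions contribute
    )
    # (2) sentence-image matching, predicted from the pooled output
    match_loss = F.binary_cross_entropy_with_logits(
        cls_logit.squeeze(-1), match_labels.float()
    )
    return mlm_loss + match_loss
```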
    The authors conduct comprehensive experiments on four vision-and-language tasks: VQA, VCR, NLVR², and region-to-phrase grounding.

    The second work is over.

    Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

     

    Approach:

    Pre-training tasks: MLM (masked language modeling), MOC (masked object classification), and VLM (visual-linguistic matching).
    Fine-tuning on downstream tasks: image-text retrieval and visual commonsense reasoning.
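
    Of the three pre-training tasks, MOC is the most vision-specific: region features are masked out and the model predicts the object class the detector assigned to each masked region. A minimal sketch of that idea, assuming PyTorch and a linear classifier head; all names are illustrative, not Unicoder-VL's actual code.

```python
# Sketch of a masked object classification (MOC) loss (illustrative only).
import torch.nn.functional as F


def moc_loss(region_hidden, detector_labels, mask, classifier):
    # region_hidden: (batch, T_v, dim) encoder outputs for the image regions
    # detector_labels: (batch, T_v) object classes assigned by the detector
    # mask: (batch, T_v) bool, True where the input feature was masked out
    # classifier: e.g. nn.Linear(dim, num_classes), mapping hidden -> logits
    logits = classifier(region_hidden[mask])  # (num_masked, num_classes)
    return F.cross_entropy(logits, detector_labels[mask])
```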

     

    The third work is over.

    LXMERT: Learning Cross-Modality Encoder Representations from Transformers
    It consists of three Transformer encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.

     

    pre-train our model with five diverse representative tasks:
    (1) masked cross-modality language modeling,
    (2) masked object prediction via RoI-feature regression,
    (3) masked object prediction via detected-label classification,
    (4) cross-modality matching,
    (5) image question answering.
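
    Task (2) is the least standard of the five: the model must regress the original RoI feature at masked positions. A minimal sketch, assuming PyTorch and a linear regression head; the L2 loss matches the paper's description, everything else is an illustrative assumption.

```python
# Sketch of masked object prediction via RoI-feature regression
# (illustrative only, not LXMERT's actual code).
import torch.nn.functional as F


def roi_regression_loss(region_hidden, original_feats, mask, regressor):
    # region_hidden: (batch, T_v, dim) cross-modality encoder outputs
    # original_feats: (batch, T_v, feat_dim) unmasked RoI features (targets)
    # mask: (batch, T_v) bool, True where the input feature was masked out
    # regressor: e.g. nn.Linear(dim, feat_dim), predicting the RoI feature
    pred = regressor(region_hidden[mask])  # (num_masked, feat_dim)
    return F.mse_loss(pred, original_feats[mask])  # L2 regression loss
```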

     

    over

• Original article: https://blog.csdn.net/huihuixiaoxue/article/details/127768088