• Deep Learning Course 5, Week 4: Transformers Quiz Review


    1. A Transformer Network processes sentences from left to right, one word at a time.
    • False
    • True
    2. Transformer Network methodology is taken from:
    • GRUs and LSTMs
    • Attention Mechanism and RNN style of processing.
    • Attention Mechanism and CNN style of processing.
    • RNN and LSTMs
    3. **What are the key inputs to computing the attention value for each word?**
    • The key inputs to computing the attention value for each word are called the query, knowledge, and vector.
    • The key inputs to computing the attention value for each word are called the query, key, and value.
    • The key inputs to computing the attention value for each word are called the quotation, key, and vector.
    • The key inputs to computing the attention value for each word are called the quotation, knowledge, and value.

    Explanation: The key inputs to computing the attention value for each word are called the query, key, and value.

    4. Which of the following correctly represents Attention?
    • $Attention(Q,K,V)=softmax(\frac{QK^{T}}{\sqrt{d_k}})V$
    • $Attention(Q,K,V)=softmax(\frac{QV^{T}}{\sqrt{d_k}})K$
    • $Attention(Q,K,V)=min(\frac{QK^{T}}{\sqrt{d_k}})V$
    • $Attention(Q,K,V)=min(\frac{QV^{T}}{\sqrt{d_k}})K$
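To make the correct formula concrete, here is a minimal NumPy sketch of scaled dot-product attention; the toy shapes and inputs are illustrative assumptions, not part of the quiz.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query with each key
    weights = softmax(scores, axis=-1)   # attention weights sum to 1 per query
    return weights @ V                   # weighted combination of the values

# Toy example: 3 query positions, 4 key/value positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```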
    5. Are the following statements true regarding Query (Q), Key (K) and Value (V)?
      Q = interesting questions about the words in a sentence
      K = specific representations of words given a Q
      V = qualities of words given a Q
    • False
    • True

    Explanation: Q = interesting questions about the words in a sentence, K = qualities of words given a Q, V = specific representations of words given a Q
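As a hedged illustration of these roles, the sketch below shows one common way Q, K, and V are produced from the same word embeddings X via learned projection matrices (the names W_Q, W_K, W_V and all sizes are assumptions made for the example).

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_k = 5, 16, 8          # 5 words, embedding size 16, head size 8

X = rng.normal(size=(seq_len, d_model))   # one embedding per word in the sentence
W_Q = rng.normal(size=(d_model, d_k))     # projects X into "questions" (queries)
W_K = rng.normal(size=(d_model, d_k))     # projects X into "qualities" (keys)
W_V = rng.normal(size=(d_model, d_k))     # projects X into representations (values)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)          # (5, 8) (5, 8) (5, 8)
```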

    6. [figure: multi-head attention formula]
      $i$ here represents the computed attention weight matrix associated with the $i^{th}$ “word” in a sentence.

    • False
    • True

    Explanation: $i$ here represents the computed attention weight matrix associated with the $i^{th}$ “head” (sequence).
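A minimal sketch of multi-head attention may help: each head i applies scaled dot-product attention with its own projections, and the per-head outputs are concatenated (the final output projection W_O is omitted for brevity). The head count and sizes below are illustrative assumptions.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention, as in the formula from question 4.
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(X, num_heads, d_k, rng):
    heads = []
    for i in range(num_heads):                        # i indexes the head, not a word
        W_Q = rng.normal(size=(X.shape[-1], d_k))
        W_K = rng.normal(size=(X.shape[-1], d_k))
        W_V = rng.normal(size=(X.shape[-1], d_k))
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))
    return np.concatenate(heads, axis=-1)             # concatenate the heads' outputs

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 16))                          # 5 words, d_model = 16
print(multi_head_attention(X, num_heads=4, d_k=8, rng=rng).shape)  # (5, 32)
```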

    7. Following is the architecture within a Transformer Network (without displaying positional encoding and output layer(s)).
      [figure: Transformer Network architecture]
      What is generated from the output of the Decoder’s first block of Multi-Head Attention?
    • Q
    • K
    • V

    Explanation: This first block’s output is used to generate the Q matrix for the next Multi-Head Attention block.
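To illustrate the data flow behind this answer, here is a hedged NumPy sketch of the decoder's second Multi-Head Attention block: Q is generated from the output of the decoder's first block, while K and V come from the encoder's output. Projection matrices are omitted and all sizes are illustrative assumptions.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention (projection matrices omitted for brevity).
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(3)
d_model = 8
decoder_block1_out = rng.normal(size=(4, d_model))   # output of the decoder's first MHA block
encoder_out = rng.normal(size=(6, d_model))          # output of the encoder stack

Q = decoder_block1_out              # queries generated from the first block's output
K, V = encoder_out, encoder_out     # keys and values come from the encoder
print(attention(Q, K, V).shape)     # (4, 8): one context vector per target position
```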

    8. Following is the architecture within a Transformer Network (without displaying positional encoding and output layer(s)).
      [figure: Transformer Network architecture]
      What is the output layer(s) of the Decoder? (Marked $Y$, pointed to by the independent arrow)
    • Softmax layer
    • Linear layer
    • Linear layer followed by a softmax layer.
    • Softmax layer followed by a linear layer.
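In the original Transformer, the decoder's output is produced by a linear layer followed by a softmax over the vocabulary; a minimal sketch, assuming an illustrative vocabulary size and model dimension:

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, vocab_size = 8, 100                       # illustrative sizes

decoder_out = rng.normal(size=(4, d_model))        # one vector per target position
W = rng.normal(size=(d_model, vocab_size))         # linear layer weights
b = np.zeros(vocab_size)                           # linear layer bias

logits = decoder_out @ W + b                       # linear layer
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)         # softmax over the vocabulary
print(probs.shape, probs.sum(axis=-1))             # (4, 100), each row sums to 1
```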
    9. Which of the following statements is true about positional encoding? Select all that apply.
    • Positional encoding is important because position and word order are essential in sentence construction of any language.

    Explanation: This is a correct answer, but other options are also correct. To review the concept, watch the lecture “Transformer Network.”

    • Positional encoding uses a combination of sine and cosine equations.

    Explanation: This is a correct answer, but other options are also correct. To review the concept, watch the lecture “Transformer Network.”

    • Positional encoding is used in the transformer network and the attention model.
    • Positional encoding provides extra information to our model.
    10. Which of these is a good criterion for a good positional encoding algorithm?
    • The algorithm should be able to generalize to longer sentences.
    • Distance between any two time-steps should be inconsistent for all sentence lengths.
    • It must be nondeterministic.
    • It should output a common encoding for each time-step (word’s position in a sentence).
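Both positional-encoding questions above can be made concrete with a small sketch of the sinusoidal scheme, where even dimensions use $\sin(pos/10000^{2i/d})$ and odd dimensions use $\cos(pos/10000^{2i/d})$. The sizes below are illustrative assumptions; the final check shows why such an encoding generalizes to longer sentences: the encoding of a position does not depend on the sentence length.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # positions 0 .. max_len-1
    i = np.arange(0, d_model, 2)[None, :]              # even embedding dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                       # cosine on odd dimensions
    return pe                                          # added to the word embeddings

short = positional_encoding(max_len=20, d_model=16)
long = positional_encoding(max_len=60, d_model=16)
print(np.allclose(long[:20], short))                   # True: longer sentences just extend the table
```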
  • Original article: https://blog.csdn.net/l8947943/article/details/126919395