A Transformer Network processes sentences from left to right, one word at a time.
False
True
Transformer Network methodology is taken from:
GRUs and LSTMs
Attention Mechanism and RNN style of processing.
Attention Mechanism and CNN style of processing.
RNN and LSTMs
**What are the key inputs to computing the attention value for each word?**
The key inputs to computing the attention value for each word are called the query, knowledge, and vector.
The key inputs to computing the attention value for each word are called the query, key, and value.
The key inputs to computing the attention value for each word are called the quotation, key, and vector.
The key inputs to computing the attention value for each word are called the quotation, knowledge, and value.
Explanation: The key inputs to computing the attention value for each word are called the query, key, and value.
Which of the following correctly represents Attention?
Attention(Q,K,V)=softmax(\frac{QK^{T}}{\sqrt{d_k}})V
Attention(Q,K,V)=softmax(\frac{QV^{T}}{\sqrt{d_k}})K
Attention(Q,K,V)=min(\frac{QK^{T}}{\sqrt{d_k}})V
Attention(Q,K,V)=min(\frac{QV^{T}}{\sqrt{d_k}})K
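The first option above is the standard scaled dot-product attention formula. Here is a minimal NumPy sketch of it (shapes, the random seed, and variable names are illustrative assumptions, not part of the quiz):

```python
# Minimal sketch of scaled dot-product attention:
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                         # similarity scores, (seq_q, seq_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ V                                      # weighted sum of values, (seq_q, d_v)

# Tiny example: 3 query positions, 4 key/value positions, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)          # (3, 8)
```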
Are the following statements true regarding Query (Q), Key (K) and Value (V)?
Q = interesting questions about the words in a sentence
K = specific representations of words given a Q
V = qualities of words given a Q
False
True
Explanation: Q = interesting questions about the words in a sentence, K = qualities of words given a Q, V = specific representations of words given a Q.
i here represents the computed attention weight matrix associated with the ith “word” in a sentence.
False
True
Explanation: i here represents the computed attention weight matrix associated with the ith “head” (sequence).
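A minimal NumPy sketch of multi-head attention may make the distinction concrete (projection matrices are untrained random placeholders, purely for illustration): each head i is one full attention computation over the whole sentence with its own projections, and the heads are concatenated afterwards.

```python
# Minimal sketch of multi-head attention: each head i runs its own scaled
# dot-product attention over the whole sentence; the heads are concatenated
# and projected back to d_model. Weights here are random placeholders.
import numpy as np

def multi_head_attention(X, num_heads=2):
    d_model = X.shape[-1]
    d_k = d_model // num_heads
    rng = np.random.default_rng(1)
    heads = []
    for _ in range(num_heads):
        # Per-head projections (illustrative, untrained)
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ V)              # one attention result per head, not per word
    W_O = rng.normal(size=(d_model, d_model))  # output projection
    return np.concatenate(heads, axis=-1) @ W_O

X = np.random.default_rng(2).normal(size=(5, 8))   # 5 "words", d_model = 8
print(multi_head_attention(X).shape)               # (5, 8)
```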
Following is the architecture within a Transformer Network (without displaying positional encoding and output layer(s)).
What is generated from the output of the Decoder’s first block of Multi-Head Attention?
Q
K
V
Explanation: This first block’s output is used to generate the Q matrix for the next Multi-Head Attention block.
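A minimal sketch of that wiring (matrix sizes and names are illustrative assumptions): Q is projected from the output of the decoder's first Multi-Head Attention block, while K and V are projected from the encoder output.

```python
# Sketch of the decoder's second (cross-) attention block: Q comes from the
# decoder's first Multi-Head Attention output, K and V from the encoder output.
# Projection matrices are illustrative placeholders.
import numpy as np

d_model = 8
rng = np.random.default_rng(3)
decoder_block1_out = rng.normal(size=(4, d_model))   # output of decoder's first MHA block
encoder_out = rng.normal(size=(6, d_model))          # output of the encoder stack

W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q = decoder_block1_out @ W_Q     # queries generated from the decoder's first block
K = encoder_out @ W_K            # keys and values come from the encoder
V = encoder_out @ W_V

scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
cross_attended = weights @ V
print(cross_attended.shape)      # (4, 8): one attended vector per decoder position
```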
Following is the architecture within a Transformer Network (without displaying positional encoding and output layer(s)).
What is the output layer(s) of the Decoder? (Marked Y, pointed by the independent arrow)
Softmax layer
Linear layer
Linear layer followed by a softmax layer.
Softmax layer followed by a linear layer.
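A minimal sketch of that output head (vocabulary size and weights are illustrative): a linear layer maps each decoder position to vocabulary logits, and a softmax turns the logits into next-token probabilities.

```python
# Sketch of the decoder output layer(s): a linear layer to vocabulary logits,
# followed by a softmax over the vocabulary. Sizes and weights are illustrative.
import numpy as np

d_model, vocab_size = 8, 100
rng = np.random.default_rng(4)
decoder_out = rng.normal(size=(4, d_model))     # decoder output for 4 positions

W, b = rng.normal(size=(d_model, vocab_size)), np.zeros(vocab_size)
logits = decoder_out @ W + b                    # linear layer
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)      # softmax: probability of each next token
print(probs.shape, probs.sum(axis=-1))          # (4, 100); each row sums to 1
```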
Which of the following statements are true about positional encoding? Select all that apply.
Positional encoding is important because position and word order are essential in sentence construction of any language.
Explanation: This is a correct answer, but other options are also correct. To review the concept, watch the lecture Transformer Network.
Positional encoding uses a combination of sine and cosine equations.
Explanation: This is a correct answer, but other options are also correct. To review the concept, watch the lecture Transformer Network.
Positional encoding is used in the transformer network and the attention model.
Positional encoding provides extra information to our model.
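For reference, the sine/cosine equations referred to above take the following form, with pos the word's position and i the encoding dimension (this is the standard Transformer formulation, not quoted from the quiz):
PE_{(pos,2i)}=\sin(\frac{pos}{10000^{2i/d_{model}}})
PE_{(pos,2i+1)}=\cos(\frac{pos}{10000^{2i/d_{model}}})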
Which of these is a good criterion for a good positional encoding algorithm?
The algorithm should be able to generalize to longer sentences.
Distance between any two time-steps should be inconsistent for all sentence lengths.
It must be nondeterministic.
It should output a common encoding for each time-step (word’s position in a sentence).
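A minimal NumPy sketch of the sinusoidal positional encoding (sizes are illustrative), with a quick check against the criteria above: it is deterministic, gives each time-step a distinct encoding, and extends unchanged to longer sentences.

```python
# Sketch of sinusoidal positional encoding plus a quick check of the criteria:
# deterministic, distinct encoding per time-step, and generalizes to longer
# sentences (the first positions do not change when max_len grows).
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2) dimension index
    angle = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                       # odd dimensions: cosine
    return pe

short, longer = positional_encoding(10, 8), positional_encoding(50, 8)
print(np.allclose(short, longer[:10]))                # True: generalizes to longer sentences
print(len(np.unique(short.round(6), axis=0)))         # 10: each time-step gets its own encoding
```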