CNN: models local context in parallel through sliding convolution windows
RNN: models the sequence recursively, position by position, in order
Transformer: models dependencies between all positions in parallel with self-attention

Encoder-Decoder pattern
Positional information is passed along layer by layer
Transformer example: when translating Chinese into English, the encoder side takes the Chinese sentence and the decoder side produces the English sentence.
PyTorch's Transformer implementation, source code in torch.nn.Transformer:
Constructor arguments:
d_model: embedding dimension of the Transformer inputs and outputs
nhead: number of heads in multi-head self-attention
num_encoder_layers: number of encoder blocks
num_decoder_layers: number of decoder blocks
dim_feedforward: hidden dimension of the feed-forward network (its output dimension is d_model again)
encoder_layer x num_encoder_layers + encoder_norm -> encoder
decoder_layer x num_decoder_layers + decoder_norm -> decoder
Encoder inputs: src, mask, src_key_padding_mask
Decoder inputs: tgt, memory, tgt_mask, memory_mask, tgt_key_padding_mask, memory_key_padding_mask
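A minimal usage sketch of torch.nn.Transformer with these arguments and inputs (the shapes and hyperparameters here are illustrative, not taken from the source):

import torch
import torch.nn as nn

# Illustrative sizes; default batch_first=False, so tensors are (seq_len, batch, d_model).
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048)

src = torch.rand(10, 32, 512)   # encoder input: (src_len, batch, d_model)
tgt = torch.rand(20, 32, 512)   # decoder input: (tgt_len, batch, d_model)

# Causal mask for the decoder self-attention (float mask, -inf above the diagonal).
tgt_mask = model.generate_square_subsequent_mask(20)

out = model(src, tgt, tgt_mask=tgt_mask)   # (20, 32, 512)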
Model inference procedure
At each step, the encoder output (memory) and the tokens predicted so far are fed into the Decoder together to predict the next token.
Model training procedure
Training differs slightly from inference. During training, feeding the predictions of a model that is not yet trained back into it would make the outputs drift further and further off course. Training therefore uses the "Teacher Forcing" trick: no matter what the model outputs, the ground-truth output is fed to the Decoder as the input for the next prediction.
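A rough sketch of one teacher-forcing training step; the vocabulary size, embedding layer, output projection and the one-token shift are illustrative assumptions, and positional encoding is omitted for brevity:

import torch
import torch.nn as nn

# One teacher-forcing step: the ground-truth target, shifted right by one token,
# is fed to the decoder no matter what the model itself would have predicted.
d_model, vocab_size, pad_idx = 512, 10000, 0               # illustrative values
model = nn.Transformer(d_model=d_model, nhead=8)
embed = nn.Embedding(vocab_size, d_model, padding_idx=pad_idx)
proj = nn.Linear(d_model, vocab_size)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

src_tokens = torch.randint(1, vocab_size, (12, 32))        # (src_len, batch)
tgt_tokens = torch.randint(1, vocab_size, (15, 32))        # (tgt_len, batch)

tgt_in = tgt_tokens[:-1]                                   # decoder input: y_1 .. y_{n-1}
tgt_out = tgt_tokens[1:]                                   # labels:        y_2 .. y_n
tgt_mask = model.generate_square_subsequent_mask(tgt_in.size(0))

dec_out = model(embed(src_tokens), embed(tgt_in), tgt_mask=tgt_mask)
logits = proj(dec_out)                                     # (tgt_len - 1, batch, vocab)
loss = criterion(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))
loss.backward()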
Positional encoding example:

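The original figure is not reproduced here; as a stand-in, a sketch of sinusoidal positional encoding in the spirit of The Annotated Transformer (batch-first tensors assumed):

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    "Sinusoidal positional encoding, in the spirit of The Annotated Transformer."
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))    # (1, max_len, d_model)

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)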
TransformerEncoderLayer inherits from Module and is called by TransformerEncoder:
self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout,
                                    batch_first=batch_first, **factory_kwargs)
# Implementation of Feedforward model
self.linear1 = Linear(d_model, dim_feedforward, **factory_kwargs)
self.dropout = Dropout(dropout)
self.linear2 = Linear(dim_feedforward, d_model, **factory_kwargs)
self.norm_first = norm_first
self.norm1 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
self.norm2 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
self.dropout1 = Dropout(dropout)
self.dropout2 = Dropout(dropout)
if isinstance(activation, str):
    activation = _get_activation_fn(activation)
forward:
_sa_block: the self-attention block; _ff_block: the feed-forward block
x = src
if self.norm_first:
    x = x + self._sa_block(self.norm1(x), src_mask, src_key_padding_mask)
    x = x + self._ff_block(self.norm2(x))
else:
    x = self.norm1(x + self._sa_block(x, src_mask, src_key_padding_mask))
    x = self.norm2(x + self._ff_block(x))
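A quick sketch of the norm_first switch on a single layer (illustrative shapes; post-norm is the default):

import torch
import torch.nn as nn

src = torch.rand(10, 32, 512)   # (seq_len, batch, d_model)

# Post-norm (default): x = LayerNorm(x + sublayer(x))
post_norm_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)

# Pre-norm: x = x + sublayer(LayerNorm(x))
pre_norm_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, norm_first=True)

print(post_norm_layer(src).shape)   # torch.Size([10, 32, 512])
print(pre_norm_layer(src).shape)    # torch.Size([10, 32, 512])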
TransformerEncoder, inherits from Module
It stacks encoder_layer num_layers times; encoder_layer is a TransformerEncoderLayer, and _get_clones is based on copy.deepcopy():
self.layers = _get_clones(encoder_layer, num_layers)
self.num_layers = num_layers
def _get_clones(module, N):
    return ModuleList([copy.deepcopy(module) for i in range(N)])
forward:
convert_to_nested speeds up TransformerEncoderLayer; each layer takes output in and produces output out, and the layers are called in a loop:
for mod in self.layers:
    if convert_to_nested:
        output = mod(output, src_mask=mask)
    else:
        output = mod(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask)
if convert_to_nested:
    output = output.to_padded_tensor(0.)
if self.norm is not None:
    output = self.norm(output)
return output
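A small usage sketch of TransformerEncoder with a padding mask (shapes are illustrative):

import torch
import torch.nn as nn

# num_layers deep copies of the layer, plus an optional final norm.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6,
                                norm=nn.LayerNorm(512))

src = torch.rand(10, 32, 512)                      # (seq_len, batch, d_model)
# True marks padding positions that attention should ignore; shape (batch, seq_len).
src_key_padding_mask = torch.zeros(32, 10, dtype=torch.bool)
src_key_padding_mask[:, 8:] = True                 # pretend the last 2 tokens are padding

out = encoder(src, src_key_padding_mask=src_key_padding_mask)   # (10, 32, 512)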
TransformerDecoderLayer, inherits from Module and is similar to TransformerEncoderLayer:
self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first, **factory_kwargs)
self.multihead_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first, **factory_kwargs)
# Implementation of Feedforward model
self.linear1 = Linear(d_model, dim_feedforward, **factory_kwargs)
self.dropout = Dropout(dropout)
self.linear2 = Linear(dim_feedforward, d_model, **factory_kwargs)
self.norm_first = norm_first
self.norm1 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
self.norm2 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
self.norm3 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
self.dropout1 = Dropout(dropout)
self.dropout2 = Dropout(dropout)
self.dropout3 = Dropout(dropout)
forward:
_sa_block -> _mha_block -> _ff_block, so three sets of LayerNorm and Dropout are needed.
x = tgt
if self.norm_first:
    x = x + self._sa_block(self.norm1(x), tgt_mask, tgt_key_padding_mask)
    x = x + self._mha_block(self.norm2(x), memory, memory_mask, memory_key_padding_mask)
    x = x + self._ff_block(self.norm3(x))
else:
    x = self.norm1(x + self._sa_block(x, tgt_mask, tgt_key_padding_mask))
    x = self.norm2(x + self._mha_block(x, memory, memory_mask, memory_key_padding_mask))
    x = self.norm3(x + self._ff_block(x))
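A small usage sketch of a single TransformerDecoderLayer with a causal mask (shapes are illustrative):

import torch
import torch.nn as nn

decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)

memory = torch.rand(10, 32, 512)   # encoder output: (src_len, batch, d_model)
tgt = torch.rand(20, 32, 512)      # decoder input:  (tgt_len, batch, d_model)

# Additive causal mask: -inf above the diagonal, 0 elsewhere.
tgt_mask = torch.triu(torch.full((20, 20), float('-inf')), diagonal=1)

# Internally: _sa_block (masked self-attention) -> _mha_block (cross-attention
# over memory) -> _ff_block, each with its own residual + LayerNorm + Dropout.
out = decoder_layer(tgt, memory, tgt_mask=tgt_mask)   # (20, 32, 512)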
_sa_block and _mha_block differ in their query, key and value:
In _sa_block, query, key and value are all x; in _mha_block, the query is x while the key and value are both mem, i.e. the encoder output.
# MultiheadAttention
def forward(self, query: Tensor, key: Tensor, value: Tensor,
            key_padding_mask: Optional[Tensor] = None, need_weights: bool = True,
            attn_mask: Optional[Tensor] = None,
            average_attn_weights: bool = True) -> Tuple[Tensor, Optional[Tensor]]:

def _sa_block(self, x: Tensor,
              attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:
    x = self.self_attn(x, x, x,
                       attn_mask=attn_mask,
                       key_padding_mask=key_padding_mask,
                       need_weights=False)[0]
    return self.dropout1(x)

# multihead attention block
def _mha_block(self, x: Tensor, mem: Tensor,
               attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:
    x = self.multihead_attn(x, mem, mem,
                            attn_mask=attn_mask,
                            key_padding_mask=key_padding_mask,
                            need_weights=False)[0]
    return self.dropout2(x)
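The same query/key/value difference can be seen directly with nn.MultiheadAttention; one module is reused here purely to show the call shapes, whereas the real layer holds two separate modules (self_attn and multihead_attn):

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8)

x = torch.rand(20, 32, 512)      # decoder hidden states (tgt_len, batch, d_model)
mem = torch.rand(10, 32, 512)    # encoder output        (src_len, batch, d_model)

# Self-attention (_sa_block): query, key and value are all x.
self_out, _ = mha(x, x, x, need_weights=False)

# Cross-attention (_mha_block): query is x, key and value are the encoder memory.
cross_out, _ = mha(x, mem, mem, need_weights=False)

print(self_out.shape, cross_out.shape)   # both (20, 32, 512)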
The formula for Q, K, V (scaled dot-product attention):
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
TransformerDecoder
for mod in self.layers:
    output = mod(output, memory, tgt_mask=tgt_mask,
                 memory_mask=memory_mask,
                 tgt_key_padding_mask=tgt_key_padding_mask,
                 memory_key_padding_mask=memory_key_padding_mask)
if self.norm is not None:
    output = self.norm(output)
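A small usage sketch of TransformerDecoder, adding a memory_key_padding_mask (shapes are illustrative):

import torch
import torch.nn as nn

decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6, norm=nn.LayerNorm(512))

memory = torch.rand(10, 32, 512)   # encoder output
tgt = torch.rand(20, 32, 512)
tgt_mask = torch.triu(torch.full((20, 20), float('-inf')), diagonal=1)

# True marks source positions that are padding and should be ignored; (batch, src_len).
memory_key_padding_mask = torch.zeros(32, 10, dtype=torch.bool)

out = decoder(tgt, memory, tgt_mask=tgt_mask,
              memory_key_padding_mask=memory_key_padding_mask)   # (20, 32, 512)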
MultiheadAttention: see Harvard NLP's The Annotated Transformer

Scaled Dot-Product Attention: Q and K are combined by matrix (dot) product, and dividing by sqrt(d_k) keeps the softmax distribution more stable, with smaller variance.
scores.masked_fill(mask == 0, -1e9) fills the positions where the mask is 0 with -1e9 so they are effectively zeroed out by the softmax:
import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
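An illustrative shape check for the attention() helper above, using the same 1-means-keep mask convention as subsequent_mask below:

import torch

q = torch.rand(2, 8, 5, 64)                    # (batch, heads, seq_len, d_k)
k = torch.rand(2, 8, 5, 64)
v = torch.rand(2, 8, 5, 64)

# Causal mask: 1 = may attend, 0 = masked; broadcasts over batch and heads.
mask = torch.tril(torch.ones(1, 5, 5, dtype=torch.uint8))

out, p_attn = attention(q, k, v, mask=mask)
print(out.shape, p_attn.shape)                 # (2, 8, 5, 64) and (2, 8, 5, 5)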
The Decoder's mask:
import numpy as np
import torch
import matplotlib.pyplot as plt

def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(subsequent_mask) == 0

plt.figure(figsize=(5, 5))
plt.imshow(subsequent_mask(20)[0])
None
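Note that torch.nn.Transformer uses a different mask convention from the boolean subsequent_mask above: its attn_mask is additive, with 0 for positions that may be attended to and -inf for blocked positions. A small sketch of the two conventions side by side:

import torch

size = 5
# Boolean convention (The Annotated Transformer): True = may attend.
bool_mask = torch.tril(torch.ones(size, size, dtype=torch.bool))

# Additive convention (torch.nn.Transformer's attn_mask): 0 = may attend, -inf = blocked.
float_mask = torch.triu(torch.full((size, size), float('-inf')), diagonal=1)

print(bool_mask)
print(float_mask)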
