I'm just a beginner writing this as a study note for myself, in the hope that it also helps other newcomers; corrections from more experienced readers are very welcome. If anything here infringes a copyright, I'll remove it immediately.

Input: the hidden state, with shape (b, s, h); the final output has the same shape (b, s, h).
If a memory input (mem) is provided, it is also passed through LayerNorm and fed into the self attention.
In short, LayerNorm normalizes all the features of each individual sample. This keeps the activations of the current layer in a stable range, helps avoid vanishing or exploding gradients, and makes the following layers easier to train.
Formula (where $\epsilon$ keeps the denominator from being zero):

$$y=\frac{x-\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}$$
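To make the formula concrete, here is a small sketch of my own (not code from CogView) that normalizes over the feature dimension by hand and checks the result against torch.nn.LayerNorm:

```python
import torch

x = torch.randn(2, 4, 8)  # (b, s, h)

# Manual LayerNorm over the feature dimension h
eps = 1e-5
mean = x.mean(dim=-1, keepdim=True)                 # E[x]
var = x.var(dim=-1, unbiased=False, keepdim=True)   # Var[x]
y_manual = (x - mean) / torch.sqrt(var + eps)

# Built-in LayerNorm (elementwise_affine=False drops the learnable gamma/beta)
layer_norm = torch.nn.LayerNorm(8, eps=eps, elementwise_affine=False)
y_builtin = layer_norm(x)

print(torch.allclose(y_manual, y_builtin, atol=1e-6))  # True
```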
For details on the attention part, see the post CogView中的Self attention (tt丫的博客-CSDN博客).
For an introduction to the residual structure, see 深度学习之Resnet详解|CSDN创作打卡 (tt丫的博客-CSDN博客).
Residual connections are used to address the degradation problem of deep networks.
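As a quick reminder of what a residual connection is, here is a minimal illustration of my own (the sublayer is just a placeholder Linear):

```python
import torch

def residual_block(x, sublayer):
    # output = input + F(input); gradients can always flow through the identity path
    return x + sublayer(x)

x = torch.randn(2, 4, 8)
out = residual_block(x, torch.nn.Linear(8, 8))
print(out.shape)  # torch.Size([2, 4, 8])
```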
```python
class GPT2ParallelTransformerLayer(torch.nn.Module):
    """A single layer transformer for GPT2.

    We use the following notation:
        h: hidden size
        n: number of attention heads
        b: batch size
        s: sequence length
    Transformer layer takes input with size [b, s, h] and returns an
    output of the same size.

    Arguments:
        hidden_size: The hidden size of the self attention.
        num_attention_heads: number of attention heads in the self
            attention.
        attention_dropout_prob: dropout probability of the attention
            score in self attention.
        output_dropout_prob: dropout probability for the outputs
            after self attention and final output.
        layernorm_epsilon: epsilon used in layernorm to avoid
            division by zero.
        init_method: initialization method used for the weights. Note
            that all biases are initialized to zero and layernorm
            weights are initialized to one.
        output_layer_init_method: output layers (attention output and
            mlp output) initialization. If None, use `init_method`.
    """
    def __init__(self,
                 hidden_size,
                 num_attention_heads,
                 attention_dropout_prob,
                 output_dropout_prob,
                 layernorm_epsilon,
                 init_method,
                 output_layer_init_method=None,
                 query_window=128,
                 key_window_times=6,
                 scale_normalization=True
                 ):
        super(GPT2ParallelTransformerLayer, self).__init__()
```
Output layers: if output_layer_init_method is None, the output-layer weights fall back to init_method for initialization.
Input: the LayerNorm class is called to normalize the layer input.
```python
        # Set output layer initialization if not provided.
        if output_layer_init_method is None:
            output_layer_init_method = init_method

        # Layernorm on the input data.
        self.input_layernorm = LayerNorm(hidden_size, eps=layernorm_epsilon)
```
The LayerNorm class: it first scales the input values down and then applies fused layer normalization.
```python
from apex.normalization.fused_layer_norm import FusedLayerNorm

class LayerNorm(FusedLayerNorm):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def forward(self, x):
        # Divide the input by 1/8 of its maximum absolute value, then normalize.
        return super().forward(x / (x.abs().max().detach() / 8))
```
Here the class inherits from FusedLayerNorm in apex.normalization.fused_layer_norm, and its forward divides all input values by one eighth of the input's maximum absolute value before calling the parent's forward (see apex.normalization.fused_layer_norm — Apex 0.1.0 documentation for the base class).
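If apex is not available, the same wrapper can be sketched on top of torch.nn.LayerNorm; this is my own equivalent for illustration, not the repository's code:

```python
import torch

class ScaledLayerNorm(torch.nn.LayerNorm):
    """Sketch of CogView's LayerNorm wrapper, built on torch.nn.LayerNorm
    instead of apex's FusedLayerNorm (my assumption, for environments without apex)."""
    def forward(self, x):
        # Divide the whole tensor by 1/8 of its maximum absolute value,
        # detached so the scaling factor itself receives no gradient.
        return super().forward(x / (x.abs().max().detach() / 8))

ln = ScaledLayerNorm(8, eps=1e-5)
out = ln(torch.randn(2, 4, 8))  # same shape (b, s, h)
```

Note that layer normalization is (up to the effect of $\epsilon$) invariant to rescaling its input, so this pre-division barely changes the mathematical result; its point is numerical safety, keeping the intermediate statistics from overflowing in half precision.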

Next, the GPT2ParallelSelfAttention class is instantiated, along with the remaining sub-modules:
```python
        # Self attention.
        self.attention = GPT2ParallelSelfAttention(
            hidden_size,
            num_attention_heads,
            attention_dropout_prob,
            output_dropout_prob,
            init_method,
            output_layer_init_method=output_layer_init_method,
            query_window=query_window,
            key_window_times=key_window_times)

        # Layernorm after the self attention (again constraining the values).
        self.post_attention_layernorm = LayerNorm(hidden_size,
                                                  eps=layernorm_epsilon)

        self.scale_normalization = scale_normalization
        # Whether to additionally normalize the attention output and the
        # MLP output (the "third" and "fourth" LayerNorms).
        if scale_normalization:
            self.third_layernorm = LayerNorm(hidden_size,
                                             eps=layernorm_epsilon)
            self.fourth_layernorm = LayerNorm(hidden_size,
                                              eps=layernorm_epsilon)

        # MLP
        self.mlp = GPT2ParallelMLP(
            hidden_size,
            output_dropout_prob,
            init_method,
            output_layer_init_method=output_layer_init_method)
```
```python
    def forward(self, hidden_states, ltor_mask, pivot_idx=None, is_sparse=0, mem=None):
        # hidden_states: [b, s, h], the previous layer's output is this layer's input
        # ltor_mask: [1, 1, s, s], the attention mask matrix

        # Layer norm at the beginning of the transformer layer.
        layernorm_output1 = self.input_layernorm(hidden_states)
        # The memory, if present, also goes through the same LayerNorm.
        mem = self.input_layernorm(mem) if mem is not None else None
```
The output again has size [b, s, h].
```python
        # Self attention.
        attention_output = self.attention(layernorm_output1, ltor_mask, pivot_idx, is_sparse, mem)
```
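GPT2ParallelSelfAttention itself is covered in the linked post; conceptually it computes masked scaled dot-product attention. The following is my own single-head, single-device sketch of that computation (ignoring model parallelism, the sparse branches and the memory), just to fix the shapes in mind:

```python
import math
import torch

def masked_self_attention(h, wq, wk, wv, ltor_mask):
    # h: (b, s, d); ltor_mask: (1, 1, s, s) with 1 = visible, 0 = masked
    q, k, v = wq(h), wk(h), wv(h)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (b, s, s)
    scores = scores.unsqueeze(1)                              # (b, 1, s, s)
    scores = scores * ltor_mask - 1e4 * (1.0 - ltor_mask)     # hide future tokens
    probs = torch.softmax(scores, dim=-1)
    return (probs @ v.unsqueeze(1)).squeeze(1)                # back to (b, s, d)

b, s, d = 2, 6, 8
h = torch.randn(b, s, d)
mask = torch.tril(torch.ones(1, 1, s, s))
wq, wk, wv = (torch.nn.Linear(d, d) for _ in range(3))
out = masked_self_attention(h, wq, wk, wv, mask)  # (b, s, d)
```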
If scale_normalization is enabled, the attention output goes through the third LayerNorm:
```python
        # Third LayerNorm.
        if self.scale_normalization:
            attention_output = self.third_layernorm(attention_output)
```
Then the first residual connection: the layer input and the attention output are added together.
```python
        # Residual connection.
        layernorm_input = hidden_states + attention_output
```
The sum is then normalized again:
```python
        # Layer norm post the self attention.
        layernorm_output = self.post_attention_layernorm(layernorm_input)
```
Finally the MLP applies a nonlinear transformation, h -> 4*h -> h (a sketch of its structure follows the snippet below):
```python
        # MLP.
        mlp_output = self.mlp(layernorm_output)
```
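GPT2ParallelMLP is not shown in this post. Assuming the standard GPT-2 design (and Megatron-style layer names, which is my assumption here), a non-parallel sketch of the h -> 4h -> h structure looks like this:

```python
import torch

class SimpleGPT2MLP(torch.nn.Module):
    """Non-parallel sketch of the transformer MLP: h -> 4h -> GeLU -> h."""
    def __init__(self, hidden_size, dropout_prob):
        super().__init__()
        self.dense_h_to_4h = torch.nn.Linear(hidden_size, 4 * hidden_size)
        self.dense_4h_to_h = torch.nn.Linear(4 * hidden_size, hidden_size)
        self.dropout = torch.nn.Dropout(dropout_prob)

    def forward(self, x):
        x = torch.nn.functional.gelu(self.dense_h_to_4h(x))
        return self.dropout(self.dense_4h_to_h(x))
```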
If scale_normalization is enabled, the MLP output goes through the fourth LayerNorm, followed by the second residual connection:
```python
        # Fourth LayerNorm.
        if self.scale_normalization:
            mlp_output = self.fourth_layernorm(mlp_output)

        # Second residual connection.
        output = layernorm_input + mlp_output

        return output
```
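To wrap up, here is a hypothetical way to run one forward pass through the layer. The import path and the init_method lambda are my assumptions, and the real class additionally expects the Megatron/CogView model-parallel environment to have been initialized, so treat this purely as a shape illustration:

```python
import torch
from mpu.transformer import GPT2ParallelTransformerLayer  # import path is an assumption

b, s, h, n = 2, 16, 1024, 16
layer = GPT2ParallelTransformerLayer(
    hidden_size=h,
    num_attention_heads=n,
    attention_dropout_prob=0.1,
    output_dropout_prob=0.1,
    layernorm_epsilon=1e-5,
    # hypothetical init_method: initializes a weight tensor in place
    init_method=lambda w: torch.nn.init.normal_(w, mean=0.0, std=0.02),
)

hidden_states = torch.randn(b, s, h)
# Lower-triangular mask: position i may only attend to positions <= i
ltor_mask = torch.tril(torch.ones(1, 1, s, s))

output = layer(hidden_states, ltor_mask)  # output: (b, s, h)
```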
Feel free to point out any mistakes in the comments. Thank you!