I'm just a beginner writing this as a study note for myself, in the hope that it also helps other newcomers; corrections from more experienced readers are very welcome. If anything here infringes a copyright, I'll remove it immediately.

Input: the hidden state, with shape (b, s, h); the final output has the same shape (b, s, h).
If a memory input (mem) is provided, it is also passed through LayerNorm and fed into the self attention.
In short, LayerNorm normalizes all the features of each individual sample. This keeps the activations of the current layer in a stable range, helps avoid vanishing or exploding gradients, and makes the following layers easier to train.
Formula (where $\epsilon$ keeps the denominator from being zero):

$$y=\frac{x-\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}$$
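To make the formula concrete, here is a small sketch of my own (not code from CogView) that normalizes over the feature dimension by hand and checks the result against torch.nn.LayerNorm:

```python
import torch

x = torch.randn(2, 4, 8)  # (b, s, h)

# Manual LayerNorm over the feature dimension h
eps = 1e-5
mean = x.mean(dim=-1, keepdim=True)                 # E[x]
var = x.var(dim=-1, unbiased=False, keepdim=True)   # Var[x]
y_manual = (x - mean) / torch.sqrt(var + eps)

# Built-in LayerNorm (elementwise_affine=False drops the learnable gamma/beta)
layer_norm = torch.nn.LayerNorm(8, eps=eps, elementwise_affine=False)
y_builtin = layer_norm(x)

print(torch.allclose(y_manual, y_builtin, atol=1e-6))  # True
```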
For details on the attention part, see the post CogView中的Self attention (tt丫的博客-CSDN博客).
For an introduction to the residual structure, see 深度学习之Resnet详解|CSDN创作打卡 (tt丫的博客-CSDN博客).
Residual connections are used to address the degradation problem of deep networks.
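As a quick reminder of what a residual connection is, here is a minimal illustration of my own (the sublayer is just a placeholder Linear):

```python
import torch

def residual_block(x, sublayer):
    # output = input + F(input); gradients can always flow through the identity path
    return x + sublayer(x)

x = torch.randn(2, 4, 8)
out = residual_block(x, torch.nn.Linear(8, 8))
print(out.shape)  # torch.Size([2, 4, 8])
```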
```python
class GPT2ParallelTransformerLayer(torch.nn.Module):
    """A single layer transformer for GPT2.

    We use the following notation:
        h: hidden size
        n: number of attention heads
        b: batch size
        s: sequence length
    Transformer layer takes input with size [b, s, h] and returns an
    output of the same size.

    Arguments:
        hidden_size: The hidden size of the self attention.
        num_attention_heads: number of attention heads in the self
            attention.
        attention_dropout_prob: dropout probability of the attention
            score in self attention.
        output_dropout_prob: dropout probability for the outputs
            after self attention and final output.
        layernorm_epsilon: epsilon used in layernorm to avoid
            division by zero.
        init_method: initialization method used for the weights. Note
            that all biases are initialized to zero and layernorm
            weights are initialized to one.
        output_layer_init_method: output layers (attention output and
            mlp output) initialization. If None, use `init_method`.
    """
    def __init__(self,
                 hidden_size,
                 num_attention_heads,
                 attention_dropout_prob,
                 output_dropout_prob,
                 layernorm_epsilon,
                 init_method,
                 output_layer_init_method=None,
                 query_window=128,
                 key_window_times=6,
                 scale_normalization=True
                 ):
        super(GPT2ParallelTransformerLayer, self).__init__()
```
Output layers: if output_layer_init_method is None, the output-layer weights fall back to init_method for initialization.
Input: the LayerNorm class is called to normalize the layer input.
```python
        # Set output layer initialization if not provided.
        if output_layer_init_method is None:
            output_layer_init_method = init_method

        # Layernorm on the input data.
        self.input_layernorm = LayerNorm(hidden_size, eps=layernorm_epsilon)
```
The LayerNorm class: it first scales the input values down and then applies fused layer normalization.
```python
from apex.normalization.fused_layer_norm import FusedLayerNorm

class LayerNorm(FusedLayerNorm):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def forward(self, x):
        # Divide the input by 1/8 of its maximum absolute value, then normalize.
        return super().forward(x / (x.abs().max().detach() / 8))
```
Here the class inherits from FusedLayerNorm in apex.normalization.fused_layer_norm, and its forward divides all input values by one eighth of the input's maximum absolute value before calling the parent's forward (see apex.normalization.fused_layer_norm — Apex 0.1.0 documentation for the base class).
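If apex is not available, the same wrapper can be sketched on top of torch.nn.LayerNorm; this is my own equivalent for illustration, not the repository's code:

```python
import torch

class ScaledLayerNorm(torch.nn.LayerNorm):
    """Sketch of CogView's LayerNorm wrapper, built on torch.nn.LayerNorm
    instead of apex's FusedLayerNorm (my assumption, for environments without apex)."""
    def forward(self, x):
        # Divide the whole tensor by 1/8 of its maximum absolute value,
        # detached so the scaling factor itself receives no gradient.
        return super().forward(x / (x.abs().max().detach() / 8))

ln = ScaledLayerNorm(8, eps=1e-5)
out = ln(torch.randn(2, 4, 8))  # same shape (b, s, h)
```

Note that layer normalization is (up to the effect of $\epsilon$) invariant to rescaling its input, so this pre-division barely changes the mathematical result; its point is numerical safety, keeping the intermediate statistics from overflowing in half precision.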

Next, the GPT2ParallelSelfAttention class is instantiated, along with the remaining sub-modules:
```python
        # Self attention.
        self.attention = GPT2ParallelSelfAttention(
            hidden_size,
            num_attention_heads,
            attention_dropout_prob,
            output_dropout_prob,
            init_method,
            output_layer_init_method=output_layer_init_method,
            query_window=query_window,
            key_window_times=key_window_times)

        # Layernorm after the self attention (again constraining the values).
        self.post_attention_layernorm = LayerNorm(hidden_size,
                                                  eps=layernorm_epsilon)

        self.scale_normalization = scale_normalization
        # Whether to additionally normalize the attention output and the
        # MLP output (the "third" and "fourth" LayerNorms).
        if scale_normalization:
            self.third_layernorm = LayerNorm(hidden_size,
                                             eps=layernorm_epsilon)
            self.fourth_layernorm = LayerNorm(hidden_size,
                                              eps=layernorm_epsilon)

        # MLP
        self.mlp = GPT2ParallelMLP(
            hidden_size,
            output_dropout_prob,
            init_method,
            output_layer_init_method=output_layer_init_method)
```
```python
    def forward(self, hidden_states, ltor_mask, pivot_idx=None, is_sparse=0, mem=None):
        # hidden_states: [b, s, h], the previous layer's output is this layer's input
        # ltor_mask: [1, 1, s, s], the attention mask matrix

        # Layer norm at the beginning of the transformer layer.
        layernorm_output1 = self.input_layernorm(hidden_states)
        # The memory, if present, also goes through the same LayerNorm.
        mem = self.input_layernorm(mem) if mem is not None else None
```
The output again has size [b, s, h].
```python
        # Self attention.
        attention_output = self.attention(layernorm_output1, ltor_mask, pivot_idx, is_sparse, mem)
```
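GPT2ParallelSelfAttention itself is covered in the linked post; conceptually it computes masked scaled dot-product attention. The following is my own single-head, single-device sketch of that computation (ignoring model parallelism, the sparse branches and the memory), just to fix the shapes in mind:

```python
import math
import torch

def masked_self_attention(h, wq, wk, wv, ltor_mask):
    # h: (b, s, d); ltor_mask: (1, 1, s, s) with 1 = visible, 0 = masked
    q, k, v = wq(h), wk(h), wv(h)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (b, s, s)
    scores = scores.unsqueeze(1)                              # (b, 1, s, s)
    scores = scores * ltor_mask - 1e4 * (1.0 - ltor_mask)     # hide future tokens
    probs = torch.softmax(scores, dim=-1)
    return (probs @ v.unsqueeze(1)).squeeze(1)                # back to (b, s, d)

b, s, d = 2, 6, 8
h = torch.randn(b, s, d)
mask = torch.tril(torch.ones(1, 1, s, s))
wq, wk, wv = (torch.nn.Linear(d, d) for _ in range(3))
out = masked_self_attention(h, wq, wk, wv, mask)  # (b, s, d)
```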
If scale_normalization is enabled, the attention output goes through the third LayerNorm:
```python
        # Third LayerNorm.
        if self.scale_normalization:
            attention_output = self.third_layernorm(attention_output)
```
Then the first residual connection: the layer input and the attention output are added together.
```python
        # Residual connection.
        layernorm_input = hidden_states + attention_output
```
The sum is then normalized again:
```python
        # Layer norm post the self attention.
        layernorm_output = self.post_attention_layernorm(layernorm_input)
```
Finally the MLP applies a nonlinear transformation, h -> 4*h -> h (a sketch of its structure follows the snippet below):
```python
        # MLP.
        mlp_output = self.mlp(layernorm_output)
```
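GPT2ParallelMLP is not shown in this post. Assuming the standard GPT-2 design (and Megatron-style layer names, which is my assumption here), a non-parallel sketch of the h -> 4h -> h structure looks like this:

```python
import torch

class SimpleGPT2MLP(torch.nn.Module):
    """Non-parallel sketch of the transformer MLP: h -> 4h -> GeLU -> h."""
    def __init__(self, hidden_size, dropout_prob):
        super().__init__()
        self.dense_h_to_4h = torch.nn.Linear(hidden_size, 4 * hidden_size)
        self.dense_4h_to_h = torch.nn.Linear(4 * hidden_size, hidden_size)
        self.dropout = torch.nn.Dropout(dropout_prob)

    def forward(self, x):
        x = torch.nn.functional.gelu(self.dense_h_to_4h(x))
        return self.dropout(self.dense_4h_to_h(x))
```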
If scale_normalization is enabled, the MLP output goes through the fourth LayerNorm, followed by the second residual connection:
```python
        # Fourth LayerNorm.
        if self.scale_normalization:
            mlp_output = self.fourth_layernorm(mlp_output)

        # Second residual connection.
        output = layernorm_input + mlp_output

        return output
```
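To wrap up, here is a hypothetical way to run one forward pass through the layer. The import path and the init_method lambda are my assumptions, and the real class additionally expects the Megatron/CogView model-parallel environment to have been initialized, so treat this purely as a shape illustration:

```python
import torch
from mpu.transformer import GPT2ParallelTransformerLayer  # import path is an assumption

b, s, h, n = 2, 16, 1024, 16
layer = GPT2ParallelTransformerLayer(
    hidden_size=h,
    num_attention_heads=n,
    attention_dropout_prob=0.1,
    output_dropout_prob=0.1,
    layernorm_epsilon=1e-5,
    # hypothetical init_method: initializes a weight tensor in place
    init_method=lambda w: torch.nn.init.normal_(w, mean=0.0, std=0.02),
)

hidden_states = torch.randn(b, s, h)
# Lower-triangular mask: position i may only attend to positions <= i
ltor_mask = torch.tril(torch.ones(1, 1, s, s))

output = layer(hidden_states, ltor_mask)  # output: (b, s, h)
```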
Feel free to point out any mistakes in the comments. Thank you!