Hung-yi Lee Machine Learning Notes: Transformer

1. Introduction

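What is a transformer? It is a seq2seq model. As the figure above shows, the input and output sequence lengths are not fixed; the model decides the output length by itself.
Speech translation means feeding in a speech signal directly, for example English audio, and getting the translated Chinese text directly as output.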

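seq2seq is now a very widely used model that can be applied to many NLP tasks, such as semantic analysis, semantic classification, and chatbots. Another use worth mentioning is multi-label classification.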

multi-label classification

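Multi-label classification and multi-class classification are completely different: the latter assigns each sample to exactly one of several classes, while in the former one sample can carry several labels at once. Multi-label problems can nevertheless be solved with a seq2seq model.
Think about how you would approach a multi-label classification problem.
A common first idea is to output a probability for every class and then apply a cutoff, for example keeping the 3 highest-scoring classes, so that every sample ends up with several labels.
The problem is that some samples have only one label while others have three, so a fixed cutoff like this does not fundamentally solve the problem.
The approach used instead is to brute-force it with seq2seq: feed in an article and let the output be the sequence of classes, with the number of output classes decided by the model itself.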
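To make the difference concrete, here is a small sketch in plain Python. The label names and probabilities are made up for illustration, and `top_k`, `above_threshold`, and `to_target_sequence` are hypothetical helpers, not part of any library. It shows that a fixed top-k cutoff always returns the same number of labels, and how a label set could be written as a variable-length target sequence for a seq2seq model.

```python
# Hypothetical per-class probabilities for two articles (made-up numbers).
LABELS = ["sports", "politics", "tech", "finance"]
probs_a = [0.91, 0.05, 0.08, 0.03]  # this article really has one label
probs_b = [0.72, 0.10, 0.81, 0.64]  # this article really has three labels


def top_k(probs, k):
    """Always return exactly k labels, however confident the model is."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [LABELS[i] for i in ranked[:k]]


def above_threshold(probs, threshold):
    """Return every label whose probability exceeds a hand-picked threshold."""
    return [LABELS[i] for i, p in enumerate(probs) if p > threshold]


def to_target_sequence(labels, end_token="<END>"):
    """Write a label set as a seq2seq target: the labels followed by an end token."""
    return list(labels) + [end_token]


print(top_k(probs_a, 3))             # ['sports', 'tech', 'politics'] (two spurious labels)
print(above_threshold(probs_a, 0.5)) # ['sports']
print(above_threshold(probs_b, 0.5)) # ['sports', 'tech', 'finance']
print(to_target_sequence(["sports"]))            # ['sports', '<END>']
print(to_target_sequence(["tech", "sports"]))    # ['tech', 'sports', '<END>']
```

With the seq2seq formulation, the position of the end token in the output is what determines how many labels a sample gets, so no cutoff has to be chosen by hand.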

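OK, let's now formally look at what seq2seq is. A complete seq2seq model usually consists of an encoder and a decoder. The right side of the figure above shows the transformer architecture: the left half is the encoder and the right half is the decoder.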

The encoder's job is to take a row of vectors as input and output another row of vectors. An RNN, a CNN, or self-attention can all do this, but the transformer uses self-attention.

encoder

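The earlier figure is rather complex, so we use a simpler figure to explain the encoder.
As shown above, an encoder is made up of many blocks. Note that each block is not a single layer but several layers. One block may look like the right side of the figure: the input goes through self-attention and then through an FC (fully connected) layer to produce the output.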

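In the original transformer paper, each block does something more elaborate. On top of self-attention it adds a residual connection. What does that mean? Self-attention already gives one output per input, and that output takes the context of the whole sequence into account. We then add the original input to this output; this idea is the residual connection.
After the residual connection comes layer norm. Layer norm is simple: it takes a vector [x1, x2, …, xk] and outputs another vector of the same size. It computes the mean and standard deviation of the input's elements and normalizes with them, which is very close to a z-score.
After layer norm, the output is passed through an FC layer and added to the FC layer's own input (a second residual connection), and finally goes through one more norm layer; only then do we have the output of one encoder block. The leftmost and rightmost figures can be read together.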
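The block described above can be sketched in PyTorch roughly as follows. This is a minimal illustration rather than the paper's or PyTorch's exact implementation: the class name `EncoderBlock` and the small sizes (`d_model=64`, `nhead=4`, `dim_feedforward=128`) are assumptions chosen for the example, and dropout is left out.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """One encoder block: self-attention -> Add & Norm -> FC -> Add & Norm."""

    def __init__(self, d_model=64, nhead=4, dim_feedforward=128):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Linear(dim_feedforward, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)  # each position attends to the whole sequence
        x = self.norm1(x + attn_out)           # residual connection, then layer norm
        x = self.norm2(x + self.ffn(x))        # FC network, residual connection, layer norm
        return x


block = EncoderBlock()
x = torch.rand(2, 5, 64)   # batch of 2 sequences, 5 vectors each
y = block(x)
print(y.shape)             # torch.Size([2, 5, 64])
```

The output has the same shape as the input, which is what allows several such blocks to be stacked one after another.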

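The process above can now be matched against the more complex figure from before. That figure has an extra positional encoding, because self-attention by itself carries no position information, so it has to be added; you can review the self-attention material for details. The Add & Norm in the figure is the residual connection plus layer norm, and feed forward is a fully connected network. The figure also stresses that the attention is multi-head attention.
Note that this is only the encoder architecture as described in the original transformer paper; the order of some modules can also be changed.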

decoder

OK, next we cover the decoder.

There are two main kinds of decoder. We start with the autoregressive (AT) decoder.

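After the encoder we get a row of intermediate vectors, which are fed into the decoder to produce the output. Note that before generation starts we feed in a special begin token, and the output finishes with a special end token. The model has to learn to handle these two tokens itself, and this is how it can decide the output length on its own.
Here the decoder takes its output at the previous time step as its input at the next time step.
This can lead to error propagation: one wrong step makes every later step wrong.
There are ways to handle error propagation, but we ignore the problem for now.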
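The generation loop just described can be sketched in plain Python. The lookup table `TOY_NEXT_TOKEN` and the function `toy_decoder_step` are hypothetical stand-ins for a trained decoder (a real one would also use the encoder outputs and all previously generated tokens); only the loop structure is the point here.

```python
BEGIN, END = "<BEGIN>", "<END>"

# Hypothetical stand-in for a trained decoder: maps the last token to the next one.
TOY_NEXT_TOKEN = {BEGIN: "machine", "machine": "learning", "learning": END}


def toy_decoder_step(encoder_outputs, generated):
    """Pretend decoder: look only at the most recent token."""
    return TOY_NEXT_TOKEN[generated[-1]]


def greedy_decode(encoder_outputs, max_len=10):
    generated = [BEGIN]                     # generation starts from the begin token
    for _ in range(max_len):                # max_len guards against never emitting END
        next_token = toy_decoder_step(encoder_outputs, generated)
        generated.append(next_token)        # the output becomes part of the next input
        if next_token == END:               # the model itself decides when to stop
            break
    return generated[1:]                    # drop the begin token


print(greedy_decode(encoder_outputs=None))  # ['machine', 'learning', '<END>']
```

Because each step consumes what the previous step produced, a wrong token early on is fed back in and can derail everything after it, which is the error propagation problem mentioned above.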

Setting the encoder aside for the moment, the decoder looks like the figure above.

Comparing the encoder and the decoder, the two are actually very similar. Only two parts differ: one is the part covered by the mosaic, and the other is the masked multi-head attention.

masked multi-head attention

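The difference between self-attention and masked self-attention is this. In ordinary self-attention, when we produce b1, b2, b3, b4 from a1, a2, a3, a4, each output, say b2, takes the information of all of a1, a2, a3, a4 into account.
In masked self-attention, however, producing b2 may only use a1 and a2; it cannot use a3 or a4.
Why is it designed this way? Think about how the decoder works: outputs are produced one at a time, so at each step only the outputs generated so far are available.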
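A small PyTorch sketch can check this property directly. For simplicity it assumes single-head attention with identity q/k/v projections (a real layer would use learned projection matrices), and `masked_self_attention` is a helper written just for this example.

```python
import torch


def masked_self_attention(a):
    """Causal self-attention on a (seq_len, d) tensor, with identity q/k/v projections."""
    seq_len, d = a.shape
    q, k, v = a, a, a
    scores = q @ k.T / d ** 0.5                                   # (seq_len, seq_len)
    # True above the diagonal marks "future" positions that must not be seen.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))            # softmax turns -inf into 0
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


torch.manual_seed(0)
a = torch.rand(4, 8)                 # a1..a4
b, weights = masked_self_attention(a)
print(weights[1])                    # row for b2: zero weight on a3 and a4

# Changing a3 and a4 must leave b2 untouched.
a_changed = a.clone()
a_changed[2:] = torch.rand(2, 8)
b_changed, _ = masked_self_attention(a_changed)
print(torch.allclose(b[1], b_changed[1]))   # True
```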

Next we cover the other kind of decoder, the non-autoregressive (NAT) decoder.

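As mentioned, an AT decoder generates its outputs one at a time, while a NAT decoder generates all of its outputs in one shot, begin and end tokens included.
This raises a question: wasn't the output length supposed to be variable? What do we do if the NAT output length is fixed?
There are two approaches. One is to train a separate predictor that predicts the output length. The other is to let the decoder emit a long output and look for the end token in it; everything after end is ignored, as if it had never been output.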
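The second approach amounts to a simple post-processing step, sketched below in plain Python. The token strings are made up, and `truncate_at_end` is a hypothetical helper for illustration.

```python
END = "<END>"


def truncate_at_end(tokens):
    """Keep only the tokens before the first end token; ignore everything after it."""
    if END in tokens:
        return tokens[: tokens.index(END)]
    return tokens  # no end token was produced: keep the whole fixed-length output


# A NAT decoder emits a fixed number of positions in one shot (here 5).
nat_output = ["machine", "learning", END, "learning", "machine"]
print(truncate_at_end(nat_output))  # ['machine', 'learning']
```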

How the encoder and decoder pass information: cross attention

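OK, now let's cover how information passes between the encoder and the decoder, which is the part that was covered by the mosaic earlier. This process is called cross attention. In the figure above, the two arrows on the left come from the encoder and the one on the right comes from the decoder.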

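Concretely, the encoder on the left produces output vectors a1, a2, a3, and so on, which are turned into keys k and values v just as in self-attention. On the right, the decoder's masked self-attention output is turned into a query q. The attention scores $\alpha_{1}'$, $\alpha_{2}'$, … are computed between q and each k; each score is multiplied by its corresponding value (v1, v2, …) and the products are summed to give the output v, which then goes into the FC layer. This process is called cross attention.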
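Here is a minimal single-head sketch of that computation in PyTorch. The random matrices `W_q`, `W_k`, `W_v` stand in for learned projections, and the scaling by the square root of the dimension follows the usual scaled dot-product convention; both are assumptions of this example rather than something spelled out in the notes.

```python
import torch

torch.manual_seed(0)
d = 8
enc_out = torch.rand(3, d)     # a1, a2, a3 from the encoder
dec_state = torch.rand(1, d)   # masked self-attention output for one decoder position

# Stand-ins for the learned projection matrices.
W_q, W_k, W_v = (torch.rand(d, d) for _ in range(3))

q = dec_state @ W_q            # query comes from the decoder: (1, d)
k = enc_out @ W_k              # keys come from the encoder:   (3, d)
v = enc_out @ W_v              # values come from the encoder: (3, d)

alpha = torch.softmax(q @ k.T / d ** 0.5, dim=-1)  # attention scores over a1..a3: (1, 3)
out = alpha @ v                # weighted sum of the values: (1, d)

print(alpha.shape, out.shape)  # torch.Size([1, 3]) torch.Size([1, 8])
```

The only difference from self-attention is where the inputs come from: q is computed from the decoder side, while k and v are computed from the encoder outputs.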

train

OK, with the encoder and decoder covered, we need to talk about training.

This is much like ordinary classification: cross entropy is used as the loss function.

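When training the decoder, we feed it the correct answer as input; this is called teacher forcing.
This brings back the problem mentioned earlier. During training the decoder's input is the correct answer, but at test time no correct answer is available as input, so the decoder is prone to one wrong step making every later step wrong. This mismatch is called exposure bias. One possible direction for mitigating it is scheduled sampling: during training, occasionally feed the decoder some incorrect inputs so that it learns to cope with them. It is as simple as that.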
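The following sketch shows how teacher forcing lines up the decoder input with the labels. The token ids, the `BOS`/`EOS` ids, and the vocabulary size are made up, and random logits stand in for the output of a real decoder.

```python
import torch
import torch.nn.functional as F

BOS, EOS = 0, 1                      # hypothetical ids for the begin and end tokens
target = torch.tensor([5, 7, 9])     # ground-truth token ids (made up)

# Teacher forcing: the decoder sees the correct answer, shifted right by one position.
decoder_input = torch.cat([torch.tensor([BOS]), target])   # [BOS, 5, 7, 9]
labels = torch.cat([target, torch.tensor([EOS])])          # [5, 7, 9, EOS]

vocab_size = 10
torch.manual_seed(0)
# Stand-in for the decoder's output: one score per vocabulary entry at each position.
logits = torch.rand(len(decoder_input), vocab_size)

# Cross entropy treats every position as an ordinary classification over the vocabulary.
loss = F.cross_entropy(logits, labels)
print(decoder_input.tolist(), labels.tolist())  # [0, 5, 7, 9] [5, 7, 9, 1]
print(loss.item())
```

Scheduled sampling would change only the construction of `decoder_input`, replacing some ground-truth tokens with the model's own predictions.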

2. Transformer code implementation

2.1 Requirements

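Suppose we have a dataset whose first 10 columns are features and whose last column is the label. We will now implement a transformer for this classification task.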

2.2 Implementation

2.2.1 How to implement positional encoding
```
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        # max_len is the longest sequence length supported; it can be set generously
        super().__init__()
        encoding = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        # 1 / 10000^(2i / d_model) for each even dimension index 2i
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        encoding[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        encoding[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # A buffer moves with the module across devices but is not trained
        self.register_buffer("encoding", encoding.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding of the first seq_len positions
        return x + self.encoding[:, :x.size(1)]


# Example usage
d_model = 512
max_len = 100
positional_encoding = PositionalEncoding(d_model, max_len)

input_sequence = torch.rand(1, max_len, d_model)
output_sequence = positional_encoding(input_sequence)
print(input_sequence.shape)
print(output_sequence.shape)
# output
# torch.Size([1, 100, 512])
# torch.Size([1, 100, 512])
```
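For reference, the code above computes the sinusoidal positional encoding from the original transformer paper, where $pos$ is the position in the sequence and $i$ indexes pairs of dimensions:

$$
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

The `div_term` in the code is the factor $1 / 10000^{2i/d_{model}}$, written as $\exp\left(-2i \cdot \ln(10000) / d_{model}\right)$.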
2.2.2 Transformer,TransformerEncoder,TransformerEncoderLayer

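torch.nn provides three related classes. First, a brief summary of how they differ.
Overall, Transformer is the whole model, TransformerEncoder is the encoder part of the model, and TransformerEncoderLayer is a single layer within the encoder. The design stacks multiple encoder layers to capture complex relationships in the input sequence while keeping the model parallelizable.
Let's look at them one by one.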

Transformer

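Its constructor arguments are:
d_model: the feature dimension of the input; default 512.
nhead: the number of heads in the multi-head attention modules; default 8. d_model must be divisible by nhead.
num_encoder_layers: the number of sub-encoder layers in the encoder; default 6.
num_decoder_layers: the number of sub-decoder layers in the decoder; default 6 as well.
dropout: the dropout probability; default 0.1.
activation: the activation function, relu or gelu; default relu.
custom_encoder and custom_decoder: optional user-defined encoder and decoder modules.
batch_first: whether inputs are laid out with batch_size first or seq_len first; the default False means (seq_len, batch_size, d_model).
norm_first: if True, LayerNorm is applied before the attention and feedforward operations in each layer; the default is False, meaning LayerNorm is applied after them.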
```
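import torch
import torch.nn as nn

# d_model defaults to 512, so nhead=16 gives 512 / 16 = 32 dimensions per head
transformer_model = nn.Transformer(nhead=16, num_encoder_layers=1, num_decoder_layers=1)
print(transformer_model)
# batch_first defaults to False, so tensors are (seq_len, batch_size, d_model)
src = torch.rand((10, 32, 512))  # source: length 10, batch size 32
tgt = torch.rand((20, 32, 512))  # target: length 20, batch size 32
out = transformer_model(src, tgt)
print(out.shape)  # same shape as tgt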
```
```
Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (layers): ModuleList(
      (0): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (multihead_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm3): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
        (dropout3): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
)
torch.Size([20, 32, 512])
```
As you can see, a standard Transformer contains both an encoder and a decoder, so it is a fairly complete model.
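As a variation, the sketch below uses `batch_first=True` and passes a causal mask for the decoder, which corresponds to the masked self-attention from the first part of these notes. The small sizes are arbitrary choices to keep the example fast, not recommended settings.

```python
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=64, nhead=4,
    num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=128, batch_first=True,
)

src = torch.rand(32, 10, 64)   # (batch_size, src_len, d_model) because batch_first=True
tgt = torch.rand(32, 20, 64)   # (batch_size, tgt_len, d_model)

# Float mask with -inf above the diagonal: position i cannot attend to positions after i.
tgt_mask = model.generate_square_subsequent_mask(20)

out = model(src, tgt, tgt_mask=tgt_mask)
print(tgt_mask.shape)  # torch.Size([20, 20])
print(out.shape)       # torch.Size([32, 20, 64])
```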

TransformerEncoder

This one is simple: you normally create a TransformerEncoderLayer instance first and then pass it to TransformerEncoder, which stacks num_layers copies of it.
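A short sketch of that two-step construction, using the same input shape as the other examples in this section; `num_layers=2` is an arbitrary choice to keep it quick.

```python
import torch
import torch.nn as nn

# Step 1: define a single encoder layer.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
# Step 2: stack num_layers copies of it into a full encoder.
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

src = torch.rand(10, 32, 512)   # (seq_len, batch_size, d_model)
out = transformer_encoder(src)
print(out.shape)                # torch.Size([10, 32, 512])
```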

TransformerEncoderLayer

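TransformerEncoderLayer mainly consists of self-attention and a feedforward network.
Its parameters are:

d_model: the feature dimension of the input.
nhead: the number of heads in the multi-head attention.
dim_feedforward: the hidden dimension of the feedforward network (default 2048).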
```
import torch
import torch.nn as nn

# dim_feedforward=1000 sets the hidden size of the feedforward network (linear1 / linear2)
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1000)
print(encoder_layer)
src = torch.rand(10, 32, 512)  # (seq_len, batch_size, d_model)
out = encoder_layer(src)
print(out.shape)
# output
# TransformerEncoderLayer(
#   (self_attn): MultiheadAttention(
#     (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
#   )
#   (linear1): Linear(in_features=512, out_features=1000, bias=True)
#   (dropout): Dropout(p=0.1, inplace=False)
#   (linear2): Linear(in_features=1000, out_features=512, bias=True)
#   (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
#   (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
#   (dropout1): Dropout(p=0.1, inplace=False)
#   (dropout2): Dropout(p=0.1, inplace=False)
# )
# torch.Size([10, 32, 512])
```
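Coming back to the requirement in 2.1 (10 feature columns plus one label column), one possible way to put the pieces together is sketched below. Everything specific here is an assumption made for illustration: the data is random, there are 3 classes, each scalar feature is treated as one "token" in a length-10 sequence, a learned position embedding is used in place of the sinusoidal one, and the class name `TabularTransformerClassifier` is made up.

```python
import torch
import torch.nn as nn


class TabularTransformerClassifier(nn.Module):
    def __init__(self, num_features=10, num_classes=3, d_model=32, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)  # each scalar feature -> a d_model vector
        # Learned position embedding, one per feature column (an assumption of this sketch).
        self.pos = nn.Parameter(torch.zeros(1, num_features, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x: (batch, num_features)
        tokens = self.embed(x.unsqueeze(-1)) + self.pos  # (batch, num_features, d_model)
        encoded = self.encoder(tokens)                   # (batch, num_features, d_model)
        return self.head(encoded.mean(dim=1))            # mean-pool, then (batch, num_classes)


# Synthetic dataset: first 10 columns are features, the last column is the label.
torch.manual_seed(0)
data = torch.rand(64, 11)
data[:, -1] = torch.randint(0, 3, (64,)).float()
features, labels = data[:, :10], data[:, 10].long()

model = TabularTransformerClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):  # a few steps only, to show the training loop
    optimizer.zero_grad()
    logits = model(features)          # (64, 3)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    print(epoch, loss.item())
```

Only the encoder is used because classification needs a single prediction per sample rather than a generated sequence, so no decoder is required.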
Original article: https://blog.csdn.net/weixin_43249038/article/details/133216521
