[Transformer Series] An Accessible, In-Depth Guide to Positional Encoding

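I. References

Paper: Attention Is All You Need
A Thorough Guide to Positional Encoding in the Transformer
Transformer Architecture: The Positional Encoding
The Annotated Transformer
Master Positional Encoding: Part I
How Should We Understand the Positional Encoding in the Transformer Paper, and What Does It Have to Do with Trigonometric Functions?
Illustrated Transformer Series, Part 1: Positional Encoding
A Detailed Explanation of the Transformer's Positional Encoding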

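II. Introduction to `Positional Encoding`

1. Introduction

In any language, the position and order of words are crucial to the meaning of a sentence. A traditional RNN is inherently ordered: it processes the words of a sentence one at a time, in sequence, so word order is preserved naturally and needs no extra handling.

The Transformer has no RNN (recurrent neural network) or CNN (convolutional neural network) structure, and all the words of a sentence enter the network at the same time, so the model has no explicit information about the relative or absolute position of a word in the source sentence. To let the model know the position (order) of each word in the sequence, the Transformer paper introduces a technique called Positional Encoding. It adds an extra encoding to each word to represent its position in the sequence, so that the model can work out where words sit relative to one another.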

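2. `position-insensitive`

If a model's output changes when the order of the input text changes, the model is position-sensitive; otherwise it is position-insensitive.

In more precise mathematical terms, let the model be a function $y = f (x)$ whose input is a word sequence $x=\{x_1,x_2,\ldots,x_n\}$ and whose output is a vector $y$. If, for every permutation $x^{'}=\{x_{k_{1}},x_{k_{2}},\ldots,x_{k_{n}}\}$ of $x$, we have
$f(x)=f(x^{'})$
then the model $f$ is position-insensitive.

Among commonly used text models, RNN and textCNN are position-sensitive: when they are used to model text, their structure takes the order of the words into account by construction. The attention-based transformer, in contrast, is position-insensitive. When such a position-insensitive model is used on text, a positional encoding must be added to introduce the order relationships between words.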
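
The definition above can be checked on toy models. The sketch below is only an illustration under stated assumptions: `bag_of_words` (an average of word values) stands in for a position-insensitive function, `tiny_rnn` for a position-sensitive one, and the one-dimensional "embeddings" in `EMBED` are made-up numbers, not values from any real model.

```python
import math

# Toy 1-D "embeddings" (hypothetical values chosen for illustration).
EMBED = {"dog": 1.0, "bites": -0.5, "man": 0.25}

def bag_of_words(tokens):
    """Position-insensitive: an average ignores the order of the tokens."""
    return sum(EMBED[t] for t in tokens) / len(tokens)

def tiny_rnn(tokens, w_h=0.5, w_x=1.0):
    """Position-sensitive: each step mixes the running state with the next token."""
    h = 0.0
    for t in tokens:
        h = math.tanh(w_h * h + w_x * EMBED[t])
    return h

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]  # same words, permuted

print(bag_of_words(a), bag_of_words(b))  # equal: f(x) = f(x')
print(tiny_rnn(a), tiny_rnn(b))          # differ: order changes the output
```

Swapping "dog" and "man" changes the meaning of the sentence, but only the recurrent toy model produces a different output for the two orders.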

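3. The concept of `Positional Encoding`

As a feature extractor, an RNN carries word-order information by construction. The attention mechanism does not take order into account, yet order strongly affects meaning, so position information has to be added to the input embedding through a positional embedding.

In one sentence: Positional Encoding adds (embeds) position information into the word embedding vectors so that the Transformer retains each word's position, which improves the model's ability to understand the sequence.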

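4. Conditions a `Positional Encoding` should satisfy

One earlier approach measures distance by each word's proportional position in the sentence, with the whole sentence scaled to length 1. In "Attention is all you need", for example, the distance between "is" and "you" is 0.5. In a longer text such as "To follow along you will first need to install PyTorch", a distance of 0.5 spans many more words, which is clearly unsuitable.

To summarize, an ideal positional encoding should:

output a unique encoding for each word;
keep the difference between any two words consistent across sentences of different lengths;
produce bounded values.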
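
The problem with length-normalized positions can be made concrete with the two example sentences. This is a small sketch; the helper name `step_size` is introduced here for illustration, and it assumes words are separated by single spaces.

```python
def step_size(sentence):
    """Distance between adjacent words when positions are scaled to [0, 1]."""
    n_words = len(sentence.split())
    return 1.0 / (n_words - 1)

short = "Attention is all you need"
long_ = "To follow along you will first need to install PyTorch"

for s in (short, long_):
    step = step_size(s)
    print(f"{len(s.split()):2d} words: adjacent step = {step:.3f}, "
          f"a gap of 0.5 spans {0.5 / step:.1f} word steps")
```

The same numeric distance of 0.5 corresponds to 2 word steps in the 5-word sentence and 4.5 in the 10-word one, which violates the second condition in the list above.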

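5. Properties of Positional Encoding

Each position has a unique Positional Encoding;
the relationship between two positions can be modeled (obtained) through an affine transformation between their encodings.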

6. Types of `Positional Encoding`

Reference: How to Encode Position Information in Text Elegantly? A Brief Overview of Three Positional Encoding Methods

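6.1 Absolute position encoding

Learned Positional Embedding is the most common absolute position encoding method. It randomly initializes a position embedding for each position, adds it to the word embedding as the model input, and trains it as a parameter.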
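
As a rough sketch of the lookup-and-add step only (training is omitted), the table below plays the role of the learned position embedding. The sizes, the initialization scale of 0.02, and the names `position_table` and `add_learned_positions` are assumptions made for this example.

```python
import random

random.seed(0)
max_len, d_model = 8, 4  # hypothetical sizes

# One vector per position, randomly initialized. In a real model this table
# is a parameter that backpropagation updates; that step is not shown here.
position_table = [[random.gauss(0.0, 0.02) for _ in range(d_model)]
                  for _ in range(max_len)]

def add_learned_positions(word_embeddings):
    """Add the vector for position p to the word embedding at index p."""
    return [[w + p for w, p in zip(word_vec, position_table[pos])]
            for pos, word_vec in enumerate(word_embeddings)]

# Three tokens with identical word embeddings still get distinct inputs.
words = [[1.0, 1.0, 1.0, 1.0]] * 3
out = add_learned_positions(words)
for row in out:
    print(row)
```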

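6.2 Relative position encoding

With absolute position encoding, the positional embeddings of different positions are certainly different. But the facts that position 1 and position 2 are closer together than position 3 and position 10, and that positions 1 and 2 and positions 3 and 4 both differ by exactly 1, are statements about relative position, and they are what relative position encoding aims to capture.

Two methods commonly discussed here are Sinusoidal Positional Encoding and Learned Positional Encoding. Sinusoidal Positional Encoding computes the encoding by applying sine and cosine functions of different frequencies to the positions of the input sequence; Learned Positional Encoding computes it from a set of learnable parameters. Strictly speaking, both assign a vector to each absolute position; the sinusoidal variant is discussed under this heading because its construction lets the encoding at one position be expressed linearly in terms of the encoding at a fixed offset (see Section III).

"Attention is all you need" reports no clear difference in results between Learned Positional Embedding and Sinusoidal Position Encoding. In paper [3] (Encoding Word Order in Complex Embeddings, see 6.3), the experiments show that Complex embedding gives a noticeable improvement over both methods.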

6.3 Other position encodings

`Complex embedding`

Reference paper: Encoding Word Order in Complex Embeddings

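7. Position vectors and word vectors

In general, the position vector and the word vector can be combined either by concatenation or by addition.

input = input_embedding + positional_encoding

Here, input_embedding is produced by an ordinary Embedding layer, which maps each token from the vocab_size-sized vocabulary to a d_model-dimensional vector. Because the two are added, positional_encoding must also be a d_model-dimensional vector. (In the original paper, d_model=512.)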
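
The difference between the two ways of combining the vectors shows up in the output dimension. The sketch below uses made-up four-dimensional vectors; `combine_add` and `combine_concat` are names introduced for this example only.

```python
def combine_add(word_vec, pos_vec):
    """Element-wise sum: both vectors must share the same dimension d_model."""
    assert len(word_vec) == len(pos_vec)
    return [w + p for w, p in zip(word_vec, pos_vec)]

def combine_concat(word_vec, pos_vec):
    """Concatenation: the output dimension is the sum of the two dimensions."""
    return list(word_vec) + list(pos_vec)

word_vec = [0.2, -0.1, 0.4, 0.0]  # hypothetical input_embedding row, d_model = 4
pos_vec = [0.0, 1.0, 0.0, 1.0]    # hypothetical positional_encoding row

added = combine_add(word_vec, pos_vec)
concatenated = combine_concat(word_vec, pos_vec)
print(len(added), len(concatenated))  # 4 8
```

Addition keeps the model width at d_model, which is why positional_encoding has to match the embedding dimension.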

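III. How `Positional Encoding` works

This section uses Sinusoidal Positional Encoding as an example to explain how Positional Encoding works.

1. Analysis

The Transformer paper uses sine and cosine functions to represent absolute positions, and relative position is obtained from products of the two. Because sine and cosine are periodic, they are well suited to expressing the relative positions of words in a sequence.

BERT uses the Transformer, but its position information is learned during training rather than given by sines and cosines. The sinusoidal design reflects the view that the meaning of language depends mainly on relative position and little on absolute position: apart from special cases, a sentence should mean roughly the same whether it appears at the start, in the middle, or at the end of a text. So, with a suitable design, other periodic functions could also be used.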

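For the positional encoding at position pos:
$\begin{aligned}
PE_{(pos,2i)} &= \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) & (1) \\
PE_{(pos,2i+1)} &= \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) & (2)
\end{aligned}$
Here, pos is the position of the token in the sequence; for a sentence of length L, $pos=0,1,\ldots,L-1$. $PE$ is the position vector of the token: $PE_{(pos,2i)}$ is its element at the even index $2i$ and $PE_{(pos,2i+1)}$ is its element at the odd index $2i+1$, where $i=0,1,\ldots,d_{model}/2-1$ indexes the sine/cosine pairs. $d_{model}$ is the dimension of the token vector (512 in the original paper).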
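
Formulas (1) and (2) can be evaluated directly. The following is a minimal pure-Python sketch, assuming an even d_model; the function name `sinusoidal_pe` is introduced here for illustration and is not part of any library.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Encoding of one position: sin at even indices, cos at odd indices."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)      # formula (1), index 2i
        pe[2 * i + 1] = math.cos(angle)  # formula (2), index 2i + 1
    return pe

print(sinusoidal_pe(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(sinusoidal_pe(1, 4))
```

At pos = 0 every angle is 0, so the vector alternates sin(0) = 0 and cos(0) = 1; later positions rotate each pair at its own rate.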

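The formulas show that the positional encoding of a word is built from sine and cosine functions of different frequencies. From the lowest dimensions to the highest, the angular frequency falls from 1 toward $\frac{1}{10000}$, and the wavelength grows from $2\pi$ toward $10000\cdot2\pi$. The benefit of this design is that the positional encoding at position pos+k can be expressed linearly in terms of the encoding at position pos, which reflects their relative position.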
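
To see the range of frequencies concretely, the sketch below prints the angular frequency and wavelength for a few dimension pairs, assuming d_model = 512 as in the original paper. The helper `angular_frequency` is a name chosen for this example.

```python
import math

d_model = 512  # value used in the original paper

def angular_frequency(i, d_model):
    """1 / 10000^(2i / d_model) for the dimension pair (2i, 2i + 1)."""
    return 1.0 / (10000 ** (2 * i / d_model))

for i in (0, 64, 128, 255):
    w = angular_frequency(i, d_model)
    print(f"i={i:3d}  frequency={w:.6f}  wavelength={2 * math.pi / w:.1f}")

longest = 2 * math.pi / angular_frequency(d_model // 2 - 1, d_model)
```

The first pair has wavelength exactly $2\pi$. The last pair uses exponent $(d_{model}-2)/d_{model}$ rather than 1, so its wavelength comes close to, but stays just under, $10000\cdot2\pi$.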

Although Sinusoidal Position Encoding looks complicated, proving that pos+k can be expressed linearly in terms of pos only needs the high-school angle-addition formulas for sine and cosine:
$\begin{aligned}
\sin(\alpha+\beta) &= \sin\alpha\cdot\cos\beta + \cos\alpha\cdot\sin\beta & (3) \\
\cos(\alpha+\beta) &= \cos\alpha\cdot\cos\beta - \sin\alpha\cdot\sin\beta & (4)
\end{aligned}$
For the positional encoding at pos+k:
$\begin{aligned}
PE_{(pos+k,2i)} &= \sin(w_{i}\cdot(pos+k)) = \sin(w_{i}\,pos)\cos(w_{i}k) + \cos(w_{i}\,pos)\sin(w_{i}k) & (5) \\
PE_{(pos+k,2i+1)} &= \cos(w_{i}\cdot(pos+k)) = \cos(w_{i}\,pos)\cos(w_{i}k) - \sin(w_{i}\,pos)\sin(w_{i}k) & (6)
\end{aligned}$
where $w_{i}=\frac{1}{10000^{2i/d_{model}}}$.

Rearranging (5) and (6) slightly gives:
$\begin{aligned}
PE_{(pos+k,2i)} &= \cos(w_{i}k)\,PE_{(pos,2i)} + \sin(w_{i}k)\,PE_{(pos,2i+1)} & (7) \\
PE_{(pos+k,2i+1)} &= \cos(w_{i}k)\,PE_{(pos,2i+1)} - \sin(w_{i}k)\,PE_{(pos,2i)} & (8)
\end{aligned}$
Note that the relative distance k between pos and pos+k is a constant, so:
$\begin{bmatrix} PE_{(pos+k,2i)} \\ PE_{(pos+k,2i+1)} \end{bmatrix} = \begin{bmatrix} u & v \\ -v & u \end{bmatrix} \times \begin{bmatrix} PE_{(pos,2i)} \\ PE_{(pos,2i+1)} \end{bmatrix} \quad (9)$
where $u=\cos(w_{i}\cdot k)$ and $v=\sin(w_{i}\cdot k)$ are constants.

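So each component $2i$ or $2i+1$ of the position vector at $pos+k$ is a linear combination of components $2i$ and $2i+1$ of the position vector at $pos$, with coefficients $\cos(w_{i}k)$ and $\sin(w_{i}k)$ that are themselves components $2i+1$ and $2i$ of the position vector at $k$. This linear relation means that the position vectors carry relative position information, so $PE_{pos+k}$ can be expressed linearly in terms of $PE_{pos}$.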
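
The linear relation in equation (9) can be verified numerically. The sketch below, with the hypothetical helpers `sinusoidal_pe` and `shift`, applies the 2x2 block to every (2i, 2i+1) pair of the encoding at pos and compares the result with the encoding computed directly at pos+k; the values of d_model, pos, and k are arbitrary.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Sinusoidal encoding of one position (sin at even, cos at odd indices)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe += [math.sin(angle), math.cos(angle)]
    return pe

def shift(pe, k, d_model):
    """Apply the block [[u, v], [-v, u]] of equation (9) to each pair."""
    out = [0.0] * d_model
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))
        u, v = math.cos(w * k), math.sin(w * k)
        s, c = pe[2 * i], pe[2 * i + 1]
        out[2 * i] = u * s + v * c        # equation (7)
        out[2 * i + 1] = -v * s + u * c   # equation (8)
    return out

d_model, pos, k = 8, 5, 3
direct = sinusoidal_pe(pos + k, d_model)
shifted = shift(sinusoidal_pe(pos, d_model), k, d_model)
max_err = max(abs(a - b) for a, b in zip(direct, shifted))
print(max_err)  # only floating-point rounding error remains
```

The matrix depends on k and i but not on pos, which is what makes the offset a fixed linear map.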

Computing the inner product of $PE_{pos+k}$ and $PE_{pos}$ gives:
$\begin{aligned}
PE_{pos}\cdot PE_{pos+k} &= \sum_{i=0}^{\frac{d_{model}}{2}-1}\left[\sin(w_{i}\,pos)\cdot\sin(w_{i}(pos+k)) + \cos(w_{i}\,pos)\cdot\cos(w_{i}(pos+k))\right] \\
&= \sum_{i=0}^{\frac{d_{model}}{2}-1}\cos(w_{i}(pos-(pos+k))) \\
&= \sum_{i=0}^{\frac{d_{model}}{2}-1}\cos(w_{i}k) & (10)
\end{aligned}$
where $w_{i}={\frac{1}{10000^{2i/d_{model}}}}$.

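The inner product of $PE_{pos+k}$ and $PE_{pos}$ depends only on the offset $k$, and it tends to decrease as the relative distance grows (with some oscillation rather than strictly monotonically), so it characterizes the relative distance between positions.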

However, because distance is symmetric, Sinusoidal Position Encoding can reflect the relative distance between positions but cannot distinguish direction:
$PE_{pos+k}\cdot PE_{pos}=PE_{pos-k}\cdot PE_{pos}$
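
These properties of the inner product can be observed numerically. The sketch below assumes d_model = 128 and uses the hypothetical helpers `sinusoidal_pe`, `dot`, and `similarity`; the positions and offsets are arbitrary choices.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Sinusoidal encoding of one position (sin at even, cos at odd indices)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe += [math.sin(angle), math.cos(angle)]
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_model = 128

def similarity(pos, k):
    """Inner product between the encodings at pos and pos + k."""
    return dot(sinusoidal_pe(pos, d_model), sinusoidal_pe(pos + k, d_model))

# The same offset from different starting positions gives the same value.
print(similarity(0, 7), similarity(100, 7))

# The value shrinks overall as k grows (not strictly monotonically).
print([round(similarity(50, k), 1) for k in (0, 1, 5, 10, 20)])

# Offsets +k and -k give the same value, so direction is lost.
print(similarity(50, 7), similarity(50, -7))
```

At k = 0 each pair contributes sin² + cos² = 1, so the inner product equals d_model / 2, its maximum.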


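2. An intuitive view

The simplest and most intuitive way to add position information is to number the positions of the sentence directly with 1, 2, 3, 4, ... (or the corresponding one-hot vectors). Take the binary form of these numbers as an example: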
[Table: positions and their binary representations, one bit per column (dimension 0 to dimension 3)]

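In the table above, the digits in dimension 0, dimension 1, dimension 2, and dimension 3 together form the binary representation of the position. Each dimension (each column) is periodic, and the periods differ. Specifically, each bit changes at its own rate, and the lower the bit, the faster it changes (the further right, the higher the frequency): the red bit flips between 0 and 1 at every number, while the yellow bit flips only once every 8 numbers. This shows that a multi-dimensional code built from periodic functions with different periods can be equivalent to an increasing integer sequence, which also answers why periodic functions can carry position information.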
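
The binary pattern described above is easy to regenerate. This sketch prints positions 0 to 15 as 4-bit numbers and measures how long each bit stays constant; the names `bit_at` and `run_length`, and the convention that bit 0 is the lowest bit, are choices made for this example.

```python
N_BITS = 4  # four dimensions cover positions 0..15

def bit_at(pos, bit):
    """Value of one bit of pos (bit 0 is the lowest, fastest-changing bit)."""
    return (pos >> bit) & 1

def run_length(bit, n_positions=2 ** N_BITS):
    """Number of consecutive positions from 0 that share this bit's value."""
    first = bit_at(0, bit)
    length = 0
    for pos in range(n_positions):
        if bit_at(pos, bit) != first:
            break
        length += 1
    return length

for pos in range(2 ** N_BITS):
    print(f"{pos:2d}  {format(pos, f'0{N_BITS}b')}")

for bit in range(N_BITS):
    print(f"bit {bit}: flips every {run_length(bit)} positions")
```

The lowest bit flips at every position and the highest of the four flips every 8 positions, the same low-to-high slowdown that the sine and cosine frequencies reproduce with continuous values.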

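By the same reasoning, combining sine and cosine functions of different frequencies reproduces this low-bit-to-high-bit pattern of change, so the position information can be represented.
[Figure: 2D visualization of a 128-dimensional positional encoding]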

IV. Implementing `Positional Encoding` in code

1. Approach 1

Based on the implementation in OpenNMT: onmt/modules/embeddings.py

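import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):

    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        # Table of encodings, one row per position: [max_len, d_model].
        pe = torch.zeros(max_len, d_model)
        # Column vector of positions 0 .. max_len - 1: [max_len, 1].
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # 1 / 10000^(2i / d_model), computed in log space: [d_model / 2].
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # Reshape to [max_len, 1, d_model] so it broadcasts over the batch dimension.
        pe = pe.unsqueeze(0).transpose(0, 1)
        # A buffer is saved with the module and follows .to(device), but is not trained.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: [seq_len, batch_size, d_model] (sequence-first layout)
        return x + self.pe[:x.size(0), :]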
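
The `div_term` line may look different from formula (1), but it computes the same quantity, $1/10000^{2i/d_{model}}$, through an exponential of a logarithm. A small standard-library check, with d_model = 8 chosen arbitrarily and the list names `via_exp` and `via_pow` introduced for this example:

```python
import math

d_model = 8  # small hypothetical size for illustration

# Form used in the class above: exp(2i * (-ln(10000) / d_model)).
via_exp = [math.exp(two_i * (-math.log(10000.0) / d_model))
           for two_i in range(0, d_model, 2)]

# Form written directly as a power: 1 / 10000^(2i / d_model).
via_pow = [1.0 / (10000 ** (two_i / d_model))
           for two_i in range(0, d_model, 2)]

for a, b in zip(via_exp, via_pow):
    print(f"{a:.10f}  {b:.10f}")

max_diff = max(abs(a - b) for a, b in zip(via_exp, via_pow))
```

Both forms agree up to floating-point rounding, so which one to use is a matter of style.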

2. Approach 2

Reference: Core Technology of Large Language Models: Transformer Explained

import torch
from torch import nn


class PositionalEncoding(nn.Module):
    """
    Compute the sinusoidal positional encoding.
    """
    def __init__(self, d_model, max_len, device):
        """
        Constructor of the sinusoidal encoding class.

        :param d_model: dimension of model
        :param max_len: max sequence length
        :param device: hardware device setting
        """
        super(PositionalEncoding, self).__init__()

        # same shape as the input matrix (so the two can be added)
        self.encoding = torch.zeros(max_len, d_model, device=device)
        self.encoding.requires_grad = False  # we don't need to compute gradient

        pos = torch.arange(0, max_len, device=device)
        pos = pos.float().unsqueeze(dim=1)
        # 1D => 2D: [max_len] -> [max_len, 1], one row per word position

        _2i = torch.arange(0, d_model, step=2, device=device).float()
        # _2i holds the even dimension indices 0, 2, 4, ... (e.g. d_model = 50 gives 0, 2, ..., 48)
        # "step=2" makes each entry equal to 2 * i in the formula

        self.encoding[:, 0::2] = torch.sin(pos / (10000 ** (_2i / d_model)))
        self.encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i / d_model)))
        # compute positional encoding to consider positional information of words

    def forward(self, x):
        # self.encoding
        # [max_len = 512, d_model = 512]

        batch_size, seq_len = x.size()
        # [batch_size = 128, seq_len = 30]

        return self.encoding[:seq_len, :]
        # [seq_len = 30, d_model = 512]
        # it will be added to tok_emb : [128, 30, 512]

class TokenEmbedding(nn.Embedding):
    """
    Token embedding using torch.nn.
    Maps each token index to a dense vector through a learned weight matrix.
    """

    def __init__(self, vocab_size, d_model):
        """
        Token embedding layer (positional information is added separately).

        :param vocab_size: size of vocabulary
        :param d_model: dimensions of model
        """
        # padding_idx=1: index 1 is treated as the padding token
        super(TokenEmbedding, self).__init__(vocab_size, d_model, padding_idx=1)

class TransformerEmbedding(nn.Module):
    """
    token embedding + positional encoding (sinusoid)
    positional encoding gives positional information to the network
    """

    def __init__(self, vocab_size, max_len, d_model, drop_prob, device):
        """
        Word embedding that includes positional information.

        :param vocab_size: size of vocabulary
        :param max_len: max sequence length
        :param d_model: dimensions of model
        :param drop_prob: dropout probability
        :param device: hardware device setting
        """
        super(TransformerEmbedding, self).__init__()
        self.tok_emb = TokenEmbedding(vocab_size, d_model)
        self.pos_emb = PositionalEncoding(d_model, max_len, device)
        self.drop_out = nn.Dropout(p=drop_prob)

    def forward(self, x):
        # x: [batch_size, seq_len] tensor of token indices
        tok_emb = self.tok_emb(x)   # [batch_size, seq_len, d_model]
        pos_emb = self.pos_emb(x)   # [seq_len, d_model], broadcast over the batch
        return self.drop_out(tok_emb + pos_emb)


Original article: https://blog.csdn.net/m0_37605642/article/details/132866365