NLP Models (2): GloVe Implementation

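Contents

1. Overall Approach
2. Data Preparation
3. Building the Co-occurrence Matrix
4. Extracting the Non-Zero Index Sequence
5. Creating the Data Pipeline
6. Building the Model
7. Training the Model
8. Loading the Model and Testing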

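1. Overall Approach

To keep the results comparable, this implementation of GloVe reuses the data from the earlier word2vec implementation. The data processing and preparation steps (removing punctuation, tokenizing, and building the dictionaries that map words to IDs) can therefore be reused unchanged, with one extra step: building the co-occurrence matrix. GloVe can be viewed as a refinement of word2vec that trains two additional bias terms and uses a different loss function. The model can therefore still be built on nn.Embedding layers, so the data pipeline only needs to pass integer IDs rather than one-hot encodings, and the negative-sampling step is no longer needed. Finally, the model is trained on the data to obtain the word vectors.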

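2. Data Preparation

Following the data processing used in the word2vec implementation, the Chinese text is extracted, each sentence is tokenized with jieba, and the result is stored in test.txt. The next step is to number the words, producing a mapping from ID to word and from word to ID. We also set a maximum vocabulary size $MAX\_SIZE$, keep the $MAX\_SIZE-1$ most frequent words, and group all remaining words under <UNK>, the unknown token. The code is as follows: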

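from collections import Counter
import numpy as np

# Maximum vocabulary size
MAX_SIZE = 10000
# Dimension of the trained word vectors
embedding_size = 100
# Window size on each side of the center word
single_win_size = 3
# Co-occurrence count cutoff used by the weighting function
x_max = 100
batch_size = 32
lr = 1e-3
epoch = 5


# The file holds the tokenized text, with tokens separated by spaces
with open("data/test.txt", 'r', encoding='utf-8') as f:
    content = f.read().split(" ")

# Keep the MAX_SIZE - 1 most frequent words and count everything else as <UNK>
words = dict(Counter(content).most_common(MAX_SIZE-1))
words['<UNK>'] = len(content) - np.sum(list(words.values()))

# <UNK> is added last, so it receives the final ID
word2idx = {word:i for i, word in enumerate(words.keys())}
idx2word = {i:word for i, word in enumerate(words.keys())}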
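
The vocabulary step above can be checked on a toy token list. This is only a minimal sketch: build_vocab and its parameters are names chosen here for illustration, and it assumes the unknown token does not already occur in the text.

```python
from collections import Counter

def build_vocab(tokens, max_size, unk_token="<UNK>"):
    """Keep the (max_size - 1) most frequent tokens and fold the rest into unk_token."""
    counts = dict(Counter(tokens).most_common(max_size - 1))
    # Everything outside the kept vocabulary is counted under the unknown token.
    counts[unk_token] = len(tokens) - sum(counts.values())
    word2idx = {word: i for i, word in enumerate(counts)}
    idx2word = {i: word for word, i in word2idx.items()}
    return counts, word2idx, idx2word

tokens = "a b a c a b d e".split()
counts, word2idx, idx2word = build_vocab(tokens, max_size=3)
print(counts)    # {'a': 3, 'b': 2, '<UNK>': 3}
print(word2idx)  # {'a': 0, 'b': 1, '<UNK>': 2}
```

Because the unknown token is inserted last, it always takes the largest ID, which is why a lookup can fall back to max_size - 1 for out-of-vocabulary words.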

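3. Building the Co-occurrence Matrix

The co-occurrence matrix plays a central role in GloVe, so building it correctly is essential.

The co-occurrence matrix has size $10000 \times 10000$; the row index represents the center word and the column index the context word. The logic is roughly as follows: iterate over every center word and the words in its window, add 1 to the count at index (center word, context word) for each context word in the window, slide the window until it reaches the end of the text, and return the resulting matrix. The code is as follows: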

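def get_co_occurrence_matrix(content, word2idx):
    '''
    Build the co-occurrence matrix
    :param content: list of tokens from the tokenized text
    :param word2idx: dictionary (word: ID)
    :return: co-occurrence matrix
    '''
    # Initialize the co-occurrence matrix
    matrix = np.zeros((MAX_SIZE, MAX_SIZE), np.int32)
    # Encode the token list as IDs; words outside the vocabulary map to <UNK>
    content_encode = [word2idx.get(w, MAX_SIZE - 1) for w in content]
    # Iterate over every center word
    for i, center_id in enumerate(content_encode):
        # Positions in the text of the words in the same window
        pos_indices = list(range(i - single_win_size, i)) + list(range(i + 1, i + single_win_size + 1))
        # Take the positions modulo the text length to avoid going out of bounds
        # (windows at either end of the text therefore wrap around)
        window = [j % len(content) for j in pos_indices]
        # IDs of the words in the window
        window_id = [content_encode[j] for j in window]
        # Add 1 to the count of each context word for this center word
        for j in window_id:
            matrix[center_id][j] += 1

    return matrix

matrix = get_co_occurrence_matrix(content, word2idx)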
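
To see what the counting produces, here is a small sketch on a five-token sequence. Note that it is a variant, not the function above: cooccurrence_clipped (a name chosen here for illustration) clips the window at both ends of the text instead of wrapping around with a modulo.

```python
import numpy as np

def cooccurrence_clipped(ids, vocab_size, win):
    """Count symmetric window co-occurrences, clipping the window at both ends of the text."""
    matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)
    for i, center in enumerate(ids):
        lo, hi = max(0, i - win), min(len(ids), i + win + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                matrix[center, ids[j]] += 1
    return matrix

ids = [0, 1, 2, 1, 0]  # toy ID sequence
m = cooccurrence_clipped(ids, vocab_size=3, win=1)
print(m)
# [[0 2 0]
#  [2 0 2]
#  [0 2 0]]
```

Since the window is symmetric, every (center, context) count has a matching (context, center) count, so the matrix is symmetric.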

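4. Extracting the Non-Zero Index Sequence

GloVe is trained on the co-occurrence matrix, but not every pair of words co-occurs, so the matrix inevitably contains zero entries. The weighting (penalty) function evaluates to 0 for such an entry, so its contribution to the loss is also 0: it does nothing for training and only adds to the workload (yes, I tried it), and the log function would also need a check for zero. We therefore train only on the non-zero entries. This requires first extracting the positions of the non-zero entries and then using the index supplied by the data pipeline to look up the actual row and column indices. This section implements the first step, collecting the sequence of non-zero entries. The code is as follows: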

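def get_nozero(matrix):
    # Collect the indices of the non-zero entries of the matrix
    index_nozero = []
    for i in range(MAX_SIZE):
        # The matrix is symmetric, so only the lower triangle is scanned
        for j in range(i+1):
            if matrix[i][j] != 0:
                # Add both [i,j] and [j,i] so that each word is trained in both
                # the center-word matrix and the context-word matrix
                index_nozero.append([i,j])
                # On the diagonal the mirrored pair is identical, so add it only once
                if i != j:
                    index_nozero.append([j,i])
    return index_nozero

index_nozero = get_nozero(matrix)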
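
The double Python loop visits every cell of the lower triangle, which is slow for a $10000 \times 10000$ matrix. As a possible alternative, the same index pairs can be collected with NumPy in one call. This is a sketch, not the method used in this article, and get_nonzero_indices is a name chosen here for illustration.

```python
import numpy as np

def get_nonzero_indices(matrix):
    """Return an (n, 2) array of [row, column] pairs, one per non-zero entry."""
    return np.argwhere(matrix != 0)

toy = np.array([[0, 2, 0],
                [2, 1, 0],
                [0, 0, 0]])
pairs = get_nonzero_indices(toy)
print(pairs.tolist())  # [[0, 1], [1, 0], [1, 1]]
```

When the matrix is symmetric, scanning the whole matrix already yields both [i, j] and [j, i], and each diagonal entry appears exactly once.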

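5. Creating the Data Pipeline

Let us look again at the data required by the loss function:
$loss=\sum_{i,k}f(x_{ik})(v_i^Tu_k + b_i+b_k-\log x_{ik})^2$
Since $v_i,u_k,b_i,b_k$ are all trainable parameters, the only data that needs to be passed in is $f(x_{ik})$ and $x_{ik}$. Building a separate matrix for the weighting function would cost a great deal of time and memory (I tried it, and it is indeed slow), so it is computed inside the data pipeline instead.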
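
As a reference point, the weighting function $f$ as defined in the GloVe paper normalizes the count by $x_{max}$ before raising it to the power 0.75, and is capped at 1 above the cutoff. A minimal sketch follows; glove_weight and alpha are names chosen here for illustration.

```python
def glove_weight(x, x_max=100, alpha=0.75):
    """f(x) from the GloVe paper: (x / x_max) ** alpha below the cutoff, 1 otherwise."""
    return (x / x_max) ** alpha if x < x_max else 1.0

# The weight grows with the count and saturates at 1 once x reaches x_max.
for x in (1, 20, 100, 500):
    print(x, glove_weight(x))
```

With this definition the weight never exceeds 1, so very frequent pairs cannot dominate the loss.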

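The items fetched by the data pipeline are the indices of the non-zero entries, so the __len__ method should return the length of index_nozero.

Finally, consider what each sample must provide. The loss function needs the entry $x_{ik}$ and its weight $f(x_{ik})$, and the subscripts of the vectors $v_i,u_k$ match those of $x_{ik}$, so the row and column indices must be returned as well. The code is as follows: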

from torch.utils.data import Dataset, DataLoader
import torch

class GloVeDataset(Dataset):
    def __init__(self, matrix, index_nozero):
        super(GloVeDataset, self).__init__()  # must be the first line
        # Store the co-occurrence counts as a float tensor
        self.matrix = torch.Tensor(matrix)
        self.index_nozero = index_nozero

    def __len__(self):
        # One sample per non-zero index pair
        return len(self.index_nozero)

    def __getitem__(self, idx):
        row = self.index_nozero[idx][0]
        column = self.index_nozero[idx][1]
        # These must be tensors, otherwise collating a batch raises an error
        x_ik = self.matrix[row, column].reshape(1)
        # Weight used in this article: x ** 0.75 below x_max, 1 otherwise.
        # Note: the GloVe paper defines it as (x / x_max) ** 0.75 below x_max.
        punish_x = x_ik ** 0.75 if x_ik.item() < x_max else torch.ones(1)

        return row, column, x_ik, punish_x

glove_dataset = GloVeDataset(matrix, index_nozero)
dataloader = DataLoader(glove_dataset, batch_size, shuffle=True)

Printing one batch from the data pipeline gives the following output:

for i, (row, column, x_ik, punish_x) in enumerate(dataloader):
    print(row)
    print(column)
    print(x_ik)
    print(punish_x)
    break

tensor([1775, 9, 5402, 7567, 2833, 263, 185, 174, 1200, 3400, 893, 760,
689, 765, 0, 5547, 537, 631, 8, 1331, 2072, 19, 225, 1478,
0, 51, 712, 4165, 192, 5550, 669, 2781])
tensor([ 598, 40, 71, 420, 1279, 3015, 1272, 3649, 3736, 1710, 94, 4074,
3233, 720, 6686, 5241, 179, 9079, 635, 7341, 157, 393, 287, 2015,
2983, 7931, 707, 2992, 5846, 926, 898, 1398])
tensor([[ 2.],
[20.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 2.],
[ 1.],
[ 2.],
[ 1.],
[ 3.],
[ 1.],
[ 9.],
[ 1.],
[ 1.],
[ 1.],
[ 2.],
[ 1.],
[ 3.],
[ 1.],
[ 1.],
[ 1.],
[ 1.],
[ 2.],
[ 1.],
[ 1.]])
tensor([[1.6818],
[9.4574],
[1.0000],
[1.0000],
[1.0000],
[1.0000],
[1.0000],
[1.0000],
[1.0000],
[1.0000],
[1.0000],
[1.0000],
[1.6818],
[1.0000],
[1.6818],
[1.0000],
[2.2795],
[1.0000],
[5.1962],
[1.0000],
[1.0000],
[1.0000],
[1.6818],
[1.0000],
[2.2795],
[1.0000],
[1.0000],
[1.0000],
[1.0000],
[1.6818],
[1.0000],
[1.0000]])

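6. Building the Model

Next comes building the model. First we need to know which parameters are trained. The first group is the word vectors: because GloVe is derived from word2vec, the center-word and context-word vectors must both be kept as trainable parameters. The second group is the bias terms: the loss function below shows clearly that $b_i,b_k$ must also be trained.
$loss=\sum_{i,k}f(x_{ik})(v_i^Tu_k + b_i+b_k-\log x_{ik})^2$
With these points clear, we can build the model.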

import torch.nn as nn

class GloVe(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super(GloVe, self).__init__()

        self.vocab_size = vocab_size
        self.embed_size = embed_size

        # Center-word embedding matrix
        self.center_embed = nn.Embedding(self.vocab_size, self.embed_size)
        # Context-word embedding matrix
        self.backgroud_embed = nn.Embedding(self.vocab_size, self.embed_size)

        # Center-word bias: one scalar per word, hence dimension 1
        self.center_bias = nn.Embedding(self.vocab_size, 1)
        # Context-word bias
        self.backgroud_bias = nn.Embedding(self.vocab_size, 1)


    def forward(self, row, column, x_ik, punish_x):
        '''
        Inputs arrive in batches, so their first dimension is the batch size
        :param row: [batch_size]
        :param column: [batch_size]
        :param x_ik: [batch_size, 1]
        :param punish_x: [batch_size, 1]
        :return: per-sample loss, [batch_size]
        '''
        v_i = self.center_embed(row) # [batch_size, embed_size]
        u_k = self.backgroud_embed(column) # [batch_size, embed_size]
        b_i = self.center_bias(row)  # [batch_size, 1]
        # Squeeze to one dimension so that the addition below works as intended
        b_i = b_i.squeeze(1)  # [batch_size]

        b_k = self.backgroud_bias(column)  # [batch_size, 1]
        b_k = b_k.squeeze(1)  # [batch_size]

        x_ik = x_ik.squeeze(1)  # [batch_size]
        punish_x = punish_x.squeeze(1)  # [batch_size]

        # Compute the loss following the loss function
        loss = punish_x * (torch.mul(v_i, u_k).sum(dim=1) + b_i + b_k - torch.log(x_ik)) ** 2

        return loss

    def get_predic_vec(self):
        # Following the GloVe authors, return the sum of the two embedding matrices
        return self.center_embed.weight.data.cpu().numpy()+self.backgroud_embed.weight.data.cpu().numpy()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GloVe(MAX_SIZE, embedding_size).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
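
The batched loss expression in forward() can be sanity-checked on random toy tensors. The sketch below is a standalone check rather than part of the model: all tensors are made-up values, and the weights are fixed to 1 for simplicity.

```python
import torch

torch.manual_seed(0)
batch_size, embed_size = 4, 3
v_i = torch.randn(batch_size, embed_size)   # center-word vectors
u_k = torch.randn(batch_size, embed_size)   # context-word vectors
b_i = torch.randn(batch_size)               # center-word biases
b_k = torch.randn(batch_size)               # context-word biases
x_ik = torch.tensor([1.0, 2.0, 5.0, 20.0])  # toy co-occurrence counts
weight = torch.ones(batch_size)             # placeholder for f(x_ik)

# Row-wise dot product, written the same way as in forward()
dot = torch.mul(v_i, u_k).sum(dim=1)        # [batch_size]
loss = weight * (dot + b_i + b_k - torch.log(x_ik)) ** 2

# Cross-check against an explicit per-sample computation
expected = torch.stack([
    (v_i[n] @ u_k[n] + b_i[n] + b_k[n] - torch.log(x_ik[n])) ** 2
    for n in range(batch_size)
])
print(loss.shape)                      # torch.Size([4])
print(torch.allclose(loss, expected))  # True
```

The element-wise product followed by sum(dim=1) is what turns two [batch_size, embed_size] tensors into one dot product per sample.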

7. Training the Model

The last step is training. It follows the same pattern as before, so this part is fairly simple.

def train_model():
    # Train the model
    for e in range(epoch):
        for i, (row, column, x_ik, punish_x) in enumerate(dataloader):
            # Move the batch to the same device as the model
            row = row.to(device)
            column = column.to(device)
            x_ik = x_ik.to(device)
            punish_x = punish_x.to(device)

            optimizer.zero_grad()
            # The model returns a per-sample loss; average it over the batch
            loss = model(row, column, x_ik, punish_x).mean()
            loss.backward()

            optimizer.step()

            if i % 1000 == 0:
                print('epoch', e, 'iteration', i, loss.item())

    # Save the trained parameters
    torch.save(model.state_dict(), "data/glove-{}.th".format(embedding_size))

train_model()
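
Before running on the full corpus, it can help to confirm that the same kind of loop drives the loss down on a tiny problem. The following is a rough sketch under stated assumptions: TinyGloVe is a cut-down stand-in for the model above, the 3 x 3 count matrix is invented, and the weights are fixed to 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyGloVe(nn.Module):
    """Cut-down stand-in for the GloVe model, used only for this check."""
    def __init__(self, vocab_size, embed_size):
        super().__init__()
        self.center = nn.Embedding(vocab_size, embed_size)
        self.context = nn.Embedding(vocab_size, embed_size)
        self.center_bias = nn.Embedding(vocab_size, 1)
        self.context_bias = nn.Embedding(vocab_size, 1)

    def forward(self, row, col, x, weight):
        dot = (self.center(row) * self.context(col)).sum(dim=1)
        bias = self.center_bias(row).squeeze(1) + self.context_bias(col).squeeze(1)
        return weight * (dot + bias - torch.log(x)) ** 2

# Invented symmetric co-occurrence counts for a 3-word vocabulary
counts = torch.tensor([[0., 2., 1.],
                       [2., 0., 3.],
                       [1., 3., 0.]])
idx = counts.nonzero()            # [n, 2] index pairs of the non-zero entries
row, col = idx[:, 0], idx[:, 1]
x = counts[row, col]
weight = torch.ones_like(x)       # weights fixed to 1 for this check

tiny_model = TinyGloVe(vocab_size=3, embed_size=4)
tiny_optimizer = torch.optim.Adam(tiny_model.parameters(), lr=0.05)

losses = []
for step in range(200):
    tiny_optimizer.zero_grad()
    loss = tiny_model(row, col, x, weight).mean()
    loss.backward()
    tiny_optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])  # the final loss should be far below the first
```

If the loss does not fall on a problem this small, the issue is more likely in the model or the loss than in the data.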

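8. Loading the Model and Testing

With the model trained, we can load it and test word relatedness. The test uses the same method as the earlier word2vec experiment: given an input word, return the 100 most related words by cosine similarity. Because the text is the same one used in the word2vec experiment, the results here show how the two models differ.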

from sklearn.metrics.pairwise import cosine_similarity

def find_word(word):
    '''
    Compute and print the 100 words most related to the input word
    :param word: input word
    :return:
    '''
    # Load the model; map_location lets a GPU-trained model load on a CPU-only machine
    model = GloVe(MAX_SIZE, embedding_size)
    model.load_state_dict(torch.load("data/glove-100.th", map_location="cpu"))
    # Get the word vectors (sum of the center-word and context-word matrices)
    embedding_weight = model.get_predic_vec()
    # Build a dictionary from word to word vector
    word2embedding = {}
    for i in words:
        word2embedding[i] = embedding_weight[word2idx[i]]
    # Cosine similarity between the input word and every other word
    other = {}
    for i in words:
        if i == word:
            continue
        # Compute the cosine similarity (a 1 x 1 array)
        other[i] = cosine_similarity(word2embedding[word].reshape(1, -1), word2embedding[i].reshape(1, -1))

    # Sort by cosine similarity in descending order
    other = sorted(other.items(), key=lambda x: x[1], reverse=True)
    count = 0
    # Print the 100 most similar words
    for i, j in other:
        print("({},{})".format(i, j))
        count += 1
        if count == 100:
            break


find_word('大师')
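
Calling cosine_similarity once per word works but repeats a lot of overhead. One possible alternative is to normalize the whole embedding matrix and take a single matrix-vector product. This is a sketch on a made-up 4 x 2 matrix, and top_k_similar is a name chosen here for illustration.

```python
import numpy as np

def top_k_similar(embeddings, query_idx, k):
    """Return (index, cosine similarity) for the k rows most similar to row query_idx."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit[query_idx]                    # cosine similarity to every row
    sims[query_idx] = -np.inf                        # exclude the query itself
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

emb = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [-1.0, 0.0]])
result = top_k_similar(emb, query_idx=0, k=2)
print(result)  # row 1 first (nearly parallel to row 0), then row 2 (orthogonal)
```

The returned indices can be mapped back to words with idx2word.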

The test results are shown below; for brevity, only part of the output is included.

(弗兰德,[[0.4906645]])
(听,[[0.44961697]])
(点头,[[0.38447148]])
(老师,[[0.37548357]])
(道,[[0.36903727]])
(教导,[[0.36127198]])
(点,[[0.35702538]])
(说,[[0.32980374]])
(告诉,[[0.316578]])
(目光,[[0.31207395]])
(一眼,[[0.3024309]])
(明白,[[0.29733318]])
(唐昊,[[0.29704148]])
(二龙,[[0.29656404]])
(众人,[[0.29131022]])
(微笑,[[0.29005405]])
(赵无极,[[0.2833815]])
(身边,[[0.28156656]])
(摇头,[[0.28018552]])
(孩子,[[0.2790551]])
(淡然,[[0.2748983]])
(风致,[[0.27243102]])
(柳,[[0.26889658]])
(指点,[[0.2659605]])
(摇,[[0.2657492]])
(问道,[[0.26130816]])
(说道,[[0.25864923]])
(话,[[0.25843865]])
(愣,[[0.2583086]])
(唐三,[[0.25819662]])
(不禁,[[0.25735095]])
(一句,[[0.2567986]])
(看着,[[0.25605386]])
(眼中,[[0.25521308]])
(叹息,[[0.24495934]])
(研究,[[0.24350488]])

The full code is available in my GitHub repository.


Original article: https://blog.csdn.net/ifhuke/article/details/127829257