The character sequence "hello" is converted to a one-hot representation:
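- 'h' → [1, 0, 0, 0]
- 'e' → [0, 1, 0, 0]
- 'l' → [0, 0, 1, 0]
- 'o' → [0, 0, 0, 1]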
We use a single-layer RNN in the "N vs N" configuration (one output per input), with a hidden size of 2, feeding one character per time step. The initial parameters are:
$$
W_{xh} = \begin{pmatrix} 0.1 & 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 & 0.8 \end{pmatrix}, \quad
W_{hh} = \begin{pmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{pmatrix}, \quad
W_{hy} = \begin{pmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \\ 0.5 & 0.6 \\ 0.7 & 0.8 \end{pmatrix}
$$
The bias terms are initialized to 0.
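To make the setup concrete, here is a minimal NumPy sketch of the same parameters. It assumes the shapes implied above ($W_{xh}$: 2×4, $W_{hh}$: 2×2, $W_{hy}$: 4×2); the variable names are illustrative and not part of the original walkthrough.

```python
import numpy as np

# Vocabulary and one-hot encodings for "hello"
chars = ['h', 'e', 'l', 'o']
one_hot = {c: np.eye(len(chars))[i] for i, c in enumerate(chars)}

# Initial parameters from the worked example (biases are zero)
W_xh = np.array([[0.1, 0.2, 0.3, 0.4],
                 [0.5, 0.6, 0.7, 0.8]])   # (hidden=2, input=4)
W_hh = np.array([[0.1, 0.2],
                 [0.3, 0.4]])             # (hidden=2, hidden=2)
W_hy = np.array([[0.1, 0.2],
                 [0.3, 0.4],
                 [0.5, 0.6],
                 [0.7, 0.8]])             # (output=4, hidden=2)
h0 = np.zeros(2)                          # initial hidden state h_0
```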
Input vector: $x_1 = [1, 0, 0, 0]$
$$
h_1 = \tanh(W_{xh} x_1 + W_{hh} h_0)
= \tanh\left( \begin{pmatrix} 0.1 & 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 & 0.8 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} + \begin{pmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \end{pmatrix} \right)
= \tanh\left( \begin{pmatrix} 0.1 \\ 0.3 \end{pmatrix} \right)
= \begin{pmatrix} 0.0997 \\ 0.2913 \end{pmatrix}
$$
$$
y_1 = W_{hy} h_1 = \begin{pmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \\ 0.5 & 0.6 \\ 0.7 & 0.8 \end{pmatrix} \begin{pmatrix} 0.0997 \\ 0.2913 \end{pmatrix}
= \begin{pmatrix} 0.1695 \\ 0.3889 \\ 0.6083 \\ 0.8277 \end{pmatrix}
$$

The prediction is $\hat{y}_1 = \text{softmax}(y_1)$.
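Here softmax is the usual normalized exponential over the four output scores:

$$
\text{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j} e^{y_j}}
$$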
Assume the true next character is 'e', whose one-hot encoding is $y_1 = [0, 1, 0, 0]$.

Cross-entropy loss:

$$
\text{loss}_1 = - \sum_{i} y_{1i} \log(\hat{y}_{1i})
$$
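A minimal sketch of this forward step and loss, continuing the NumPy setup above. It is illustrative only: the intermediate numbers it produces depend on the exact matrix layout and may differ slightly from the hand-worked values in the text.

```python
# Forward pass and loss for time step 1 (continuing the NumPy sketch above)
x1 = one_hot['h']                        # input character 'h'
t1 = one_hot['e']                        # one-hot target 'e'

h1 = np.tanh(W_xh @ x1 + W_hh @ h0)      # hidden state h_1
y1 = W_hy @ h1                           # output scores y_1
y1_hat = np.exp(y1) / np.exp(y1).sum()   # softmax prediction
loss1 = -np.sum(t1 * np.log(y1_hat))     # cross-entropy loss
```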
Gradient computation:

$$
\frac{\partial \text{loss}_1}{\partial W_{hy}} = (\hat{y}_1 - y_1) h_1^T
$$

$$
\frac{\partial \text{loss}_1}{\partial W_{xh}} = \frac{\partial \text{loss}_1}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_{xh}}
$$

$$
\frac{\partial \text{loss}_1}{\partial W_{hh}} = \frac{\partial \text{loss}_1}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_{hh}}
$$

Parameter update (with learning rate $\eta$):

$$
W_{xh} = W_{xh} - \eta \frac{\partial \text{loss}_1}{\partial W_{xh}}, \quad
W_{hh} = W_{hh} - \eta \frac{\partial \text{loss}_1}{\partial W_{hh}}, \quad
W_{hy} = W_{hy} - \eta \frac{\partial \text{loss}_1}{\partial W_{hy}}
$$
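The gradients and update above can be spelled out explicitly for this single step. The sketch below continues the NumPy example; the expansion through the tanh is standard backpropagation (left implicit in the formulas above), and `lr` is an assumed learning rate, since the text does not specify a value for $\eta$.

```python
lr = 0.1                                 # illustrative learning rate (not given in the text)

# Backward pass for step 1 (single-step truncation, as in the walkthrough)
dy1 = y1_hat - t1                        # d loss / d y1 (softmax + cross-entropy)
dW_hy = np.outer(dy1, h1)                # d loss / d W_hy = (y1_hat - t1) h1^T
dh1 = W_hy.T @ dy1                       # d loss / d h1
dpre1 = dh1 * (1.0 - h1 ** 2)            # back through tanh
dW_xh = np.outer(dpre1, x1)              # d loss / d W_xh
dW_hh = np.outer(dpre1, h0)              # d loss / d W_hh (zero here, since h0 = 0)

# SGD update, as in the text: W <- W - eta * dW
W_hy -= lr * dW_hy
W_xh -= lr * dW_xh
W_hh -= lr * dW_hh
```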
The updated $W_{xh}$, $W_{hh}$, and $W_{hy}$ are then used for the next time step.

Input vector: $x_2 = [0, 1, 0, 0]$
$$
h_2 = \tanh(W_{xh} x_2 + W_{hh} h_1)
= \tanh\left( \begin{pmatrix} 0.1 & 0.2 & 0.3 & 0.4 \\ 0.5 & 0.6 & 0.7 & 0.8 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix} + \begin{pmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{pmatrix} \begin{pmatrix} 0.0997 \\ 0.2913 \end{pmatrix} \right)
$$
After computing the products:

$$
h_2 = \tanh\left( \begin{pmatrix} 0.3 \\ 0.7 \end{pmatrix} + \begin{pmatrix} 0.1283 \\ 0.2147 \end{pmatrix} \right)
= \tanh\left( \begin{pmatrix} 0.4283 \\ 0.9147 \end{pmatrix} \right)
$$
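Evaluating the tanh elementwise gives approximately:

$$
h_2 \approx \begin{pmatrix} 0.4039 \\ 0.7234 \end{pmatrix}
$$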
$$
y_2 = W_{hy} h_2
$$

The prediction is $\hat{y}_2 = \text{softmax}(y_2)$.

Assume the true next character is 'l', whose one-hot encoding is $y_2 = [0, 0, 1, 0]$.

Cross-entropy loss:

$$
\text{loss}_2 = - \sum_{i} y_{2i} \log(\hat{y}_{2i})
$$

Gradient computation:

$$
\frac{\partial \text{loss}_2}{\partial W_{hy}} = (\hat{y}_2 - y_2) h_2^T
$$

$$
\frac{\partial \text{loss}_2}{\partial W_{xh}} = \frac{\partial \text{loss}_2}{\partial h_2} \cdot \frac{\partial h_2}{\partial W_{xh}}
$$

$$
\frac{\partial \text{loss}_2}{\partial W_{hh}} = \frac{\partial \text{loss}_2}{\partial h_2} \cdot \frac{\partial h_2}{\partial W_{hh}}
$$

Parameter update:

$$
W_{xh} = W_{xh} - \eta \frac{\partial \text{loss}_2}{\partial W_{xh}}, \quad
W_{hh} = W_{hh} - \eta \frac{\partial \text{loss}_2}{\partial W_{hh}}, \quad
W_{hy} = W_{hy} - \eta \frac{\partial \text{loss}_2}{\partial W_{hy}}
$$

The updated $W_{xh}$, $W_{hh}$, and $W_{hy}$ are then used for the next time step.
Input vector: $x_3 = [0, 0, 1, 0]$

$$
h_3 = \tanh(W_{xh} x_3 + W_{hh} h_2)
$$
After computing the input term:

$$
h_3 = \tanh\left( \begin{pmatrix} 0.5 \\ 1.2 \end{pmatrix} + W_{hh} h_2 \right)
$$
$$
y_3 = W_{hy} h_3
$$

The prediction is $\hat{y}_3 = \text{softmax}(y_3)$.

Assume the true next character is 'l', whose one-hot encoding is $y_3 = [0, 0, 1, 0]$.

Cross-entropy loss:

$$
\text{loss}_3 = - \sum_{i} y_{3i} \log(\hat{y}_{3i})
$$

Gradient computation:

$$
\frac{\partial \text{loss}_3}{\partial W_{hy}} = (\hat{y}_3 - y_3) h_3^T
$$

$$
\frac{\partial \text{loss}_3}{\partial W_{xh}} = \frac{\partial \text{loss}_3}{\partial h_3} \cdot \frac{\partial h_3}{\partial W_{xh}}
$$

$$
\frac{\partial \text{loss}_3}{\partial W_{hh}} = \frac{\partial \text{loss}_3}{\partial h_3} \cdot \frac{\partial h_3}{\partial W_{hh}}
$$

Parameter update:

$$
W_{xh} = W_{xh} - \eta \frac{\partial \text{loss}_3}{\partial W_{xh}}, \quad
W_{hh} = W_{hh} - \eta \frac{\partial \text{loss}_3}{\partial W_{hh}}, \quad
W_{hy} = W_{hy} - \eta \frac{\partial \text{loss}_3}{\partial W_{hy}}
$$

The updated $W_{xh}$, $W_{hh}$, and $W_{hy}$ are then used for the next time step.
Input vector: $x_4 = [0, 0, 1, 0]$

$$
h_4 = \tanh(W_{xh} x_4 + W_{hh} h_3)
$$
After computing the input term:

$$
h_4 = \tanh\left( \begin{pmatrix} 0.5 \\ 1.2 \end{pmatrix} + W_{hh} h_3 \right)
$$
$$
y_4 = W_{hy} h_4
$$

The prediction is $\hat{y}_4 = \text{softmax}(y_4)$.

Assume the true next character is 'o', whose one-hot encoding is $y_4 = [0, 0, 0, 1]$.

Cross-entropy loss:

$$
\text{loss}_4 = - \sum_{i} y_{4i} \log(\hat{y}_{4i})
$$

Gradient computation:

$$
\frac{\partial \text{loss}_4}{\partial W_{hy}} = (\hat{y}_4 - y_4) h_4^T
$$

$$
\frac{\partial \text{loss}_4}{\partial W_{xh}} = \frac{\partial \text{loss}_4}{\partial h_4} \cdot \frac{\partial h_4}{\partial W_{xh}}
$$

$$
\frac{\partial \text{loss}_4}{\partial W_{hh}} = \frac{\partial \text{loss}_4}{\partial h_4} \cdot \frac{\partial h_4}{\partial W_{hh}}
$$

Parameter update:

$$
W_{xh} = W_{xh} - \eta \frac{\partial \text{loss}_4}{\partial W_{xh}}, \quad
W_{hh} = W_{hh} - \eta \frac{\partial \text{loss}_4}{\partial W_{hh}}, \quad
W_{hy} = W_{hy} - \eta \frac{\partial \text{loss}_4}{\partial W_{hy}}
$$
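Putting the four steps together, the walkthrough amounts to one pass over "hello" in which the forward step, the backward step, and the parameter update are repeated for each character, with the weights refreshed after every step. The NumPy sketch below ties this together. It continues the earlier sketches (it assumes `np`, `one_hot`, `W_xh`, `W_hh`, `W_hy`, and `lr` are already defined) and is illustrative only, not a reproduction of the exact numbers in the text.

```python
inputs  = ['h', 'e', 'l', 'l']   # x_1 .. x_4
targets = ['e', 'l', 'l', 'o']   # true next characters

h = np.zeros(2)                  # h_0
for x_ch, t_ch in zip(inputs, targets):
    x, t = one_hot[x_ch], one_hot[t_ch]

    # Forward pass
    h_prev = h
    h = np.tanh(W_xh @ x + W_hh @ h_prev)
    y = W_hy @ h
    y_hat = np.exp(y) / np.exp(y).sum()
    loss = -np.sum(t * np.log(y_hat))

    # Backward pass (single-step truncation, as in the walkthrough)
    dy = y_hat - t
    dW_hy = np.outer(dy, h)
    dpre = (W_hy.T @ dy) * (1.0 - h ** 2)
    dW_xh = np.outer(dpre, x)
    dW_hh = np.outer(dpre, h_prev)

    # Parameter update after every character, as described above
    W_hy -= lr * dW_hy
    W_xh -= lr * dW_xh
    W_hh -= lr * dW_hh
```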
Below is example code for a simple RNN (recurrent neural network) implemented in PyTorch. It takes a character sequence as input and predicts the next character, using a small character set for the demonstration.

Before starting, make sure PyTorch is installed. You can install it with:
```bash
pip install torch
```
We will implement a character-level RNN that learns to predict the next character of the sequence "hello". The character set is {'h', 'e', 'l', 'o'}.
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Character set and mappings between characters and indices
chars = ['h', 'e', 'l', 'o']
char_to_idx = {ch: idx for idx, ch in enumerate(chars)}
idx_to_char = {idx: ch for idx, ch in enumerate(chars)}

# Hyperparameters
input_size = len(chars)
hidden_size = 10
output_size = len(chars)
num_layers = 1
learning_rate = 0.01
num_epochs = 100

# Data preparation: one-hot encodings
def char_to_tensor(char):
    tensor = torch.zeros(input_size)
    tensor[char_to_idx[char]] = 1.0
    return tensor

def string_to_tensor(string):
    tensor = torch.zeros(len(string), input_size)
    for idx, char in enumerate(string):
        tensor[idx][char_to_idx[char]] = 1.0
    return tensor

input_seq = "hell"    # inputs:  h, e, l, l
target_seq = "ello"   # targets: e, l, l, o
input_tensor = string_to_tensor(input_seq).unsqueeze(0)               # (1, seq_len, input_size)
target_tensor = torch.tensor([char_to_idx[ch] for ch in target_seq])  # (seq_len,)

# Define the RNN model
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden):
        out, hidden = self.rnn(x, hidden)   # out: (batch, seq_len, hidden_size)
        out = self.fc(out)                  # scores for every time step
        return out, hidden

    def init_hidden(self):
        return torch.zeros(num_layers, 1, self.hidden_size)

# Initialize the model, loss function, and optimizer
model = RNN(input_size, hidden_size, output_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
for epoch in range(num_epochs):
    hidden = model.init_hidden()
    optimizer.zero_grad()
    output, hidden = model(input_tensor, hidden)
    # Compare the prediction at every time step with its target character
    loss = criterion(output.view(-1, output_size), target_tensor)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')

# Test the model
def predict(model, char, hidden=None):
    if hidden is None:
        hidden = model.init_hidden()
    input_tensor = char_to_tensor(char).unsqueeze(0).unsqueeze(0)   # (1, 1, input_size)
    output, hidden = model(input_tensor, hidden)
    _, predicted_idx = torch.max(output[:, -1, :], dim=1)
    return idx_to_char[predicted_idx.item()], hidden

hidden = model.init_hidden()
input_char = 'h'
predicted_seq = input_char
for _ in range(len(input_seq)):
    next_char, hidden = predict(model, input_char, hidden)
    predicted_seq += next_char
    input_char = next_char

print(f'Predicted sequence: {predicted_seq}')
```
- Data preparation: the `char_to_tensor` function converts a character into a one-hot vector, and `string_to_tensor` converts a string into a sequence of one-hot vectors.
- Defining the RNN model: the `RNN` class inherits from `nn.Module` and contains one RNN layer and one fully connected layer. The `forward` method performs the forward pass, and `init_hidden` initializes the hidden state.
- Training the model: each epoch resets the hidden state, runs the forward pass over the input sequence, computes the cross-entropy loss against the target characters, backpropagates, and updates the parameters with the Adam optimizer.
- Testing the model: the `predict` function generates the next character for a given input character. After running the code you will see the predicted character sequence; the model gradually learns to predict the next character of the input sequence.
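As a small usage example of the data-preparation helpers above (the expected values follow directly from the one-hot encoding of the four-character vocabulary):

```python
# Quick check of the data-preparation helpers
print(char_to_tensor('h'))               # tensor([1., 0., 0., 0.])
print(string_to_tensor("hell").shape)    # torch.Size([4, 4])
```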