A summary of the training error (training loss) and generalization error (validation loss) curves:
# Observe overfitting and underfitting by fitting polynomials of different degrees
import math
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l
Use a third-order polynomial to generate the labels for both the training and the test datasets:

$$y = 5 + 1.2x - 3.4\frac{x^2}{2!} + 5.6\frac{x^3}{3!} + \epsilon, \quad \text{where } \epsilon \sim N(0,\, 0.1^2)$$

Here $\epsilon$ is the noise term; the constant term 5 corresponds to true_w[0] in the code below.
# Maximum degree of the polynomial
max_degree = 20
# Sizes of the training and test datasets (n_test in fact serves as a validation set)
n_train, n_test = 100, 100
true_w = np.zeros(max_degree)
# Only the first 4 entries of true_w are nonzero; the remaining 16 feature
# columns have zero weight and carry no signal
true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])
features = np.random.normal(size=(n_train + n_test, 1))
np.random.shuffle(features)
poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))
for i in range(max_degree):
    poly_features[:, i] /= math.gamma(i + 1)  # gamma(n) = (n-1)!
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)
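As a quick check of the identity the comment above relies on (a sketch; math.gamma is float-valued, hence the isclose comparison):

# gamma(n) = (n-1)!, so dividing column i by gamma(i + 1) divides it by i!
assert all(math.isclose(math.gamma(i + 1), math.factorial(i))
           for i in range(max_degree))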
# Inspect the generated data
# Convert the NumPy ndarrays to PyTorch tensors
true_w, features, poly_features, labels = [
    torch.tensor(x, dtype=torch.float32)
    for x in [true_w, features, poly_features, labels]]
features[:2], poly_features[:2, :], labels[:2]
(tensor([[-0.7365],
[ 0.5832]]),
tensor([[ 1.0000e+00, -7.3652e-01, 2.7123e-01, -6.6588e-02, 1.2261e-02,
-1.8061e-03, 2.2170e-04, -2.3327e-05, 2.1476e-06, -1.7575e-07,
1.2944e-08, -8.6669e-10, 5.3194e-11, -3.0137e-12, 1.5855e-13,
-7.7849e-15, 3.5836e-16, -1.5526e-17, 6.3527e-19, -2.4626e-20],
[ 1.0000e+00, 5.8318e-01, 1.7005e-01, 3.3057e-02, 4.8196e-03,
5.6214e-04, 5.4638e-05, 4.5520e-06, 3.3183e-07, 2.1502e-08,
1.2540e-09, 6.6481e-11, 3.2309e-12, 1.4494e-13, 6.0375e-15,
2.3473e-16, 8.5557e-18, 2.9350e-19, 9.5092e-21, 2.9187e-22]]),
tensor([2.8299, 5.2456]))
# Evaluate the average loss of a model on the given dataset
def evaluate_loss(net, data_iter, loss):
    metric = d2l.Accumulator(2)  # sum of losses, number of examples
    for X, y in data_iter:
        out = net(X)
        y = y.reshape(out.shape)
        l = loss(out, y)
        metric.add(l.sum(), l.numel())
    return metric[0] / metric[1]
# Training
def train(train_features, test_features, train_labels, test_labels,
          num_epochs=800):
    # Per-element losses, so evaluate_loss can average them exactly
    loss = nn.MSELoss(reduction='none')
    # Model: one linear output neuron without bias, since the constant
    # term is already the first polynomial feature column
    input_shape = train_features.shape[-1]
    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))
    batch_size = min(10, train_labels.shape[0])
    # Data iterators
    train_iter = d2l.load_array((train_features, train_labels.reshape(-1, 1)),
                                batch_size)
    test_iter = d2l.load_array((test_features, test_labels.reshape(-1, 1)),
                               batch_size, is_train=False)
    # Optimizer
    trainer = torch.optim.SGD(net.parameters(), lr=0.01)
    # Animated plot of the train/test loss curves
    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
                            legend=['train', 'test'])
    # Training loop
    for epoch in range(num_epochs):
        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
        if epoch == 0 or (epoch + 1) % 20 == 0:
            animator.add(epoch + 1,
                         (evaluate_loss(net, train_iter, loss),
                          evaluate_loss(net, test_iter, loss)))
    print('weight:', net[0].weight.data.numpy())
Show how well the fitted functions match the data.
Proper fit
Third-order polynomial fit
poly_features takes the first 4 feature columns, i.e. $1,\ x,\ \frac{x^2}{2!},\ \frac{x^3}{3!}$
# train_features takes the first n_train rows; test_features takes the rows from n_train onward
train(poly_features[:n_train, :4], poly_features[n_train:, :4],
      labels[:n_train], labels[n_train:])
weight: [[ 5.0088735 1.2420155 -3.4160125 5.509955 ]]
The learned weights come out close to the true values [5, 1.2, -3.4, 5.6].
Underfitting
Linear function fit
poly_features takes only the first 2 feature columns, i.e. $1,\ x$
train(poly_features[:n_train, :2], poly_features[n_train:, :2],
      labels[:n_train], labels[n_train:])
weight: [[3.2691808 4.0887265]]
A linear model cannot capture the cubic trend, so both the training and test losses stay high.
Overfitting
High-degree polynomial fit
poly_features takes all 20 feature columns, i.e. $1,\ x,\ \frac{x^2}{2!},\ \frac{x^3}{3!},\ \frac{x^4}{4!},\ \dots,\ \frac{x^{19}}{19!}$; the columns beyond the first four have true weight 0 and contribute nothing but noise dimensions.
train(poly_features[:n_train, :], poly_features[n_train:, :],
      labels[:n_train], labels[n_train:])
weight: [[ 4.9652715 1.32939 -3.2067175 5.0756216 -0.6167561 1.0779084
-0.19108813 0.21295705 -0.16514365 0.1888543 -0.10817137 0.11056256
0.19794509 0.07623929 0.1727855 0.21495496 -0.03231318 -0.01818779
0.00740698 0.01728407]]
The higher-order terms pick up small but nonzero weights: the model is fitting noise in the training data.
TODO: here the test loss does not first fall and then rise, as the textbook overfitting curve would suggest.
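One plausible reason is that 100 training samples are still plenty for 20 parameters. A sketch (not from the original notebook) that should make the overfitting more visible by shrinking the training set:

# With only 20 samples for 20 parameters, the model can nearly interpolate
# the training noise, widening the gap between train and test loss
train(poly_features[:20, :], poly_features[n_train:, :],
      labels[:20], labels[n_train:], num_epochs=1500)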
Questions
Drawbacks of SVMs compared with neural networks
What model pruning and distillation (discarding the dross, keeping the essence) are for
Criteria for splitting the training, validation, and test sets
Data with a temporal order, e.g. stock charts
Should the training and validation sets go through "data cleaning (outlier handling)" and "feature construction (standardization / mean normalization)" together? (see the standardization sketch after this list)
k-fold cross-validation is only worth doing when the dataset is not large
Design and selection of hyperparameters
Class-imbalanced datasets, e.g. a binary classification problem with 1 sample of class A and 9 of class B
The purpose of k-fold cross-validation is to settle the hyperparameters; the model is then retrained on the full training set with them (see the k-fold sketch after this list)
SVMs were once more popular than MLPs; CNNs in turn became more popular than SVMs
Applications of random forests in deep learning
Entering competitions
k-fold cross-validation
Neural networks are a kind of language
With the same model and the same data, ensembling models trained from different random initializations works very well (see the ensembling sketch after this list)
The less noise in the dataset the better; remove the noise
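On the data-cleaning/standardization question above: statistics such as the mean and standard deviation should be estimated on the training split only and then applied unchanged to the validation split, otherwise validation information leaks into training. A minimal sketch using the features tensor defined earlier:

X_train, X_val = features[:n_train], features[n_train:]
# Fit the standardization statistics on the training split only ...
mu, sigma = X_train.mean(dim=0), X_train.std(dim=0)
# ... and reuse them on the validation split (no peeking at its statistics)
X_train_std = (X_train - mu) / sigma
X_val_std = (X_val - mu) / sigma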
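On k-fold cross-validation: split the training rows into k folds, train on k-1 of them, validate on the held-out fold, and average the k validation losses to compare hyperparameter settings. A minimal full-batch sketch (k_fold_eval is a hypothetical helper, not part of d2l):

import torch
from torch import nn

def k_fold_eval(X, y, k=5, num_epochs=400, lr=0.01):
    # Returns the validation MSE averaged over k folds
    fold_size = X.shape[0] // k
    loss = nn.MSELoss()
    val_losses = []
    for i in range(k):
        val = slice(i * fold_size, (i + 1) * fold_size)
        mask = torch.ones(X.shape[0], dtype=torch.bool)
        mask[val] = False                    # training rows = everything else
        net = nn.Linear(X.shape[1], 1, bias=False)
        opt = torch.optim.SGD(net.parameters(), lr=lr)
        for _ in range(num_epochs):          # full-batch gradient descent
            opt.zero_grad()
            loss(net(X[mask]), y[mask].reshape(-1, 1)).backward()
            opt.step()
        with torch.no_grad():
            val_losses.append(loss(net(X[val]), y[val].reshape(-1, 1)).item())
    return sum(val_losses) / k

# Compare the three model capacities from the experiments above
for d in (2, 4, 20):
    print(d, k_fold_eval(poly_features[:n_train, :d], labels[:n_train]))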
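And on ensembling across random initializations: train the same architecture several times with different seeds and average their predictions. A sketch (fit_once is a hypothetical helper; averaging predictions is assumed to be the ensembling meant):

import torch
from torch import nn

def fit_once(seed, X, y, num_epochs=400, lr=0.01):
    torch.manual_seed(seed)  # different seed -> different random init
    net = nn.Linear(X.shape[1], 1, bias=False)
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.MSELoss()
    for _ in range(num_epochs):
        opt.zero_grad()
        loss(net(X), y.reshape(-1, 1)).backward()
        opt.step()
    return net

nets = [fit_once(s, poly_features[:n_train, :4], labels[:n_train])
        for s in range(5)]
with torch.no_grad():
    # The ensemble prediction is the mean of the individual predictions
    pred = torch.stack([net(poly_features[n_train:, :4]) for net in nets]).mean(0)

For this convex linear model the members converge to nearly the same weights, so the gain is small; the trick pays off mainly for non-convex networks.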