NLLLoss & CrossEntropyLoss

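While reading a paper today I came across the NLLLoss function. What is that? I looked it up, and it turns out to be closely related to CrossEntropyLoss: CrossEntropyLoss with class-index targets is LogSoftmax followed by NLLLoss. I'm writing my notes down here for future reference.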

Contents

NLLLoss
CrossEntropyLoss
LogSoftmax
Softmax
Sigmoid
Revisiting CrossEntropyLoss
- MSE, the loss function for linear regression
- CrossEntropy

NLLLoss

Official API documentation:
https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html?highlight=nllloss#torch.nn.NLLLoss


The negative log likelihood loss. It is useful to train a classification problem with C classes.


The unreduced (i.e. with reduction set to ‘none’) loss can be described as:

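$$\ell(x, y) = L = \{l_1, \dots, l_N\}^\top, \quad l_n = -w_{y_n} x_{n, y_n}$$

Here $x$ is the input, which is expected to contain log-probabilities; $y$ is the target, which is a class index rather than a one-hot vector; $w$ is the optional per-class weight, intended for unbalanced datasets; and $N$ is the batch size.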

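How to read $x_{n, y_n}$ in the formula:
The input $x$ has shape $(N, C)$. The subscript $n$ selects the $n$-th sample in the batch. The subscript $y_n$ is the $n$-th entry of the target, that is, the label of the $n$-th sample. Used as an index into $x$, it picks column $y_n$ of that row, which is the predicted log-probability of class $y_n$.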
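The indexing described above can be checked by hand. The following is a minimal sketch, assuming a small random batch (the sizes and the seed are arbitrary choices, not from the official docs): it picks $x_{n, y_n}$ for every row with advanced indexing and compares the result with `nn.NLLLoss(reduction="none")`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

N, C = 3, 5
# NLLLoss expects log-probabilities, so raw scores go through log_softmax first
log_probs = torch.log_softmax(torch.randn(N, C), dim=1)
target = torch.tensor([1, 0, 4])  # class indices, not one-hot vectors

# For each row n, take the entry in column y_n, then negate it
manual = -log_probs[torch.arange(N), target]

# With reduction="none" the built-in loss returns one value per sample
builtin = nn.NLLLoss(reduction="none")(log_probs, target)

print(manual)
print(builtin)
```

Both tensors should hold the same three per-sample losses.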

If reduction is not ‘none’ (default ‘mean’), then

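$$\ell(x, y) = \begin{cases} \sum_{n=1}^{N} \frac{1}{\sum_{n=1}^{N} w_{y_n}} l_n, & \text{if reduction} = \text{'mean'}; \\ \sum_{n=1}^{N} l_n, & \text{if reduction} = \text{'sum'}, \end{cases}$$

where $l_n$ is the unreduced loss of the $n$-th sample.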
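One detail that is easy to miss: with class weights, 'mean' divides by the sum of the weights of the targets, not by $N$. Here is a small sketch to confirm this, assuming hypothetical weights `[1.0, 2.0, 0.5]` that I made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

log_probs = torch.log_softmax(torch.randn(4, 3), dim=1)
target = torch.tensor([0, 2, 1, 2])
weight = torch.tensor([1.0, 2.0, 0.5])  # hypothetical per-class weights

# Per-sample weighted losses: l_n = -w[y_n] * x[n, y_n]
per_sample = nn.NLLLoss(weight=weight, reduction="none")(log_probs, target)

# 'mean' normalizes by the total weight of the targets, 'sum' just adds up
manual_mean = per_sample.sum() / weight[target].sum()
manual_sum = per_sample.sum()

builtin_mean = nn.NLLLoss(weight=weight, reduction="mean")(log_probs, target)
builtin_sum = nn.NLLLoss(weight=weight, reduction="sum")(log_probs, target)

print(manual_mean, builtin_mean)
print(manual_sum, builtin_sum)
```

Without a `weight` argument every $w_{y_n}$ is 1, so the 'mean' case reduces to an ordinary average over the batch.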

import torch
import torch.nn as nn

m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()
# input is of size N x C = 3 x 5
input = torch.randn(3, 5, requires_grad=True)
# each element in target has to have 0 <= value < C
target = torch.tensor([1, 0, 4])  # class-index labels for the three samples
output = loss(m(input), target)
output.backward()

# 2D loss example (used, for example, with image inputs)
N, C = 5, 4
loss = nn.NLLLoss()
# input is of size N x C x height x width
data = torch.randn(N, 16, 10, 10)
conv = nn.Conv2d(16, C, (3, 3)) # output:[N, 4, 8, 8]
m = nn.LogSoftmax(dim=1)
# each element in target has to have 0 <= value < C
target = torch.empty(N, 8, 8, dtype=torch.long).random_(0, C) # [N, 8, 8]
output = loss(m(conv(data)), target)
output.backward()

CrossEntropyLoss

Official API documentation:
https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss

This criterion computes the cross entropy loss between input and target.


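This part of the description is similar to NLLLoss. What follows is where the two differ: the documentation covers two cases, depending on the kind of target passed in.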

The target that this criterion expects should contain either:

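Class indices in the range $[0, C)$, where $C$ is the number of classes. The unreduced loss for this case can be described as:

$$\ell(x, y) = L = \{l_1, \dots, l_N\}^\top, \quad l_n = -w_{y_n} \log \frac{\exp(x_{n, y_n})}{\sum_{c=1}^{C} \exp(x_{n, c})}$$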

Note that this case is equivalent to the combination of LogSoftmax and NLLLoss.

In this case the target is a class index, exactly as for NLLLoss, so CrossEntropyLoss here is equivalent to Softmax + Log + NLLLoss applied in sequence.
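The equivalence is easy to verify numerically. A minimal sketch, assuming random logits (shapes and seed are arbitrary): the same input goes through `nn.CrossEntropyLoss`, through `LogSoftmax` + `NLLLoss`, and through explicit softmax, log, and `NLLLoss`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

logits = torch.randn(3, 5)        # raw scores, N x C
target = torch.tensor([1, 0, 4])  # class indices

# 1) CrossEntropyLoss takes raw logits directly
ce = nn.CrossEntropyLoss()(logits, target)

# 2) LogSoftmax followed by NLLLoss
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)

# 3) Softmax, then log, then NLLLoss (fine for small logits, less stable in general)
step_by_step = nn.NLLLoss()(torch.log(torch.softmax(logits, dim=1)), target)

print(ce, nll, step_by_step)
```

All three printed values should agree up to floating-point error. In practice this means a model trained with `nn.CrossEntropyLoss` should output raw logits, without a softmax layer of its own.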

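Probabilities for each class. The unreduced loss for this case can be described as:

$$\ell(x, y) = L = \{l_1, \dots, l_N\}^\top, \quad l_n = -\sum_{c=1}^{C} w_c \log \frac{\exp(x_{n, c})}{\sum_{i=1}^{C} \exp(x_{n, i})} y_{n, c}$$

In this case the target is a probability distribution over the classes for each sample (soft labels, for example values produced by a softmax) rather than a single index, so this case has no NLLLoss counterpart.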

import torch
import torch.nn as nn

# Example of target with class indices
loss = nn.CrossEntropyLoss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.empty(3, dtype=torch.long).random_(5)
output = loss(input, target)
output.backward()

# Example of target with class probabilities
input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5).softmax(dim=1)
output = loss(input, target)
output.backward()
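To make the class-probabilities case concrete, here is a rough sketch that recomputes it by hand. It assumes a PyTorch version that accepts probability targets in `nn.CrossEntropyLoss` (1.10 or later, as far as I know); the shapes and seed are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

logits = torch.randn(3, 5)
soft_target = torch.randn(3, 5).softmax(dim=1)  # each row sums to 1

builtin = nn.CrossEntropyLoss()(logits, soft_target)

# l_n = -sum_c y[n, c] * log_softmax(x)[n, c], averaged over the batch
manual = -(soft_target * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()

# A one-hot probability target reproduces the class-index result
hard_target = torch.tensor([1, 0, 4])
one_hot = F.one_hot(hard_target, num_classes=5).float()
loss_from_one_hot = nn.CrossEntropyLoss()(logits, one_hot)
loss_from_index = nn.CrossEntropyLoss()(logits, hard_target)

print(builtin, manual)
print(loss_from_one_hot, loss_from_index)
```

So the class-index case can be seen as the special case where the target distribution puts all its mass on one class.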

LogSoftmax

Official API documentation:
https://pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html#torch.nn.LogSoftmax

Applies the $\log(\text{Softmax}(x))$ function to an n-dimensional input Tensor. The LogSoftmax formulation can be simplified as:

$$\text{LogSoftmax}(x_i) = \log\left(\frac{\exp(x_i)}{\sum_j \exp(x_j)}\right)$$

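import torch
import torch.nn as nn

m = nn.LogSoftmax(dim=1)  # pass dim explicitly; relying on the implicit default is deprecated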
input = torch.randn(2, 3)
output = m(input)
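Why have a dedicated LogSoftmax instead of calling log on a Softmax output? The sketch below, using input values I picked for illustration, compares the built-in with the algebraically equal form $x_i - \log\sum_j \exp(x_j)$ and with the naive two-step computation on large inputs.

```python
import torch
import torch.nn as nn

# Second row is the first row shifted by a large constant (values chosen for illustration)
x = torch.tensor([[1.0, 2.0, 3.0],
                  [1000.0, 1001.0, 1002.0]])

builtin = nn.LogSoftmax(dim=1)(x)

# Same formula rewritten as x_i - log(sum_j exp(x_j))
simplified = x - torch.logsumexp(x, dim=1, keepdim=True)

# Naive version: exp(1000) overflows to inf in float32, so the second row becomes nan
naive = torch.log(torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True))

print(builtin)
print(simplified)
print(naive)
```

Softmax is invariant to adding a constant to every input, so both rows of `builtin` should be (almost) identical, while the naive version breaks on the second row.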

Softmax

Official API documentation:
https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html?highlight=softmax#torch.nn.Softmax


Applies the Softmax function to an n-dimensional input Tensor rescaling them so that the elements of the n-dimensional output Tensor lie in the range [0,1] and sum to 1.

Softmax is defined as:
$$\text{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$

import torch
import torch.nn as nn

m = nn.Softmax(dim=1)
input = torch.randn(2, 3)
output = m(input)
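As a quick sanity check of the definition, this small sketch (random input, arbitrary seed) computes softmax by hand and confirms that every row lies in [0, 1] and sums to 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

x = torch.randn(2, 3)

builtin = nn.Softmax(dim=1)(x)

# Direct translation of the definition: exp(x_i) / sum_j exp(x_j) along dim=1
manual = torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True)

row_sums = builtin.sum(dim=1)

print(builtin)
print(row_sums)
```

The `dim=1` argument means the normalization runs across the class dimension, so each sample (row) gets its own distribution.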

Sigmoid

Official API documentation:
https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html?highlight=sigmoid#torch.nn.Sigmoid

Applies the element-wise function:

$$\text{Sigmoid}(x) = \sigma(x) = \frac{1}{1 + \exp(-x)}$$

import torch
import torch.nn as nn

m = nn.Sigmoid()
input = torch.randn(2)
output = m(input)
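The element-wise formula can be checked the same way. The sketch below (random input, arbitrary seed) also shows how sigmoid relates to softmax: $\sigma(x)$ equals the first component of a two-class softmax over the scores $[x, 0]$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

x = torch.randn(4)

builtin = nn.Sigmoid()(x)

# Direct translation of the definition: 1 / (1 + exp(-x))
manual = 1.0 / (1.0 + torch.exp(-x))

# Two-class softmax over [x, 0]: exp(x) / (exp(x) + 1), which is the same function
pairs = torch.stack([x, torch.zeros_like(x)], dim=1)  # shape (4, 2)
two_class = torch.softmax(pairs, dim=1)[:, 0]

print(builtin)
print(two_class)
```

This is why sigmoid is the usual choice for binary classification and softmax for the multi-class case.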

Revisiting CrossEntropyLoss

Cross entropy is a common concept in deep learning, generally used to measure the gap between the target and the predicted value.

MSE, the loss function for linear regression

In linear regression problems, MSE (Mean Squared Error) is commonly used as the loss function, for example:

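$$\text{loss} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$$

Here $m$ is the number of samples, $y_i$ is the true value and $\hat{y}_i$ the prediction for sample $i$, and the loss is the mean of the per-sample losses. MSE works well for linear regression, but does it work equally well for logistic classification?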
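For reference, a tiny sketch of MSE as a mean over $m$ samples, assuming four made-up values of my own, checked against `nn.MSELoss`.

```python
import torch
import torch.nn as nn

# Made-up targets and predictions for m = 4 samples
y_true = torch.tensor([1.0, 2.0, 3.0, 4.0])
y_pred = torch.tensor([1.5, 1.5, 3.0, 5.0])

# Mean of the squared errors: (0.25 + 0.25 + 0 + 1) / 4
manual = ((y_true - y_pred) ** 2).mean()

builtin = nn.MSELoss()(y_pred, y_true)

print(manual.item())   # 0.375
print(builtin.item())  # 0.375
```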

Why not use MSE for classification tasks?

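The main reason is that in classification, when sigmoid/softmax is used to produce probabilities and is paired with an MSE loss, gradient descent learns very slowly at the start of training, when the probability assigned to the correct class is often very low.
Regression has to fit actual values, so measuring the error between prediction and ground truth with MSE and minimizing it by gradient descent works well. Classification is different: it needs an activation function (sigmoid, softmax) to map the prediction into the range 0 to 1.
If a classification task uses sigmoid, the gradient is close to 0 whenever the output saturates near 0 or 1, which is the vanishing-gradient problem.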
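The slow-learning effect can be seen directly with autograd. This is a minimal sketch under my own assumptions: a single positive example (target 1) whose logit is badly wrong (a hypothetical value of -10), with the gradient with respect to the logit compared under MSE and under binary cross entropy.

```python
import torch
import torch.nn.functional as F

z = torch.tensor(-10.0, requires_grad=True)  # hypothetical, confidently wrong logit
y = torch.tensor(1.0)                        # true label

# Sigmoid + MSE: gradient is 2 * (p - y) * p * (1 - p), tiny when p is saturated
p = torch.sigmoid(z)
mse = (p - y) ** 2
(grad_mse,) = torch.autograd.grad(mse, z)

# Sigmoid + cross entropy (computed from the logit): gradient is p - y, close to -1 here
bce = F.binary_cross_entropy_with_logits(z, y)
(grad_bce,) = torch.autograd.grad(bce, z)

print(grad_mse.item())
print(grad_bce.item())
```

With MSE the factor $p(1 - p)$ from the sigmoid derivative crushes the gradient even though the prediction is badly wrong. With cross entropy that factor cancels, so the gradient stays proportional to the error $p - y$.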