Derivative of a vector norm: let $\bm{x}\in \mathbb{R}^{d\times 1}$ be a column vector, let $||\bm{x}||$ denote its norm, and let $\hat{\bm{x}}$ denote the vector after L2 normalization, which has unit length.
$$\frac{\partial}{\partial \bm{x}}(||\bm{x}||)=\frac{\partial }{\partial \bm{x}}\sqrt{\bm{x}^T \bm{x}}=\frac{1}{2}\frac{2\bm{x}^T}{\sqrt{\bm{x}^T \bm{x}}}=\frac{\bm{x}^T}{||\bm{x}||}=\hat{\bm{x}}^T$$
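This identity is easy to sanity-check numerically with PyTorch autograd (the test vector below is arbitrary):

```python
import torch

# Check d||x||/dx = x_hat numerically on an arbitrary vector
x = torch.tensor([3.0, 4.0], requires_grad=True)
norm = torch.linalg.norm(x)  # ||x|| = 5
norm.backward()
x_hat = x.detach() / norm.detach()
assert torch.allclose(x.grad, x_hat)  # grad = [0.6, 0.8]
```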
Derivative of the normalized vector:
$$\frac{\partial}{\partial \bm{x}}(\hat{\bm{x}})=\frac{\partial}{\partial \bm{x}}\left(\frac{\bm{x}}{||\bm{x}||}\right)=\frac{\partial}{\partial \bm{x}}\left(\bm{x}\frac{1}{||\bm{x}||}\right)\\
=\frac{1}{||\bm{x}||}\frac{\partial}{\partial \bm{x}}(\bm{x})+\bm{x}\frac{\partial}{\partial \bm{x}}\left(\frac{1}{||\bm{x}||}\right)\\
=\frac{I}{||\bm{x}||}-\bm{x}\frac{\hat{\bm{x}}^T}{||\bm{x}||^2}\\
=\frac{||\bm{x}||I-\bm{x}\hat{\bm{x}}^T}{||\bm{x}||^2}\\
=\frac{I-\hat{\bm{x}}\hat{\bm{x}}^T}{||\bm{x}||}$$
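The Jacobian $(I-\hat{\bm{x}}\hat{\bm{x}}^T)/||\bm{x}||$ can likewise be checked with `torch.autograd.functional.jacobian` (arbitrary test vector):

```python
import torch
from torch.autograd.functional import jacobian

# Jacobian of x_hat = x / ||x|| should be (I - x_hat x_hat^T) / ||x||
x = torch.tensor([3.0, 4.0])
J = jacobian(lambda v: v / torch.linalg.norm(v), x)
norm = torch.linalg.norm(x)
x_hat = x / norm
expected = (torch.eye(2) - torch.outer(x_hat, x_hat)) / norm
assert torch.allclose(J, expected)
```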
For an anchor vector $q$, let the contrast vectors be $\{k_i\}_{i=0}^K$, where $k_0$ is the positive sample and $\{k_i\}_{i=1}^K$ are the $K$ negative samples. The cross-entropy loss is:
$$L=-\log\frac{\exp(q^Tk_0/\tau)}{\sum_{i=0}^K\exp(q^Tk_i/\tau)}$$
Here the logits can be taken as $z=[q^Tk_0/\tau,q^Tk_1/\tau,\cdots,q^Tk_K/\tau]$, where $\tau$ is a temperature coefficient; it can either be made a learnable variable or set to a constant.
By the standard rule for the gradient of softmax cross-entropy, we get:
$$\frac{\partial L}{\partial z_i}=p_i-y_i$$
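This $p-y$ form of the softmax cross-entropy gradient can be verified directly (a single sample with hypothetical logits):

```python
import torch
import torch.nn.functional as F

# Verify dL/dz = p - y for softmax cross-entropy, target class 0
z = torch.tensor([2.0, 1.0, 0.5], requires_grad=True)
loss = F.cross_entropy(z.unsqueeze(0), torch.tensor([0]))
loss.backward()
p = F.softmax(z.detach(), dim=0)
y = F.one_hot(torch.tensor(0), num_classes=3).float()
assert torch.allclose(z.grad, p - y)
```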
where $p_i=\frac{\exp(q^Tk_i/\tau)}{\sum_{j=0}^K\exp(q^Tk_j/\tau)}$ is the probability after softmax normalization and $y_i$ is the one-hot label. The gradient with respect to the anchor sample $q$ is:
$$\frac{\partial L}{\partial q}=\sum_{i=0}^{K}\left(\frac{\partial L}{\partial z_i}\frac{\partial z_i}{\partial q}\right)=\sum_{i=0}^{K}(p_i-y_i)k_i/\tau$$
The gradient with respect to the contrast vectors is:
$$\frac{\partial L}{\partial k_i}=\frac{\partial L}{\partial z_i}\frac{\partial z_i}{\partial k_i}=(p_i-y_i)q/\tau$$
The gradient with respect to the temperature coefficient is:
$$\frac{\partial L}{\partial \tau}=\sum_{i=0}^{K}\frac{\partial L}{\partial z_i}\frac{\partial z_i}{\partial \tau}=\sum_{i=0}^K(p_i-y_i)q^Tk_i(-1/\tau^2)=\sum_{i=0}^K(y_i-p_i)z_i/\tau$$
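The three closed-form gradients above can be checked against autograd for a single anchor (random data in double precision; the shapes and seed are arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K, d = 4, 8
q = torch.randn(d, dtype=torch.float64, requires_grad=True)
k = torch.randn(K + 1, d, dtype=torch.float64, requires_grad=True)  # k[0] is the positive
tau = torch.tensor(0.1, dtype=torch.float64, requires_grad=True)

z = k @ q / tau                                   # logits, shape (K+1,)
loss = F.cross_entropy(z.unsqueeze(0), torch.tensor([0]))
loss.backward()

qd, kd = q.detach(), k.detach()
zd = kd @ qd / 0.1
p = F.softmax(zd, dim=0)
y = F.one_hot(torch.tensor(0), num_classes=K + 1).to(torch.float64)

assert torch.allclose(q.grad, (p - y) @ kd / 0.1)            # dL/dq
assert torch.allclose(k.grad, torch.outer(p - y, qd) / 0.1)  # dL/dk_i
assert torch.allclose(tau.grad, ((y - p) * zd).sum() / 0.1)  # dL/dtau
```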
PyTorch implementation:
import torch
import torch.nn.functional as F

# Anchor vectors (batch of 2 queries, embedding size 3)
p = torch.tensor([[0.1, 0.2, 0.3],
                  [0.5, 0.6, 0.8]], requires_grad=True)
# Contrast vectors: row 0 is the positive, the rest are negatives
k = torch.tensor([[0.4, 0.5, 0.7],
                  [0.6, 0.8, 0.6],
                  [0.6, 0.8, 0.6]], requires_grad=True)
tau = torch.tensor(0.01, requires_grad=True)
targets = torch.tensor([0, 0])  # the positive is at index 0

class CrossEntropyLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, p, k, tau, targets):
        logits = p @ k.T / tau
        targets = F.one_hot(targets, num_classes=logits.size(1)).float()
        prob = F.softmax(logits, 1)
        ctx.save_for_backward(logits, prob, targets, p, k, tau)
        logits = F.log_softmax(logits, 1)
        loss = -(targets * logits).sum(1).mean()
        return loss

    @staticmethod
    def backward(ctx, grad_output):
        logits, prob, targets, p, k, tau = ctx.saved_tensors
        # dL/dq = sum_i (p_i - y_i) k_i / tau, averaged over the batch
        grad_p = grad_output * (prob - targets) @ k / tau / targets.size(0)
        embed_size = p.size(1)
        # dL/dk_i = (p_i - y_i) q / tau, accumulated over the batch;
        # equivalent to (prob - targets).t() @ p / tau / N
        prob_targets_repeat = (prob - targets).t().repeat(1, embed_size).view(-1, embed_size, p.size(0))
        grad_k = grad_output * (prob_targets_repeat * (p.t() / tau).unsqueeze(0)).sum(-1) / targets.size(0)
        # dL/dtau = sum_i (y_i - p_i) z_i / tau, averaged over the batch
        tau_grad = grad_output * torch.sum((targets - prob) * logits / tau, dim=1).mean()
        grad_targets = None
        return grad_p, grad_k, tau_grad, grad_targets

loss = CrossEntropyLoss.apply(p, k, tau, targets)
loss.backward()
print(p.grad)
print(k.grad)
print(tau.grad)
Output:
tensor([[ 9.9664, 14.9496, -4.9832],
[10.0000, 15.0000, -5.0000]])
tensor([[-29.9832, -39.9664, -54.9496],
[ 14.9916, 19.9832, 27.4748],
[ 14.9916, 19.9832, 27.4748]])
tensor(-1249.1605)
Verification with PyTorch's built-in function:
import torch
import torch.nn.functional as F

p = torch.tensor([[0.1, 0.2, 0.3],
                  [0.5, 0.6, 0.8]], requires_grad=True)
k = torch.tensor([[0.4, 0.5, 0.7],
                  [0.6, 0.8, 0.6],
                  [0.6, 0.8, 0.6]], requires_grad=True)
tau = torch.tensor(0.01, requires_grad=True)
logits = p @ k.T / tau
targets = torch.tensor([0, 0])
loss = F.cross_entropy(logits, targets)
loss.backward()
print(p.grad)
print(k.grad)
print(tau.grad)
Output:
tensor([[ 9.9664, 14.9496, -4.9832],
[10.0000, 15.0000, -5.0000]])
tensor([[-29.9832, -39.9664, -54.9496],
[ 14.9916, 19.9832, 27.4748],
[ 14.9916, 19.9832, 27.4748]])
tensor(-1249.1606)
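As an extra check, `torch.autograd.gradcheck` can numerically verify the whole pipeline in double precision (random inputs; a larger τ is used here to keep the finite differences well-conditioned):

```python
import torch
import torch.nn.functional as F
from torch.autograd import gradcheck

torch.manual_seed(0)
p = torch.randn(2, 3, dtype=torch.float64, requires_grad=True)
k = torch.randn(3, 3, dtype=torch.float64, requires_grad=True)
tau = torch.tensor(0.5, dtype=torch.float64, requires_grad=True)
targets = torch.tensor([0, 0])

# Compare autograd's analytic gradients with numerical finite differences
f = lambda p, k, tau: F.cross_entropy(p @ k.T / tau, targets)
assert gradcheck(f, (p, k, tau), eps=1e-6, atol=1e-4)
```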