• Backpropagation gradient derivation for a two-layer fully connected network (matrix form, sigmoid, mean squared error loss)


    Solving for Derivatives

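    The body text is in English (switching between Chinese and English input methods while writing prose and typing formulas was too painful -_-), but the wording is kept very simple. I hope you can read through it, and even more that it helps you!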

    $$
    \begin{aligned}
    h &= XW_1 + b_1 \\
    h_{sigmoid} &= sigmoid(h) \\
    Y_{pred} &= h_{sigmoid} W_2 + b_2 \\
    f &= ||Y - Y_{pred}||_F^2
    \end{aligned} \tag{1}
    $$
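The forward pass in Eq.(1) can be sketched in NumPy as follows. The layer sizes, the random data, and the helper names `sigmoid` and `forward` are assumptions made for illustration; the biases are assumed to be row vectors that NumPy broadcasts over the rows (samples) of `X`.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2, Y):
    """Forward pass of Eq.(1); returns the intermediates and the loss f."""
    h = X @ W1 + b1                  # (N, hidden); b1 is broadcast over rows
    h_sigmoid = sigmoid(h)           # (N, hidden)
    Y_pred = h_sigmoid @ W2 + b2     # (N, out); b2 is broadcast over rows
    f = np.sum((Y - Y_pred) ** 2)    # squared Frobenius norm
    return h, h_sigmoid, Y_pred, f

# Hypothetical sizes, chosen only for illustration.
N, n_in, n_hidden, n_out = 5, 3, 4, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(N, n_in))
Y = rng.normal(size=(N, n_out))
W1 = rng.normal(size=(n_in, n_hidden))
b1 = rng.normal(size=(1, n_hidden))
W2 = rng.normal(size=(n_hidden, n_out))
b2 = rng.normal(size=(1, n_out))

h, h_sigmoid, Y_pred, f = forward(X, W1, b1, W2, b2, Y)
print(h.shape, h_sigmoid.shape, Y_pred.shape, f)
```

With this row-per-sample layout, every intermediate keeps `N` rows, which is what makes the transposes in the later gradient formulas line up.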

    Solve for the derivatives of $f$ with respect to the following variables.
    $$
    \frac{\partial f}{\partial W_2} \,\,\, \frac{\partial f}{\partial b_2} \,\,\, \frac{\partial f}{\partial W_1} \,\,\, \frac{\partial f}{\partial b_1} \tag{2}
    $$

    Derivation of the derivatives

    $$
    f = ||Y-Y_{pred}||^2_F = tr((Y-Y_{pred})^T(Y-Y_{pred})) \tag{3}
    $$

    $$
    \begin{aligned}
    \mathrm{d}f &= \mathrm{d}\{tr[(Y-Y_{pred})^T(Y-Y_{pred})]\} \\
    &= tr\{\mathrm{d}[(Y-Y_{pred})^T(Y-Y_{pred})]\} \\
    &= tr\{[\mathrm{d}(Y-Y_{pred})^T](Y-Y_{pred}) + (Y-Y_{pred})^T \mathrm{d}(Y-Y_{pred})\} \\
    &= tr[-(\mathrm{d}Y_{pred}^T)(Y-Y_{pred}) - (Y-Y_{pred})^T \mathrm{d}Y_{pred}] \\
    &= 2\, tr[(Y_{pred}-Y)^T \mathrm{d}Y_{pred}]
    \end{aligned} \tag{4}
    $$

    where the label matrix $Y$ is constant, so $\mathrm{d}Y = 0$. The last step of Eq.(4) uses $tr(A^T) = tr(A)$: the matrix $(\mathrm{d}Y_{pred}^T)(Y-Y_{pred})$ is the transpose of $(Y-Y_{pred})^T \mathrm{d}Y_{pred}$, so the two terms inside the trace contribute the same value.

    According to the relationship between gradient and differential (The relationship between matrix differentiation and derivatives), we can obtain $\frac{\partial f}{\partial Y_{pred}}$ as follows.
    $$
    \frac{\partial f}{\partial Y_{pred}} = 2(Y_{pred} - Y) \tag{5}
    $$
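As a quick sanity check of Eq.(5), the sketch below compares $2(Y_{pred} - Y)$ with a central finite-difference estimate of the gradient. The matrix sizes, the random values, and the helper name `loss` are assumptions for illustration only.

```python
import numpy as np

def loss(Y, Y_pred):
    """f = ||Y - Y_pred||_F^2, as in Eq.(3)."""
    return np.sum((Y - Y_pred) ** 2)

rng = np.random.default_rng(1)
Y = rng.normal(size=(4, 3))
Y_pred = rng.normal(size=(4, 3))

# Analytic gradient from Eq.(5).
analytic = 2.0 * (Y_pred - Y)

# Central finite differences, one entry of Y_pred at a time.
eps = 1e-6
numeric = np.zeros_like(Y_pred)
for i in range(Y_pred.shape[0]):
    for j in range(Y_pred.shape[1]):
        E = np.zeros_like(Y_pred)
        E[i, j] = eps
        numeric[i, j] = (loss(Y, Y_pred + E) - loss(Y, Y_pred - E)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)  # should be tiny compared with the gradient entries
```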

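    Assume $b_2$ is a $1 \times out$ row vector added to every row of $h_{sigmoid} W_2$, i.e. $Y_{pred} = h_{sigmoid} W_2 + \boldsymbol{1} b_2$. Then $\mathrm{d}Y_{pred} = \boldsymbol{1}\, \mathrm{d}b_2$ when only $b_2$ varies, and $\mathrm{d}f = tr\left(\frac{\partial f}{\partial Y_{pred}}^T \boldsymbol{1}\, \mathrm{d}b_2\right) = tr\left(\left(\boldsymbol{1}^T \frac{\partial f}{\partial Y_{pred}}\right)^T \mathrm{d}b_2\right)$, so
    $$
    \frac{\partial f}{\partial b_2} = \boldsymbol{1}^T \frac{\partial f}{\partial Y_{pred}} = 2 \, \boldsymbol{1}^T (Y_{pred} - Y) \tag{6}
    $$

    where $\boldsymbol{1}$ is a column vector of ones of shape $N \times 1$, and $N$ is the number of rows (samples) of $X$. In other words, the gradient with respect to $b_2$ sums $\frac{\partial f}{\partial Y_{pred}}$ over the samples.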
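The role of the ones vector in the bias gradient can be illustrated with a small NumPy sketch. It assumes $b_2$ is stored as a $1 \times out$ row vector that is added to every row, which is the same as adding $\boldsymbol{1} b_2$ with $\boldsymbol{1}$ an $N \times 1$ column of ones; under that assumption the bias gradient is the column-wise sum of $\frac{\partial f}{\partial Y_{pred}}$. All sizes and values are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_hidden, n_out = 5, 4, 2
h_sigmoid = rng.uniform(size=(N, n_hidden))   # stand-in for the hidden activations
W2 = rng.normal(size=(n_hidden, n_out))
b2 = rng.normal(size=(1, n_out))              # assumed row-vector bias
Y = rng.normal(size=(N, n_out))
ones = np.ones((N, 1))                        # the N x 1 ones vector

# Broadcasting b2 over the rows is the same as adding ones @ b2.
Y_pred_broadcast = h_sigmoid @ W2 + b2
Y_pred_explicit = h_sigmoid @ W2 + ones @ b2

# Gradient w.r.t. Y_pred (Eq.(5)) and the resulting bias gradient.
G = 2.0 * (Y_pred_broadcast - Y)
grad_b2 = ones.T @ G                          # shape (1, n_out), same as b2

print(np.allclose(Y_pred_broadcast, Y_pred_explicit))
print(np.allclose(grad_b2, G.sum(axis=0, keepdims=True)))
```

In practice `G.sum(axis=0, keepdims=True)` is the usual way to write this, since it avoids building the ones vector explicitly.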

    $$
    \begin{aligned}
    \mathrm{d}f &= tr\left(\frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}Y_{pred}\right) \\
    &= tr\left(\frac{\partial f}{\partial Y_{pred}}^T h_{sigmoid}\, \mathrm{d}W_2\right) \\
    &= tr\left(\left(h_{sigmoid}^T \frac{\partial f}{\partial Y_{pred}}\right)^T \mathrm{d}W_2\right)
    \end{aligned} \tag{7}
    $$

    where the derivation of $\mathrm{d}Y_{pred}$ (with only $W_2$ varying) is shown in Eq.(8).
    $$
    \begin{aligned}
    \mathrm{d}Y_{pred} &= \mathrm{d}(h_{sigmoid} W_2 + b_2) \\
    &= \mathrm{d}(h_{sigmoid} W_2) \\
    &= (\mathrm{d}h_{sigmoid}) W_2 + h_{sigmoid}\, \mathrm{d}W_2 \\
    &= h_{sigmoid}\, \mathrm{d}W_2
    \end{aligned} \tag{8}
    $$
    Here $b_2$ is held fixed, so $\mathrm{d}b_2 = 0$, and $h_{sigmoid}$ is not a function of $W_2$, so $\mathrm{d}h_{sigmoid} = 0$.

    $$
    \frac{\partial f}{\partial W_2} = h_{sigmoid}^T \frac{\partial f}{\partial Y_{pred}} = 2 h_{sigmoid}^T (Y_{pred} - Y) \tag{9}
    $$

    $$
    h_{sigmoid} = sigmoid(h) = \frac{1}{1+e^{-h}} \tag{10}
    $$

    $$
    \begin{aligned}
    \mathrm{d}f &= tr\left(\frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}Y_{pred}\right) \\
    &= tr\left(\frac{\partial f}{\partial Y_{pred}}^T (\mathrm{d}h_{sigmoid}) W_2\right) \\
    &= tr\left(W_2 \frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}h_{sigmoid}\right) \\
    &= tr\left\{\left[\frac{\partial f}{\partial Y_{pred}} W_2^T\right]^T \left(h_{sigmoid} \circ (1-h_{sigmoid}) \circ \mathrm{d}h\right)\right\} \\
    &= tr\left\{\left[\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]^T \mathrm{d}h\right\}
    \end{aligned} \tag{11}
    $$

    where the differential of $sigmoid$ is given in Eq.(12), and the step from the fourth to the fifth line of Eq.(11) is based on $tr(A^T(B\circ C)) = tr((A \circ B)^T C)$. The matrix product $\frac{\partial f}{\partial Y_{pred}} W_2^T$ is evaluated before the element-wise products.

    $$
    \mathrm{d} h_{sigmoid} = h_{sigmoid} \circ (1-h_{sigmoid}) \circ \mathrm{d}h \tag{12}
    $$

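    Assume $b_1$ is a $1 \times hidden$ row vector added to every row of $XW_1$, i.e. $h = XW_1 + \boldsymbol{1} b_1$, so that $\mathrm{d}h = \boldsymbol{1}\, \mathrm{d}b_1$ when only $b_1$ varies. Reading $\frac{\partial f}{\partial h}$ off Eq.(11) gives
    $$
    \begin{aligned}
    \frac{\partial f}{\partial b_1} &= \boldsymbol{1}^T \frac{\partial f}{\partial h} \\
    &= \boldsymbol{1}^T \left[\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right] \\
    &= 2 \, \boldsymbol{1}^T \left[\left((Y_{pred}-Y) W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]
    \end{aligned} \tag{13}
    $$

    where $\boldsymbol{1}$ is a column vector of ones of shape $N \times 1$, and $N$ is the number of rows (samples) of $X$.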

    $$
    \begin{aligned}
    \mathrm{d}f &= tr\left(\frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}Y_{pred}\right) \\
    &= tr\left(\frac{\partial f}{\partial Y_{pred}}^T (\mathrm{d}h_{sigmoid}) W_2\right) \\
    &= tr\left(W_2 \frac{\partial f}{\partial Y_{pred}}^T \mathrm{d}h_{sigmoid}\right) \\
    &= tr\left(\left[\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]^T \mathrm{d}h\right) \\
    &= tr\left(\left[\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]^T X\, \mathrm{d}W_1\right) \\
    &= tr\left\{\left[X^T \left(\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right)\right]^T \mathrm{d}W_1\right\}
    \end{aligned} \tag{14}
    $$

    $$
    \begin{aligned}
    \frac{\partial f}{\partial W_1} &= X^T \left[\left(\frac{\partial f}{\partial Y_{pred}} W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right] \\
    &= 2 X^T \left[\left((Y_{pred}-Y) W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]
    \end{aligned} \tag{15}
    $$

    In summary, the derivative expressions of each variable are as follows.

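    $$
    \begin{aligned}
    \frac{\partial f}{\partial W_2} &= 2 h_{sigmoid}^T (Y_{pred}-Y) \\
    \frac{\partial f}{\partial b_2} &= 2 \, \boldsymbol{1}^T (Y_{pred}-Y) \\
    \frac{\partial f}{\partial W_1} &= 2 X^T \left[\left((Y_{pred}-Y) W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right] \\
    \frac{\partial f}{\partial b_1} &= 2 \, \boldsymbol{1}^T \left[\left((Y_{pred}-Y) W_2^T\right) \circ h_{sigmoid} \circ (1-h_{sigmoid})\right]
    \end{aligned} \tag{16}
    $$

    where $\boldsymbol{1}$ is an $N \times 1$ column vector of ones ($N$ is the number of rows of $X$), assuming each bias is a row vector added to every row, and the matrix product $(Y_{pred}-Y) W_2^T$ is evaluated before the element-wise products.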
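The four gradients can be checked numerically. The sketch below is one possible NumPy implementation, not the author's code: it assumes the biases are row vectors broadcast over the $N$ samples (so the bias gradients sum over the sample axis), and it compares the analytic gradients with central finite differences. The function names, sizes, and random data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(params, X, Y):
    """f from Eq.(1) for params = [W1, b1, W2, b2]."""
    W1, b1, W2, b2 = params
    Y_pred = sigmoid(X @ W1 + b1) @ W2 + b2
    return np.sum((Y - Y_pred) ** 2)

def gradients(params, X, Y):
    """Analytic gradients, returned in the same order as params."""
    W1, b1, W2, b2 = params
    h_sigmoid = sigmoid(X @ W1 + b1)
    Y_pred = h_sigmoid @ W2 + b2
    G = 2.0 * (Y_pred - Y)                             # df/dY_pred, Eq.(5)
    dW2 = h_sigmoid.T @ G                              # Eq.(9)
    db2 = G.sum(axis=0, keepdims=True)                 # sum over samples
    dh = (G @ W2.T) * h_sigmoid * (1.0 - h_sigmoid)    # df/dh, from Eq.(11)
    dW1 = X.T @ dh                                     # Eq.(15)
    db1 = dh.sum(axis=0, keepdims=True)                # sum over samples
    return [dW1, db1, dW2, db2]

def numerical_gradient(params, X, Y, eps=1e-6):
    """Central finite differences, perturbing one entry at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            f_plus = loss(params, X, Y)
            p[idx] = old - eps
            f_minus = loss(params, X, Y)
            p[idx] = old                               # restore the entry
            g[idx] = (f_plus - f_minus) / (2 * eps)
        grads.append(g)
    return grads

# Hypothetical sizes and data, for illustration only.
rng = np.random.default_rng(3)
N, n_in, n_hidden, n_out = 6, 3, 4, 2
X = rng.normal(size=(N, n_in))
Y = rng.normal(size=(N, n_out))
params = [rng.normal(size=(n_in, n_hidden)), rng.normal(size=(1, n_hidden)),
          rng.normal(size=(n_hidden, n_out)), rng.normal(size=(1, n_out))]

analytic = gradients(params, X, Y)
numeric = numerical_gradient(params, X, Y)
max_errors = [float(np.max(np.abs(a - n))) for a, n in zip(analytic, numeric)]
print(max_errors)  # one maximum absolute error per parameter
```

Note that `*` is the Hadamard product and `@` the matrix product, so `(G @ W2.T) * h_sigmoid * (1.0 - h_sigmoid)` makes the evaluation order explicit.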

    Reference formulas

    Basic differential formula

    $$
    \begin{aligned}
    \mathrm{d}(X \pm Y) &= \mathrm{d}X \pm \mathrm{d}Y \\
    \mathrm{d}(XY) &= (\mathrm{d}X)Y + X\,\mathrm{d}Y \\
    \mathrm{d}(X^T) &= (\mathrm{d}X)^T \\
    \mathrm{d}\,tr(X) &= tr(\mathrm{d}X)
    \end{aligned} \tag{17}
    $$
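The product rule for differentials can be illustrated numerically: for small perturbations, the exact change of $XY$ differs from $(\mathrm{d}X)Y + X\,\mathrm{d}Y$ only by the second-order term $\mathrm{d}X\,\mathrm{d}Y$. The shapes and the perturbation scale `t` below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 4))
Y = rng.normal(size=(4, 2))

t = 1e-5                                  # perturbation scale (assumed)
dX = t * rng.normal(size=X.shape)
dY = t * rng.normal(size=Y.shape)

exact_change = (X + dX) @ (Y + dY) - X @ Y
first_order = dX @ Y + X @ dY             # product rule for differentials
remainder = exact_change - first_order    # algebraically equal to dX @ dY

print(np.max(np.abs(first_order)), np.max(np.abs(remainder)))
```

The remainder scales with `t**2` while the first-order term scales with `t`, which is why it is dropped in the differential.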

    Element-wise formula

    $$
    \begin{aligned}
    \mathrm{d}(X \circ Y) &= \mathrm{d}X \circ Y + X \circ \mathrm{d}Y \\
    \mathrm{d}\sigma(X) &= \sigma^{\prime}(X) \circ \mathrm{d}X
    \end{aligned} \tag{18}
    $$

    where $\sigma$ is an element-wise function and $\sigma^{\prime}(X)$ is its element-wise derivative. See the following example. Note that $\circ$ denotes element-wise multiplication, i.e., the Hadamard product, which can also be written as $\odot$.

    $$
    \begin{aligned}
    X &= \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix} \\
    \mathrm{d}\sin(X) &= \begin{bmatrix} \cos(X_{11})\,\mathrm{d}X_{11} & \cos(X_{12})\,\mathrm{d}X_{12} \\ \cos(X_{21})\,\mathrm{d}X_{21} & \cos(X_{22})\,\mathrm{d}X_{22} \end{bmatrix} = \cos(X) \circ \mathrm{d}X
    \end{aligned} \tag{19}
    $$
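The same $2 \times 2$ example can be tried with concrete numbers. In the sketch below the entries of `X` and `dX` are arbitrary values picked for illustration; the exact change of $\sin(X)$ is compared with the first-order term $\cos(X) \circ \mathrm{d}X$.

```python
import numpy as np

X = np.array([[0.1, 0.2],
              [0.3, 0.4]])
dX = 1e-6 * np.array([[1.0, -2.0],
                      [0.5, 3.0]])

exact_change = np.sin(X + dX) - np.sin(X)
first_order = np.cos(X) * dX      # '*' is the Hadamard (element-wise) product

max_gap = np.max(np.abs(exact_change - first_order))
print(max_gap)                    # second order in dX, so far smaller than dX
```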

    The properties of the trace of matrix

    $$
    \begin{aligned}
    a &= tr(a) \\
    tr(A^T) &= tr(A) \\
    tr(A \pm B) &= tr(A) \pm tr(B) \\
    tr(AB) &= tr(BA) \\
    tr(A^T B) &= tr(B^T A) = \sum_{i,j} A_{ij} B_{ij} \\
    tr(A^T (B \circ C)) &= tr((A \circ B)^T C) = \sum_{i,j} A_{ij} B_{ij} C_{ij}
    \end{aligned} \tag{20}
    $$

    where $a$ is a scalar. In the fourth identity of Eq.(20), $A$ and $B^T$ have the same shape; in the sixth, $A$, $B$, and $C$ all have the same shape. Note that in the fifth identity $A$ and $B$ have the same shape, which differs from the fourth identity.
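These trace identities are easy to spot-check numerically. The sketch below uses random matrices with assumed shapes; `P` is a hypothetical extra matrix with the shape of `A.T`, needed for the $tr(AB) = tr(BA)$ identity.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(3, 4))
P = rng.normal(size=(4, 3))   # same shape as A.T

# tr(AP) = tr(PA): the two products are 3x3 and 4x4 but share a trace.
cyclic = np.isclose(np.trace(A @ P), np.trace(P @ A))

# tr(A^T B) = sum_ij A_ij B_ij
inner = np.isclose(np.trace(A.T @ B), np.sum(A * B))

# tr(A^T (B o C)) = tr((A o B)^T C) = sum_ij A_ij B_ij C_ij
hadamard = np.isclose(np.trace(A.T @ (B * C)), np.trace((A * B).T @ C))
triple = np.isclose(np.trace(A.T @ (B * C)), np.sum(A * B * C))

print(cyclic, inner, hadamard, triple)
```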

    Derivative of sigmoid, assuming the input is a matrix

    Let $\sigma: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m \times n}$ apply the $sigmoid$ function to each element.

    $$
    \sigma(X) = \frac{1}{1+\exp(-X)} \tag{21}
    $$

    $$
    \begin{aligned}
    \mathrm{d}\sigma(X) &= \frac{-\boldsymbol{1} \circ [\exp(-X) \circ \mathrm{d}(-X)]}{(\boldsymbol{1}+\exp(-X))^2} \\
    &= \frac{-\boldsymbol{1} \circ [\exp(-X) \circ (-\boldsymbol{1}) \circ \mathrm{d}X]}{(\boldsymbol{1}+\exp(-X))^2} \\
    &= \frac{\boldsymbol{1} \circ \exp(-X) \circ \mathrm{d}X}{(\boldsymbol{1}+\exp(-X))^2} \\
    &= \frac{\boldsymbol{1}}{\boldsymbol{1}+\exp(-X)} \circ \frac{\exp(-X) + \boldsymbol{1} - \boldsymbol{1}}{\boldsymbol{1}+\exp(-X)} \circ \mathrm{d}X \\
    &= \sigma(X) \circ (\boldsymbol{1}-\sigma(X)) \circ \mathrm{d}X
    \end{aligned} \tag{22}
    $$

    where $1$ and $\boldsymbol{1}$ both denote the all-ones matrix of the same shape as $X$, and the division, the square, and $\exp$ are applied element-wise.
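The result $\sigma^{\prime}(X) = \sigma(X) \circ (1 - \sigma(X))$ can be confirmed with a central finite difference. Because $\sigma$ acts on each entry independently, shifting every entry of `X` by the same `eps` yields all the element-wise derivatives at once. The matrix size and `eps` are assumptions made for this sketch.

```python
import numpy as np

def sigmoid(X):
    """Element-wise sigmoid, 1 / (1 + exp(-X))."""
    return 1.0 / (1.0 + np.exp(-X))

rng = np.random.default_rng(6)
X = rng.normal(size=(3, 2))

eps = 1e-6
numeric = (sigmoid(X + eps) - sigmoid(X - eps)) / (2 * eps)
analytic = sigmoid(X) * (1.0 - sigmoid(X))

max_gap = np.max(np.abs(numeric - analytic))
print(max_gap)
```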

    Differentiation and derivatives

    Derivative of a scalar with respect to a scalar

    $$
    \mathrm{d}f = f^{\prime}(x) \mathrm{d}x \tag{23}
    $$

    Derivative of a scalar with respect to a vector (multivariate differential)

    $$
    \mathrm{d} f = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \mathrm{d} x_i = \frac{\partial f}{\partial x}^T \mathrm{d} x \tag{24}
    $$

    As shown in Eq.(24), the total differential $\mathrm{d} f$ is the inner product of the gradient vector $\frac{\partial f}{\partial x}$ $(n \times 1)$ and the differential vector $\mathrm{d}x$ $(n \times 1)$. The first equality is the total differential formula, and the second expresses the relationship between the gradient and the differential.

    The relationship between matrix differentiation and derivatives

    $$
    \mathrm{d} f = \sum_{i=1}^{m}{\sum_{j=1}^{n}{\frac{\partial f}{\partial X_{ij}} \mathrm{d} X_{ij}}} = tr \left (\frac{\partial f}{\partial X}^T \mathrm{d}X \right ) \tag{25}
    $$

    where the second equality follows from the fifth identity of Eq.(20), with $A = \frac{\partial f}{\partial X}$ and $B = \mathrm{d}X$.
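A small example may make Eq.(25) more concrete. Take the scalar function $f(X) = a^T X b$, which is not part of the derivation above and is chosen here only because its gradient $\frac{\partial f}{\partial X} = a b^T$ is easy to verify by hand. Since $f$ is linear in $X$, the trace form of the differential matches the exact change in $f$ up to rounding. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 3, 4
a = rng.normal(size=(m, 1))
b = rng.normal(size=(n, 1))
X = rng.normal(size=(m, n))
dX = 1e-3 * rng.normal(size=(m, n))

def f(X):
    """Scalar function f(X) = a^T X b."""
    return (a.T @ X @ b).item()

grad = a @ b.T                           # df/dX, same shape as X

df_trace = np.trace(grad.T @ dX)         # tr((df/dX)^T dX)
df_sum = np.sum(grad * dX)               # sum_ij (df/dX)_ij dX_ij
df_exact = f(X + dX) - f(X)              # exact, because f is linear in X

print(df_trace, df_sum, df_exact)
```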

  • Original post: https://blog.csdn.net/qq_43561370/article/details/127585541