Copyright notice: All documents are licensed under the Creative Commons License; all code is licensed under the MIT License. https://cloud.tencent.com/developer/article/1446395
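Abstract
This article derives the gradient of softmax cross-entropy in backpropagation.
Related
For the companion code, see the article:
Python and PyTorch side-by-side implementation of multi-label softmax cross-entropy loss and backpropagation
For a detailed introduction to softmax, see:
The softmax function in detail and its gradient in backpropagation
For a detailed introduction to cross-entropy, see:
The cross-entropy loss function explained through examples
Index of the article series:
https://blog.csdn.net/oBrightLamp/article/details/85067981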
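Main text
In most tutorials, softmax and cross-entropy always appear together, and their gradients are derived together as well.
The gradients of softmax and of cross-entropy have each been given separately in the two articles above.
1. Problem
Consider an input vector x. Normalizing it with the softmax function gives a vector s, which serves as the predicted probability distribution. Given a vector y, the true probability distribution, the cross-entropy function computes the error value (a scalar e). Find the gradient of e with respect to x.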
$$
\begin{gathered}
x = (x_1, x_2, x_3, \cdots, x_k) \\
s = softmax(x) \\
s_{i} = \frac{e^{x_{i}}}{\sum_{t=1}^{k} e^{x_{t}}} \\
e = crossEntropy(s, y) = -\sum_{i=1}^{k} y_{i} \log(s_{i})
\end{gathered}
$$
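The definitions above can be sketched in plain Python as a forward pass. This is a minimal illustration and not the companion article's code: the function names, the example values of `x` and `y`, and the max-subtraction trick (which leaves the softmax output unchanged but avoids overflow) are all assumptions introduced here.

```python
import math


def softmax(x):
    """s_i = exp(x_i) / sum_t exp(x_t)."""
    # Assumption: shift by max(x) for numerical stability; the ratio is unchanged.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [v / total for v in exps]


def cross_entropy(s, y):
    """e = -sum_i y_i * log(s_i)."""
    return -sum(yi * math.log(si) for si, yi in zip(s, y))


x = [1.0, 2.0, 3.0]  # example input, chosen arbitrarily
y = [0.0, 0.0, 1.0]  # example one-hot target, chosen arbitrarily
s = softmax(x)
e = cross_entropy(s, y)
print("s =", s)
print("e =", e)
```

With a one-hot y, e reduces to minus the log of the probability assigned to the true class.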
Given:
$$
\nabla e_{(s)} = \frac{\partial e}{\partial s}
= \left( \frac{\partial e}{\partial s_{1}}, \frac{\partial e}{\partial s_{2}}, \cdots, \frac{\partial e}{\partial s_{k}} \right)
= \left( -\frac{y_1}{s_1}, -\frac{y_2}{s_2}, \cdots, -\frac{y_k}{s_k} \right)
$$

$$
\nabla s_{(x)} = \frac{\partial s}{\partial x} =
\begin{pmatrix}
\partial s_{1}/\partial x_{1} & \partial s_{1}/\partial x_{2} & \cdots & \partial s_{1}/\partial x_{k} \\
\partial s_{2}/\partial x_{1} & \partial s_{2}/\partial x_{2} & \cdots & \partial s_{2}/\partial x_{k} \\
\vdots & \vdots & \ddots & \vdots \\
\partial s_{k}/\partial x_{1} & \partial s_{k}/\partial x_{2} & \cdots & \partial s_{k}/\partial x_{k}
\end{pmatrix}
=
\begin{pmatrix}
-s_{1}s_{1} + s_{1} & -s_{1}s_{2} & \cdots & -s_{1}s_{k} \\
-s_{2}s_{1} & -s_{2}s_{2} + s_{2} & \cdots & -s_{2}s_{k} \\
\vdots & \vdots & \ddots & \vdots \\
-s_{k}s_{1} & -s_{k}s_{2} & \cdots & -s_{k}s_{k} + s_{k}
\end{pmatrix}
$$
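The two given quantities can be built explicitly to see their shapes. The sketch below assumes plain Python lists; the helper names `grad_e_wrt_s` and `softmax_jacobian` and the example values are hypothetical and introduced only for illustration.

```python
import math


def softmax(x):
    m = max(x)  # assumption: shift for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [v / total for v in exps]


def grad_e_wrt_s(s, y):
    """Row vector (-y_1/s_1, ..., -y_k/s_k)."""
    return [-yi / si for si, yi in zip(s, y)]


def softmax_jacobian(s):
    """k x k matrix whose entry [i][j] is ds_i/dx_j.

    Off-diagonal entries are -s_i * s_j; diagonal entries are -s_i * s_i + s_i.
    """
    k = len(s)
    return [[(s[i] if i == j else 0.0) - s[i] * s[j] for j in range(k)]
            for i in range(k)]


s = softmax([1.0, 2.0, 3.0])  # example input, chosen arbitrarily
y = [0.2, 0.3, 0.5]           # example target distribution, chosen arbitrarily
g = grad_e_wrt_s(s, y)
J = softmax_jacobian(s)
print("grad e wrt s:", g)
for row in J:
    print(row)
```

Each row of the Jacobian sums to zero, because the entries of s always sum to 1 no matter how x changes.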
2. Solution
$$
\frac{\partial e}{\partial x_i} =
\frac{\partial e}{\partial s_1}\frac{\partial s_1}{\partial x_i} +
\frac{\partial e}{\partial s_2}\frac{\partial s_2}{\partial x_i} +
\frac{\partial e}{\partial s_3}\frac{\partial s_3}{\partial x_i} +
\cdots +
\frac{\partial e}{\partial s_k}\frac{\partial s_k}{\partial x_i}
$$
Expanding $\partial e / \partial x_i$ for every i gives the gradient vector of e with respect to x:
$$
\nabla e_{(x)} =
\left( \frac{\partial e}{\partial s_1}, \frac{\partial e}{\partial s_2}, \frac{\partial e}{\partial s_3}, \cdots, \frac{\partial e}{\partial s_k} \right)
\begin{pmatrix}
\partial s_{1}/\partial x_{1} & \partial s_{1}/\partial x_{2} & \cdots & \partial s_{1}/\partial x_{k} \\
\partial s_{2}/\partial x_{1} & \partial s_{2}/\partial x_{2} & \cdots & \partial s_{2}/\partial x_{k} \\
\vdots & \vdots & \ddots & \vdots \\
\partial s_{k}/\partial x_{1} & \partial s_{k}/\partial x_{2} & \cdots & \partial s_{k}/\partial x_{k}
\end{pmatrix}
$$

$$
\nabla e_{(x)} = \nabla e_{(s)} \nabla s_{(x)}
$$
Since:
$$
\nabla e_{(s)} = \left( -\frac{y_1}{s_1}, -\frac{y_2}{s_2}, \cdots, -\frac{y_k}{s_k} \right)
$$

$$
\nabla s_{(x)} =
\begin{pmatrix}
-s_{1}s_{1} + s_{1} & -s_{1}s_{2} & \cdots & -s_{1}s_{k} \\
-s_{2}s_{1} & -s_{2}s_{2} + s_{2} & \cdots & -s_{2}s_{k} \\
\vdots & \vdots & \ddots & \vdots \\
-s_{k}s_{1} & -s_{k}s_{2} & \cdots & -s_{k}s_{k} + s_{k}
\end{pmatrix}
$$
we obtain:
$$
\nabla e_{(x)} =
\left( s_1 \sum_{t=1}^{k} y_t - y_1, \; s_2 \sum_{t=1}^{k} y_t - y_2, \; \cdots, \; s_k \sum_{t=1}^{k} y_t - y_k \right)
$$

$$
\frac{\partial e}{\partial x_i} = s_i \sum_{t=1}^{k} y_t - y_i
$$
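The multiplication behind this result can be spelled out for a single component. The Kronecker delta $\delta_{ij}$ (1 when i = j, 0 otherwise) is notation introduced here for brevity and is not used in the original; with it, the Jacobian entries read $\partial s_j / \partial x_i = s_j (\delta_{ij} - s_i)$, and the chain rule gives:

$$
\frac{\partial e}{\partial x_i}
= \sum_{j=1}^{k} \left( -\frac{y_j}{s_j} \right) s_j \left( \delta_{ij} - s_i \right)
= -\sum_{j=1}^{k} y_j \delta_{ij} + s_i \sum_{j=1}^{k} y_j
= s_i \sum_{t=1}^{k} y_t - y_i
$$

The factor $s_j$ in each Jacobian entry cancels the $1/s_j$ from the cross-entropy gradient, which is why no division by $s$ remains in the final expression.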
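Conclusion:
Using softmax and cross-entropy together greatly reduces the computation needed for the gradient: the combined formula needs only s and y, and the k × k softmax Jacobian never has to be built. When y is a probability distribution, so that its entries sum to 1, the result simplifies further to $\partial e / \partial x_i = s_i - y_i$.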
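The saving can be checked numerically. The sketch below, which is not the companion article's implementation, computes the gradient three ways: through the explicit product of the cross-entropy gradient with the softmax Jacobian, through the combined formula, and through central finite differences as an independent check. All function names, the step size `h`, and the example values are assumptions made for illustration.

```python
import math


def softmax(x):
    m = max(x)  # assumption: shift for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [v / total for v in exps]


def cross_entropy(s, y):
    return -sum(yi * math.log(si) for si, yi in zip(s, y))


def grad_chain_rule(x, y):
    """Explicit product of grad e wrt s with the softmax Jacobian: O(k^2) work."""
    s = softmax(x)
    k = len(s)
    g = [-yi / si for si, yi in zip(s, y)]
    return [sum(g[j] * ((s[j] if i == j else 0.0) - s[j] * s[i])
                for j in range(k))
            for i in range(k)]


def grad_closed_form(x, y):
    """Combined formula s_i * sum(y) - y_i: O(k) work, no Jacobian."""
    s = softmax(x)
    total_y = sum(y)
    return [si * total_y - yi for si, yi in zip(s, y)]


def grad_numeric(x, y, h=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        diff = cross_entropy(softmax(xp), y) - cross_entropy(softmax(xm), y)
        grad.append(diff / (2 * h))
    return grad


x = [0.5, -1.0, 2.0, 0.1]  # example input, chosen arbitrarily
y = [0.1, 0.2, 0.3, 0.4]   # example target distribution, sums to 1
print("chain rule :", grad_chain_rule(x, y))
print("closed form:", grad_closed_form(x, y))
print("numeric    :", grad_numeric(x, y))
```

The three results should agree up to floating-point and finite-difference error. Because this example y sums to 1, the closed form here also matches s minus y element by element.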