Multi-Label Softmax + Cross-Entropy Loss Explained, with Gradient Derivation for Backpropagation


Copyright notice: All documents are licensed under the Creative Commons License, all codes are licensed under the MIT License. https://cloud.tencent.com/developer/article/1446395

Abstract

This article derives the gradient of softmax cross-entropy in backpropagation.

Related

For the companion code, see:

Comparing Python and PyTorch Implementations of Multi-Label Softmax Cross-Entropy Loss and Backpropagation

For a detailed introduction to softmax, see:

The Softmax Function Explained, with Gradient Derivation for Backpropagation

For a detailed introduction to cross-entropy, see:

Cross-Entropy Loss Explained Through Worked Examples

Series index:

https://blog.csdn.net/oBrightLamp/article/details/85067981

Main Text

In most tutorials, softmax and cross-entropy always appear together, and their gradients are also derived together.

The individual gradients of softmax and cross-entropy have already been derived in the two articles above.

1. Problem

Consider an input vector x. Normalizing it with the softmax function yields a vector s, taken as the predicted probability distribution. The vector y is known to be the true probability distribution, and the cross-entropy function computes an error value (a scalar e). Find the gradient of e with respect to x.

$$
x = (x_1, x_2, x_3, \cdots, x_k)
$$

$$
s = \mathrm{softmax}(x), \quad s_i = \frac{e^{x_i}}{\sum_{t=1}^{k} e^{x_t}}
$$

$$
e = \mathrm{crossEntropy}(s, y) = -\sum_{i=1}^{k} y_i \log(s_i)
$$
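To make the definitions concrete, here is a minimal NumPy sketch of the forward pass; the values of x and y are arbitrary examples, not taken from the original article:

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; the result is unchanged.
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()

def cross_entropy(s, y):
    # e = -sum_i y_i * log(s_i)
    return -np.sum(y * np.log(s))

x = np.array([1.0, 2.0, 3.0])   # example input vector
y = np.array([0.1, 0.2, 0.7])   # example true distribution
s = softmax(x)                  # predicted distribution
e = cross_entropy(s, y)         # scalar error
```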

Known results:

$$
\nabla e_{(s)} = \frac{\partial e}{\partial s}
= \left( \frac{\partial e}{\partial s_1}, \frac{\partial e}{\partial s_2}, \cdots, \frac{\partial e}{\partial s_k} \right)
= \left( -\frac{y_1}{s_1}, -\frac{y_2}{s_2}, \cdots, -\frac{y_k}{s_k} \right)
$$

$$
\nabla s_{(x)} = \frac{\partial s}{\partial x} =
\begin{pmatrix}
\partial s_1/\partial x_1 & \partial s_1/\partial x_2 & \cdots & \partial s_1/\partial x_k \\
\partial s_2/\partial x_1 & \partial s_2/\partial x_2 & \cdots & \partial s_2/\partial x_k \\
\vdots & \vdots & \ddots & \vdots \\
\partial s_k/\partial x_1 & \partial s_k/\partial x_2 & \cdots & \partial s_k/\partial x_k
\end{pmatrix}
=
\begin{pmatrix}
-s_1 s_1 + s_1 & -s_1 s_2 & \cdots & -s_1 s_k \\
-s_2 s_1 & -s_2 s_2 + s_2 & \cdots & -s_2 s_k \\
\vdots & \vdots & \ddots & \vdots \\
-s_k s_1 & -s_k s_2 & \cdots & -s_k s_k + s_k
\end{pmatrix}
$$
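Both known results translate directly into code. The Jacobian above is exactly $\mathrm{diag}(s) - s s^{\mathsf{T}}$, which gives a one-line implementation. A sketch (NumPy, with assumed example values):

```python
import numpy as np

def grad_e_wrt_s(s, y):
    # ∂e/∂s_i = -y_i / s_i
    return -y / s

def softmax_jacobian(s):
    # ∂s_i/∂x_j = -s_i * s_j, plus an extra s_i on the diagonal,
    # i.e. diag(s) - outer(s, s).
    return np.diag(s) - np.outer(s, s)

s = np.array([0.09, 0.24, 0.67])   # example softmax output
y = np.array([0.1, 0.2, 0.7])      # example true distribution
print(grad_e_wrt_s(s, y))
print(softmax_jacobian(s))
```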

2. Solution

$$
\frac{\partial e}{\partial x_i}
= \frac{\partial e}{\partial s_1}\frac{\partial s_1}{\partial x_i}
+ \frac{\partial e}{\partial s_2}\frac{\partial s_2}{\partial x_i}
+ \frac{\partial e}{\partial s_3}\frac{\partial s_3}{\partial x_i}
+ \cdots
+ \frac{\partial e}{\partial s_k}\frac{\partial s_k}{\partial x_i}
$$

Expanding $\partial e/\partial x_i$ for every i gives the gradient vector of e with respect to x:

$$
\nabla e_{(x)} =
\left( \frac{\partial e}{\partial s_1}, \frac{\partial e}{\partial s_2}, \frac{\partial e}{\partial s_3}, \cdots, \frac{\partial e}{\partial s_k} \right)
\begin{pmatrix}
\partial s_1/\partial x_1 & \partial s_1/\partial x_2 & \cdots & \partial s_1/\partial x_k \\
\partial s_2/\partial x_1 & \partial s_2/\partial x_2 & \cdots & \partial s_2/\partial x_k \\
\vdots & \vdots & \ddots & \vdots \\
\partial s_k/\partial x_1 & \partial s_k/\partial x_2 & \cdots & \partial s_k/\partial x_k
\end{pmatrix}
$$

$$
\nabla e_{(x)} = \nabla e_{(s)} \, \nabla s_{(x)}
$$

Since:

$$
\nabla e_{(s)} = \left( -\frac{y_1}{s_1}, -\frac{y_2}{s_2}, \cdots, -\frac{y_k}{s_k} \right)
$$

$$
\nabla s_{(x)} =
\begin{pmatrix}
-s_1 s_1 + s_1 & -s_1 s_2 & \cdots & -s_1 s_k \\
-s_2 s_1 & -s_2 s_2 + s_2 & \cdots & -s_2 s_k \\
\vdots & \vdots & \ddots & \vdots \\
-s_k s_1 & -s_k s_2 & \cdots & -s_k s_k + s_k
\end{pmatrix}
$$

we obtain:

$$
\nabla e_{(x)} = \left( s_1\sum_{t=1}^{k} y_t - y_1, \; s_2\sum_{t=1}^{k} y_t - y_2, \; \cdots, \; s_k\sum_{t=1}^{k} y_t - y_k \right)
$$

$$
\frac{\partial e}{\partial x_i} = s_i\sum_{t=1}^{k} y_t - y_i
$$
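As a sanity check, the closed form can be compared against the explicit vector-Jacobian product from the start of this section. A self-contained NumPy sketch, using the same assumed example values as above:

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.1, 0.2, 0.7])
s = softmax(x)

# Closed form: ∂e/∂x_i = s_i * sum_t(y_t) - y_i
grad_closed = s * y.sum() - y

# Chain rule: ∇e(x) = ∇e(s) @ ∇s(x)
grad_chain = (-y / s) @ (np.diag(s) - np.outer(s, s))

assert np.allclose(grad_closed, grad_chain)
print(grad_closed)
```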

Conclusion

Using softmax and cross-entropy together greatly reduces the amount of computation needed to find the gradient. Because y need not sum to 1 in the multi-label setting, the result keeps the $\sum_{t=1}^{k} y_t$ factor; when y is a proper probability distribution ($\sum_{t=1}^{k} y_t = 1$), the gradient simplifies further to $\partial e/\partial x_i = s_i - y_i$.
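The result can also be cross-checked with PyTorch's autograd; a sketch below (see the companion article linked above for the full Python/PyTorch comparison):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.tensor([0.1, 0.2, 0.7])

s = torch.softmax(x, dim=0)
e = -(y * torch.log(s)).sum()   # cross-entropy against distribution y
e.backward()

# Autograd gradient vs. the closed form s_i * sum(y) - y_i
print(x.grad)
print(s.detach() * y.sum() - y)
```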
