ResNet is about as famous as a network gets. I have always just taken it off the shelf without ever studying it properly, and since the architecture was proposed almost five years ago, so many people have already written about it that it is hard to add anything new.
Taking advantage of the holiday, I finally went through it carefully. Related resources:
- Original paper: https://arxiv.org/pdf/1512.03385.pdf
- PyTorch implementation: https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py
This post works through ResNet starting from that source code.
1. Overview of ResNet
The original ResNet paper introduces several configurations:
(Unless otherwise noted, all screenshots are taken from the original paper.)
Depending on network depth, the authors define five ResNet configurations, from 18 to 152 layers. Every configuration consists of five groups of convolutional layers, from conv1 and conv2_x through conv5_x. Breaking these down, there are really only three types of layers:
1.1 The plain convolution conv1
conv1 is an ordinary convolution: a 7*7 kernel, 64 output channels, stride 2, and an output size of 112*112. The figure does not say what the padding is; in the official PyTorch implementation it is 3:
```python
self.inplanes = 64
self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3, bias=False)
self.bn1 = nn.BatchNorm2d(self.inplanes)
self.relu = nn.ReLU(inplace=True)
```
The input size is given in the paper: 224*224.
The stem convolution is followed by a max-pooling layer:
```python
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
```
At this point the output size has become 56*56.
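As a quick sanity check of those shapes (a minimal sketch, not part of the torchvision code), we can push a dummy input through the two stem layers:

```python
import torch
import torch.nn as nn

# The stem layers exactly as defined above
conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 3, 224, 224)        # one 224*224 RGB image
y = conv1(x)
print(y.shape)                         # torch.Size([1, 64, 112, 112])
print(maxpool(y).shape)                # torch.Size([1, 64, 56, 56])
```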
Why an input of 224*224, and 56*56 after pooling?
My understanding is that these are mostly values found experimentally to work well on the datasets the authors used (naturally, the popular open benchmarks). Are they necessarily optimal for a specific application? I don't think so: in a specific scenario the input size is often fairly standardized, so better parameters may well exist.
1.2 The BasicBlock residual block
This is the 3*3-convolution building block used in ResNet18 and ResNet34:
```python
import torch.nn as nn


def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out
```
A residual block, then, is simply two 3*3 convolutional layers stacked together:
For example, the conv2_x residual block of ResNet18:
The residual (shortcut) connection itself comes down to a single line:
out += identity
That is, a skip connection adds the block's input to its output, which counters the vanishing or exploding gradients that make very deep networks hard to train. The paper backs this up with an experiment:
When a plain network is deepened from 20 to 56 layers, both training error and test error increase noticeably. That is fairly intuitive: after passing through that many layers, the original features may well have been washed out.
For the addition to work, the input and output of a residual block must have the same shape.
The source code also contains a downsample; we will come back to it later.
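To make the shape requirement concrete, here is a minimal sketch (not from the torchvision source) that pushes a dummy feature map through the BasicBlock defined above; with stride=1 and an unchanged channel count, the output shape matches the input exactly, so out += identity just works:

```python
import torch

block = BasicBlock(inplanes=64, planes=64)   # stride=1, channel count unchanged
x = torch.randn(1, 64, 56, 56)               # e.g. the feature map entering conv2_x of ResNet18
out = block(x)
print(out.shape)                             # torch.Size([1, 64, 56, 56]) -- same shape as the input
```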
1.3 The Bottleneck residual block
Apart from ResNet18 and ResNet34, all the other configurations use the Bottleneck block. Again, let's start with the code:
```python
def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)


def conv1x1(in_planes, out_planes, stride=1):
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)


class Bottleneck(nn.Module):
    # Bottleneck in torchvision places the stride for downsampling at 3x3 convolution(self.conv2)
    # while original implementation places the stride at the first 1x1 convolution(self.conv1)
    # according to "Deep residual learning for image recognition" https://arxiv.org/abs/1512.03385.
    # This variant is also known as ResNet V1.5 and improves accuracy according to
    # https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch.
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(Bottleneck, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out
```
The structure looks roughly like this:
The core idea is to use 1*1 convolutions to control the channel count flexibly so that the residual can still be added. Concretely, a 1*1 convolution first reduces the number of channels to 1/4 of the input (dimensionality reduction), a 3*3 convolution is then applied without changing the channel count, and finally another 1*1 convolution raises the channel count back to that of the input (dimensionality expansion).
The payoff is a big reduction in the number of parameters. For the block pictured above, the count is 1x1x256x64 + 3x3x64x64 + 1x1x64x256 = 69,632, whereas implementing the same block with a BasicBlock would take 3x3x256x256x2 = 1,179,648 parameters, roughly 17 times as many. Going from ResNet34 to ResNet50, the depth grows by about two thirds while the computation grows by only about 5% (admittedly a slightly unfair comparison, since many of ResNet50's layers are 1*1 convolutions used purely to reduce or expand dimensions).
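Those numbers are easy to verify by counting the convolution weights of the two block types (a minimal sketch, not from the original post; the convolutions all have bias=False, and BN parameters are ignored):

```python
def conv_weight_count(module):
    # Sum the weights of the convolution layers only (named conv1/conv2/conv3 above)
    return sum(p.numel() for name, p in module.named_parameters() if 'conv' in name)

bottleneck = Bottleneck(256, 64)      # 256 -> 64 -> 64 -> 256
print(conv_weight_count(bottleneck))  # 69632   = 1*1*256*64 + 3*3*64*64 + 1*1*64*256

basic = BasicBlock(256, 256)          # two 3*3 convolutions at 256 channels
print(conv_weight_count(basic))       # 1179648 = 3*3*256*256 * 2
```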
Note how the reduced channel count is computed:
width = int(planes * (base_width / 64.)) * groups
The base_width here corresponds to the width_per_group argument passed in at construction time. With the default values, width is simply equal to planes; changing width_per_group and groups changes the number of channels after the reduction.
That said, I find it hard to believe this reduction loses no information at all; more likely the loss is just small.
2. The ResNet implementation
With the building blocks covered, we can move on to the network itself:
```python
class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes=1000, zero_init_residual=False,
                 groups=1, width_per_group=64, replace_stride_with_dilation=None,
                 norm_layer=None):
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError("replace_stride_with_dilation should be None "
                             "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
                                       dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
                                       dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
                                       dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)

    def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                            self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups,
                                base_width=self.base_width, dilation=self.dilation,
                                norm_layer=norm_layer))

        return nn.Sequential(*layers)

    def _forward_impl(self, x):
        # See note [TorchScript super()]
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

    def forward(self, x):
        return self._forward_impl(x)
```
The key method is _make_layer, which builds each group of residual blocks, and here we can finally look at downsample. It is created when the stride is not 1, or when the input channel count does not match the block's output channel count (inplanes != planes * block.expansion); in that case a 1*1 convolution (followed by BN) adjusts the shortcut branch, and this downsample is applied only to the first residual block of each group.
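For instance (a minimal sketch, not from the original post, reusing the conv1x1 and BasicBlock defined earlier), the first block of conv3_x in ResNet18 takes 64 channels to 128 at stride 2, so _make_layer gives it a downsample, while the second block in the group needs none:

```python
import torch
import torch.nn as nn

# Mirrors what _make_layer builds for layer2 of ResNet18 (inplanes=64, planes=128, stride=2)
downsample = nn.Sequential(
    conv1x1(64, 128 * BasicBlock.expansion, stride=2),
    nn.BatchNorm2d(128 * BasicBlock.expansion),
)
first = BasicBlock(64, 128, stride=2, downsample=downsample)
second = BasicBlock(128, 128)          # subsequent blocks: shapes already match

x = torch.randn(1, 64, 56, 56)
print(second(first(x)).shape)          # torch.Size([1, 128, 28, 28])
```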
From this code alone it is not obvious why _forward_impl is split out into its own function rather than written directly in forward; the "See note [TorchScript super()]" comment suggests it is a TorchScript-related workaround.
At training time, ResNet18 is constructed roughly like this:
```python
ResNet(BasicBlock, [2, 2, 2, 2])
```
The second argument gives the number of residual blocks in each of the four groups conv2_x through conv5_x.
And ResNet50:
```python
ResNet(Bottleneck, [3, 4, 6, 3])
```
The other configurations follow the same pattern.
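As a quick check (a minimal sketch, not from the original post), both constructions map a 224*224 image to 1000 class logits; the difference shows up in the width of the final feature vector, 512 * block.expansion:

```python
import torch

resnet18 = ResNet(BasicBlock, [2, 2, 2, 2])
resnet50 = ResNet(Bottleneck, [3, 4, 6, 3])

x = torch.randn(1, 3, 224, 224)
print(resnet18(x).shape)         # torch.Size([1, 1000])
print(resnet50(x).shape)         # torch.Size([1, 1000])
print(resnet18.fc.in_features)   # 512  = 512 * BasicBlock.expansion
print(resnet50.fc.in_features)   # 2048 = 512 * Bottleneck.expansion
```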
The PyTorch implementation also includes a few additional tricks, such as zero_init_residual, which (per the comment in the code above) zero-initializes the last BN layer of each residual branch so that every block initially behaves like an identity mapping.
3. ResNet variants
ResNet has been around for four or five years and has attracted plenty of follow-up research. PyTorch ships two variants:
3.1 ResNeXt
This architecture was proposed at the end of 2016. In the PyTorch implementation it differs only in the constructor arguments, which enable grouped convolutions:
```python
# ResNeXt-101 32x8d model
kwargs['groups'] = 32
kwargs['width_per_group'] = 8
ResNet(Bottleneck, [3, 4, 23, 3], **kwargs)
```
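Plugging these values into the width formula from the Bottleneck block shows what the grouped 3*3 convolution actually operates on (a quick sketch for illustration):

```python
# First Bottleneck group of ResNeXt-101 32x8d: planes=64, width_per_group=8, groups=32
planes, base_width, groups = 64, 8, 32
width = int(planes * (base_width / 64.)) * groups
print(width)   # 256 channels, split into 32 groups of 8 for the 3x3 convolution
```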
3.2 Wide ResNet
This one was proposed in mid-2016, and again the PyTorch implementation barely differs:
```python
kwargs['width_per_group'] = 64 * 2
ResNet(Bottleneck, [3, 4, 6, 3], **kwargs)
```
The effect of this parameter is to double the channel count after the 1*1 reduction relative to standard ResNet: instead of reducing to 1/4 of the input channels, the bottleneck only reduces to 1/2.
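Again following the width formula (a quick sketch for illustration), for the first bottleneck group of this wide variant:

```python
# Wide ResNet-50-2: width_per_group = 64 * 2, groups = 1. For planes=64:
planes, base_width, groups = 64, 128, 1
width = int(planes * (base_width / 64.)) * groups
print(width)   # 128 -- twice the 64 channels a standard ResNet50 bottleneck would use
```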
There are other, newer upgrades of ResNet as well; to be continued.