ResNet is about as famous as a network gets. I have always just taken it off the shelf without ever studying it properly, and since the architecture was proposed almost five years ago, so many people have already written about it that it is hard to add anything new.
Taking advantage of the holiday, I finally went through it carefully. Related resources:
- Original paper: https://arxiv.org/pdf/1512.03385.pdf
- PyTorch implementation: https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py
This post works through ResNet starting from that source code.
1. Overview of ResNet
The original ResNet paper introduces several configurations:
(Unless otherwise noted, all screenshots are taken from the original paper.)
Depending on network depth, the authors define five ResNet configurations, from 18 to 152 layers. Every configuration consists of five groups of convolutional layers, from conv1 and conv2_x through conv5_x. Breaking these down, there are really only three types of layers:
1.1 The plain convolution conv1
conv1 is an ordinary convolution: a 7*7 kernel, 64 output channels, stride 2, and an output size of 112*112. The figure does not say what the padding is; in the official PyTorch implementation it is 3:
```python
self.inplanes = 64
self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3, bias=False)
self.bn1 = nn.BatchNorm2d(self.inplanes)
self.relu = nn.ReLU(inplace=True)
```
The input size is given in the paper: 224*224.
The stem convolution is followed by a max-pooling layer:
```python
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
```
At this point the output size has become 56*56.
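As a quick sanity check of those shapes (a minimal sketch, not part of the torchvision code), we can push a dummy input through the two stem layers:

```python
import torch
import torch.nn as nn

# The stem layers exactly as defined above
conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 3, 224, 224)        # one 224*224 RGB image
y = conv1(x)
print(y.shape)                         # torch.Size([1, 64, 112, 112])
print(maxpool(y).shape)                # torch.Size([1, 64, 56, 56])
```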
Why an input of 224*224, and 56*56 after pooling?
My understanding is that these are mostly values found experimentally to work well on the datasets the authors used (naturally, the popular open benchmarks). Are they necessarily optimal for a specific application? I don't think so: in a specific scenario the input size is often fairly standardized, so better parameters may well exist.
1.2 The BasicBlock residual block
This is the 3*3-convolution building block used in ResNet18 and ResNet34:
```python
import torch.nn as nn


def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out
```
A residual block, then, is simply two 3*3 convolutional layers stacked together:
For example, the conv2_x residual block of ResNet18:
The residual (shortcut) connection itself comes down to a single line:
out += identity
That is, a skip connection adds the block's input to its output, which counters the vanishing or exploding gradients that make very deep networks hard to train. The paper backs this up with an experiment:
When a plain network is deepened from 20 to 56 layers, both training error and test error increase noticeably. That is fairly intuitive: after passing through that many layers, the original features may well have been washed out.
For the addition to work, the input and output of a residual block must have the same shape.
The source code also contains a downsample; we will come back to it later.
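To make the shape requirement concrete, here is a minimal sketch (not from the torchvision source) that pushes a dummy feature map through the BasicBlock defined above; with stride=1 and an unchanged channel count, the output shape matches the input exactly, so out += identity just works:

```python
import torch

block = BasicBlock(inplanes=64, planes=64)   # stride=1, channel count unchanged
x = torch.randn(1, 64, 56, 56)               # e.g. the feature map entering conv2_x of ResNet18
out = block(x)
print(out.shape)                             # torch.Size([1, 64, 56, 56]) -- same shape as the input
```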
1.3 The Bottleneck residual block
Apart from ResNet18 and ResNet34, all the other configurations use the Bottleneck block. Again, let's start with the code:
```python
def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)


def conv1x1(in_planes, out_planes, stride=1):
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)


class Bottleneck(nn.Module):
    # Bottleneck in torchvision places the stride for downsampling at 3x3 convolution(self.conv2)
    # while original implementation places the stride at the first 1x1 convolution(self.conv1)
    # according to "Deep residual learning for image recognition" https://arxiv.org/abs/1512.03385.
    # This variant is also known as ResNet V1.5 and improves accuracy according to
    # https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch.
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(Bottleneck, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out
```
The structure looks roughly like this:
The core idea is to use 1*1 convolutions to control the channel count flexibly so that the residual can still be added. Concretely, a 1*1 convolution first reduces the number of channels to 1/4 of the input (dimensionality reduction), a 3*3 convolution is then applied without changing the channel count, and finally another 1*1 convolution raises the channel count back to that of the input (dimensionality expansion).
The payoff is a big reduction in the number of parameters. For the block pictured above, the count is 1x1x256x64 + 3x3x64x64 + 1x1x64x256 = 69,632, whereas implementing the same block with a BasicBlock would take 3x3x256x256x2 = 1,179,648 parameters, roughly 17 times as many. Going from ResNet34 to ResNet50, the depth grows by about two thirds while the computation grows by only about 5% (admittedly a slightly unfair comparison, since many of ResNet50's layers are 1*1 convolutions used purely to reduce or expand dimensions).
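Those numbers are easy to verify by counting the convolution weights of the two block types (a minimal sketch, not from the original post; the convolutions all have bias=False, and BN parameters are ignored):

```python
def conv_weight_count(module):
    # Sum the weights of the convolution layers only (named conv1/conv2/conv3 above)
    return sum(p.numel() for name, p in module.named_parameters() if 'conv' in name)

bottleneck = Bottleneck(256, 64)      # 256 -> 64 -> 64 -> 256
print(conv_weight_count(bottleneck))  # 69632   = 1*1*256*64 + 3*3*64*64 + 1*1*64*256

basic = BasicBlock(256, 256)          # two 3*3 convolutions at 256 channels
print(conv_weight_count(basic))       # 1179648 = 3*3*256*256 * 2
```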
Note how the reduced channel count is computed:
width = int(planes * (base_width / 64.)) * groups
The base_width here corresponds to the width_per_group argument passed in at construction time. With the default values, width is simply equal to planes; changing width_per_group and groups changes the number of channels after the reduction.
That said, I find it hard to believe this reduction loses no information at all; more likely the loss is just small.
2. The ResNet implementation
With the building blocks covered, we can move on to the network itself:
```python
class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes=1000, zero_init_residual=False,
                 groups=1, width_per_group=64, replace_stride_with_dilation=None,
                 norm_layer=None):
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError("replace_stride_with_dilation should be None "
                             "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
                                       dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
                                       dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
                                       dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)

    def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                            self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups,
                                base_width=self.base_width, dilation=self.dilation,
                                norm_layer=norm_layer))

        return nn.Sequential(*layers)

    def _forward_impl(self, x):
        # See note [TorchScript super()]
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

    def forward(self, x):
        return self._forward_impl(x)
```
The key method is _make_layer, which builds each group of residual blocks, and here we can finally look at downsample. It is created when the stride is not 1, or when the input channel count does not match the block's output channel count (inplanes != planes * block.expansion); in that case a 1*1 convolution (followed by BN) adjusts the shortcut branch, and this downsample is applied only to the first residual block of each group.
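For instance (a minimal sketch, not from the original post, reusing the conv1x1 and BasicBlock defined earlier), the first block of conv3_x in ResNet18 takes 64 channels to 128 at stride 2, so _make_layer gives it a downsample, while the second block in the group needs none:

```python
import torch
import torch.nn as nn

# Mirrors what _make_layer builds for layer2 of ResNet18 (inplanes=64, planes=128, stride=2)
downsample = nn.Sequential(
    conv1x1(64, 128 * BasicBlock.expansion, stride=2),
    nn.BatchNorm2d(128 * BasicBlock.expansion),
)
first = BasicBlock(64, 128, stride=2, downsample=downsample)
second = BasicBlock(128, 128)          # subsequent blocks: shapes already match

x = torch.randn(1, 64, 56, 56)
print(second(first(x)).shape)          # torch.Size([1, 128, 28, 28])
```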
From this code alone it is not obvious why _forward_impl is split out into its own function rather than written directly in forward; the "See note [TorchScript super()]" comment suggests it is a TorchScript-related workaround.
At training time, ResNet18 is constructed roughly like this:
```python
ResNet(BasicBlock, [2, 2, 2, 2])
```
The second argument gives the number of residual blocks in each of the four groups conv2_x through conv5_x.
And ResNet50:
```python
ResNet(Bottleneck, [3, 4, 6, 3])
```
The other configurations follow the same pattern.
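As a quick check (a minimal sketch, not from the original post), both constructions map a 224*224 image to 1000 class logits; the difference shows up in the width of the final feature vector, 512 * block.expansion:

```python
import torch

resnet18 = ResNet(BasicBlock, [2, 2, 2, 2])
resnet50 = ResNet(Bottleneck, [3, 4, 6, 3])

x = torch.randn(1, 3, 224, 224)
print(resnet18(x).shape)         # torch.Size([1, 1000])
print(resnet50(x).shape)         # torch.Size([1, 1000])
print(resnet18.fc.in_features)   # 512  = 512 * BasicBlock.expansion
print(resnet50.fc.in_features)   # 2048 = 512 * Bottleneck.expansion
```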
The PyTorch implementation also includes a few additional tricks, such as zero_init_residual, which (per the comment in the code above) zero-initializes the last BN layer of each residual branch so that every block initially behaves like an identity mapping.
3. ResNet variants
ResNet has been around for four or five years and has attracted plenty of follow-up research. PyTorch ships two variants:
3.1 ResNeXt
This architecture was proposed at the end of 2016. In the PyTorch implementation it differs only in the constructor arguments, which enable grouped convolutions:
```python
# ResNeXt-101 32x8d model
kwargs['groups'] = 32
kwargs['width_per_group'] = 8
ResNet(Bottleneck, [3, 4, 23, 3], **kwargs)
```
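Plugging these values into the width formula from the Bottleneck block shows what the grouped 3*3 convolution actually operates on (a quick sketch for illustration):

```python
# First Bottleneck group of ResNeXt-101 32x8d: planes=64, width_per_group=8, groups=32
planes, base_width, groups = 64, 8, 32
width = int(planes * (base_width / 64.)) * groups
print(width)   # 256 channels, split into 32 groups of 8 for the 3x3 convolution
```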
3.2 Wide ResNet
This one was proposed in mid-2016, and again the PyTorch implementation barely differs:
```python
kwargs['width_per_group'] = 64 * 2
ResNet(Bottleneck, [3, 4, 6, 3], **kwargs)
```
The effect of this parameter is to double the channel count after the 1*1 reduction relative to standard ResNet: instead of reducing to 1/4 of the input channels, the bottleneck only reduces to 1/2.
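Again following the width formula (a quick sketch for illustration), for the first bottleneck group of this wide variant:

```python
# Wide ResNet-50-2: width_per_group = 64 * 2, groups = 1. For planes=64:
planes, base_width, groups = 64, 128, 1
width = int(planes * (base_width / 64.)) * groups
print(width)   # 128 -- twice the 64 channels a standard ResNet50 bottleneck would use
```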
There are other, newer upgrades of ResNet as well; to be continued.