Model Pruning

2022-05-06 10:26:53

Model pruning removes the connections or neurons whose weights fall below a certain threshold, yielding a sparser network.

Because connection-level pruning is a highly irregular operation, implementations usually maintain a matrix with the same shape as the weights, called the mask matrix: a 1 in the mask marks a weight to keep, and a 0 marks a weight to prune.
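
As a minimal sketch of this mask mechanism (the weight shape and threshold below are made-up illustrations, not taken from any model in this article):

import tensorflow as tf

# Illustrative weights of one layer and a pruning threshold.
weights = tf.random.normal((4, 4))
threshold = 0.5

# The mask has the same shape as the weights: 1 = keep the weight, 0 = prune it.
mask = tf.cast(tf.abs(weights) >= threshold, weights.dtype)
pruned_weights = weights * mask   # pruned connections become exactly 0

print(mask.numpy())
print(pruned_weights.numpy())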

  • Pruning granularity: from single neurons and connections up to entire network layers

The granularity of model pruning can range from individual weights and neurons up to entire layers.

In the figure above, from left to right, the granularities are: the finest weight level (fine-grained), vector level, kernel level, and channel/filter level. The leftmost is an irregular, unstructured, fine-grained scheme: each square represents a convolution kernel, the cells inside it represent connections, and the red cells mark connections that are pruned. For how kernels connect to feature maps, see the convolutional neural network section of Tensorflow深度学习算法整理. Because the pattern is irregular, running forward inference with such a pruned model requires dedicated hardware support, and we must maintain a lookup table recording which connections were pruned and which were kept. The second scheme, vector level, prunes a whole row of connections inside a kernel and is therefore somewhat regular. The third prunes selected kernels within a layer, and the fourth prunes entire filters, i.e., all the kernels that produce one output channel (see the sketch below). These last two are called structured pruning: the result is regular and needs no special hardware support.
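
As a minimal sketch of the structured, filter-level case (the kernel shape and the number of filters kept here are illustrative assumptions), whole filters can be zeroed by their L1 norm and then simply removed, with no per-connection lookup table:

import tensorflow as tf

# Assumed Conv2D kernel layout: (kh, kw, in_channels, out_channels).
kernel = tf.random.normal((3, 3, 16, 32))

# One L1 norm per output filter, shape (32,).
l1 = tf.reduce_sum(tf.abs(kernel), axis=[0, 1, 2])

# Keep the 16 strongest filters, prune the rest.
num_keep = 16
threshold = tf.sort(l1, direction='DESCENDING')[num_keep - 1]
filter_mask = tf.cast(l1 >= threshold, kernel.dtype)   # shape (32,), 1 = keep

# Broadcasting zeroes entire filters; they can then simply be removed
# from the layer, which is why structured pruning needs no special hardware.
pruned_kernel = kernel * filter_mask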

  • Dropout/DropConnect

Everyone is familiar with Dropout: it drops neurons (see Tensorflow深度学习算法整理 for details). DropConnect instead drops the connections between neurons; it is an unstructured scheme at the weight level and even more irregular. However, both methods are applied only during training and have no effect on the model at test time, so strictly speaking they are not model-pruning methods.
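
For reference, a rough DropConnect sketch (the class name, rate, and inverted-scaling trick are illustrative choices, not the paper's exact formulation); note that only the training path touches the weights, so the deployed model is unchanged:

import tensorflow as tf

class DropConnectDense(tf.keras.layers.Dense):
    """Dense layer whose individual weights are randomly dropped during training."""

    def __init__(self, units, rate=0.5, **kwargs):
        super().__init__(units, **kwargs)
        self.rate = rate

    def call(self, inputs, training=None):
        kernel = self.kernel
        if training:
            # Bernoulli mask over the weight matrix, resampled every step;
            # surviving weights are rescaled so the expected output is unchanged.
            mask = tf.cast(tf.random.uniform(tf.shape(kernel)) >= self.rate,
                           kernel.dtype)
            kernel = kernel * mask / (1.0 - self.rate)
        outputs = tf.matmul(inputs, kernel)
        if self.use_bias:
            outputs = tf.nn.bias_add(outputs, self.bias)
        if self.activation is not None:
            outputs = self.activation(outputs)
        return outputs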

Redundancy of weights

The reason we can prune a model at all is that some of the network's parameters are redundant: removing them barely affects the network, and that is what makes pruning possible.

  • Redundancy of convolution filters

Convolution kernels exhibit phase complementarity (a difference only in sign, e.g., one kernel equals another kernel multiplied by -1). Such a pair of kernels is linearly dependent, which means redundancy. If we have a better way of capturing this linear dependence, we can reduce the number of filters and learn the two linearly related transforms jointly (CReLU). For background on linear dependence and independence, see 线性代数整理.

Suppose the filters of a given layer are denoted φ. Take any convolution kernel φ_i from that layer and suppose we can find a vector whose inner product with φ_i is minimal, i.e., one that points in the opposite direction; that vector is essentially -φ_i, the negation of φ_i. In practice, among the remaining kernels of the same layer (excluding φ_i itself), we find the one with the smallest cosine similarity to φ_i, call it the pairing filter φ̄_i, and record that similarity as

μ_i = ⟨φ_i, φ̄_i⟩ / (‖φ_i‖ · ‖φ̄_i‖)

We then plot a histogram of the μ_i values. If every φ_i can find a pairing filter whose cosine similarity is very close to -1, the peak of the μ distribution lies near -1 (far from 0); the harder it is to find such a filter, the closer the peak lies to 0. The histograms show that the shallower the layer, the lower the measured similarity, i.e., the stronger the phase complementarity; conversely, the deeper the layer, the harder it is to find a filter with a phase opposite to φ_i.
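
A minimal sketch of how μ_i can be computed for one layer (variable names are illustrative; kernel stands for that layer's Conv2D weights):

import numpy as np

# Assumed Conv2D weight layout: (kh, kw, in_channels, out_channels).
kernel = np.random.randn(3, 3, 3, 32)

# Flatten every output filter into a row vector and L2-normalize it.
filters = kernel.reshape(-1, kernel.shape[-1]).T
filters = filters / np.linalg.norm(filters, axis=1, keepdims=True)

# Pairwise cosine similarities; set the diagonal to 1 so that a filter
# is never chosen as its own pairing filter.
cos = filters @ filters.T
np.fill_diagonal(cos, 1.0)

# mu_i = cosine similarity between phi_i and its pairing filter (the minimum).
mu = cos.min(axis=1)

# A histogram of mu peaking near -1 indicates strong phase complementarity.
hist, edges = np.histogram(mu, bins=20, range=(-1.0, 1.0))
print(hist)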

To exploit this property, we build a new computational unit, the Concatenated Rectified Linear Unit (CReLU): CReLU(x) = [ReLU(x), ReLU(-x)], i.e., the positive and negative parts are activated separately and concatenated along the channel axis.

If a conv layer has c output channels, then after CReLU it has 2c channels. Below we modify the earlier CIFAR-10 code accordingly: since CReLU doubles the channel count, we first halve the number of channels in each conv layer.

import tensorflow as tf
from tensorflow.keras import datasets, models, Sequential, layers, losses, optimizers

if __name__ == '__main__':

    (X_train, y_train), (X_test, y_test) = datasets.cifar10.load_data()
    X_train = tf.constant(X_train, dtype=tf.float32) / 255
    y_train = tf.constant(y_train, dtype=tf.int32)
    X_test = tf.constant(X_test, dtype=tf.float32) / 255
    y_test = tf.constant(y_test, dtype=tf.int32)
    data_train = tf.data.Dataset.from_tensor_slices((X_train, y_train))
    data_test = tf.data.Dataset.from_tensor_slices((X_test, y_test))
    data_train = data_train.shuffle(10000).batch(64)
    data_test = data_test.shuffle(10000).batch(64)

    class CReLU(layers.Layer):

        def __init__(self, **kwargs):
            super(CReLU, self).__init__(**kwargs)
            self.relu = layers.ReLU()

        def call(self, x, mask=None):
            out1 = self.relu(x)
            out2 = self.relu(-x)
            return tf.concat([out1, out2], axis=-1)

    class Vggbase(models.Model):

        def __init__(self):
            super(Vggbase, self).__init__()
            self.conv1 = Sequential([
                layers.Conv2D(32, (3, 3), strides=1, padding='same'),
                layers.BatchNormalization(),
                CReLU()
            ])
            self.max_pooling1 = layers.MaxPool2D((2, 2), strides=2)
            self.conv2_1 = Sequential([
                layers.Conv2D(64, (3, 3), strides=1, padding='same'),
                layers.BatchNormalization(),
                CReLU()
            ])
            self.conv2_2 = Sequential([
                layers.Conv2D(64, (3, 3), strides=1, padding='same'),
                layers.BatchNormalization(),
                CReLU()
            ])
            self.max_pooling2 = layers.MaxPool2D((2, 2), strides=2)
            self.conv3_1 = Sequential([
                layers.Conv2D(128, (3, 3), strides=1, padding='same'),
                layers.BatchNormalization(),
                CReLU()
            ])
            self.conv3_2 = Sequential([
                layers.Conv2D(128, (3, 3), strides=1, padding='same'),
                layers.BatchNormalization(),
                CReLU()
            ])
            self.max_pooling3 = layers.MaxPool2D((2, 2), strides=2, padding='same')
            self.conv4_1 = Sequential([
                layers.Conv2D(256, (3, 3), strides=1, padding='same'),
                layers.BatchNormalization(),
                CReLU()
            ])
            self.conv4_2 = Sequential([
                layers.Conv2D(256, (3, 3), strides=1, padding='same'),
                layers.BatchNormalization(),
                CReLU()
            ])
            self.max_pooling4 = layers.MaxPool2D((2, 2), strides=2)
            self.fc = layers.Dense(10)

        def call(self, x):
            batchsize = x.shape[0]
            out = self.conv1(x)
            out = self.max_pooling1(out)
            out = self.conv2_1(out)
            out = self.conv2_2(out)
            out = self.max_pooling2(out)
            out = self.conv3_1(out)
            out = self.conv3_2(out)
            out = self.max_pooling3(out)
            out = self.conv4_1(out)
            out = self.conv4_2(out)
            out = self.max_pooling4(out)
            out = tf.reshape(out, (batchsize, -1))
            out = self.fc(out)
            return out

    epoch_num = 200
    lr = 0.001
    net = Vggbase()
    loss_func = losses.SparseCategoricalCrossentropy(from_logits=True)
    scheduler = optimizers.schedules.ExponentialDecay(initial_learning_rate=lr,
                                                      decay_steps=3905,
                                                      decay_rate=0.9)
    optimizer = optimizers.Adam(learning_rate=scheduler)
    for epoch in range(epoch_num):
        for i, (images, labels) in enumerate(data_train):
            with tf.GradientTape() as tape:
                outputs = net(images)
                labels = tf.squeeze(labels, axis=-1)
                loss = loss_func(labels, outputs)
            grads = tape.gradient(loss, net.trainable_variables)
            optimizer.apply_gradients(zip(grads, net.trainable_variables))
            pred = tf.argmax(outputs, axis=-1)
            pred = tf.cast(pred, tf.int32)
            # count correct predictions in this mini-batch
            correct = tf.where(pred == labels).shape[0]
            print("epoch is ", epoch, "step ", i, "loss is: ", loss.numpy(),
                  "mini-batch correct is: ", 100.0 * correct / labels.shape[0])

Let us now look at the test results of replacing the ReLU units with CReLU.

From the figure above we can see that replacing ReLU with CReLU lowers the model's error rate. And from the upper-right figure, the model that halves the channel count before applying CReLU has only 0.7M parameters, half of the plain ReLU model.

Since CIFAR-10 is a fairly small dataset, let us also look at the ImageNet results in the figure above. Replacing layers 1, 4 and 7 with CReLU lowers the error rate from the baseline's 41.81 to 40.45, an improvement; replacing layers 1-4 lowers it further to 39.82. But replacing layers 1-7 makes the error rate rise again, and replacing layers 1-9 raises it further, which confirms that only the shallow layers show clear phase complementarity, while the deep layers do not.

Model pruning in TensorFlow

import tensorflow as tf
from tensorflow.keras import datasets, layers, losses, Sequential, models
import tensorflow_model_optimization as tfmot
import tempfile
import numpy as np
import os
import zipfile


def setup_model():
    model = Sequential([
        layers.InputLayer(input_shape=(28, 28, 1)),
        layers.Conv2D(12, (3, 3), activation='relu'),
        layers.MaxPool2D((2, 2)),
        layers.Flatten(),
        layers.Dense(10)
    ])
    return model

def setup_pretrained_weights():
    model = setup_model()
    model.compile(optimizer='adam', loss=losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    model.fit(data_train, epochs=4, validation_data=data_test)
    # create a temporary .h5 file for the pretrained weights
    pretrained_model = tempfile.mktemp('.h5', dir='./')
    # save the trained model to that file
    models.save_model(model, pretrained_model, include_optimizer=False)
    return pretrained_model

def apply_pruning_to_dense(layer):
    # Prune only Dense layers (intended for use as a clone_function with
    # tf.keras.models.clone_model); not actually called in the run below,
    # where the whole model is wrapped instead.
    if isinstance(layer, layers.Dense):
        print("Apply pruning to Dense")
        return tfmot.sparsity.keras.prune_low_magnitude(layer)
    return layer

def get_gzipped_model_size(file):
    # Return the size of the model file in bytes after zip (deflate) compression.
    zipped_file = tempfile.mktemp('.zip')
    with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
        f.write(file)
    return os.path.getsize(zipped_file)

if __name__ == '__main__':

    (X_train, y_train), (X_test, y_test) = datasets.mnist.load_data()
    num_images = X_train.shape[0]
    X_train = tf.constant(X_train, dtype=tf.float32) / 255
    X_train = tf.reshape(X_train, (X_train.shape[0], 28, 28, 1))
    y_train = tf.constant(y_train, dtype=tf.int32)
    X_test = tf.constant(X_test, dtype=tf.float32) / 255
    X_test = tf.reshape(X_test, (X_test.shape[0], 28, 28, 1))
    y_test = tf.constant(y_test, dtype=tf.int32)
    data_train = tf.data.Dataset.from_tensor_slices((X_train, y_train))
    data_test = tf.data.Dataset.from_tensor_slices((X_test, y_test))
    data_train = data_train.shuffle(10000).batch(64)
    data_test = data_test.shuffle(10000).batch(64)

    pretrained_model = setup_pretrained_weights()
    print(pretrained_model)

    base_model = models.load_model(pretrained_model)
    print(base_model.summary())

    end_step = np.ceil(num_images / 64).astype(np.int32) * 4
    # sparsity schedule: grow sparsity from 50% to 80% over the 4 training epochs
    pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.5,
                                                            final_sparsity=0.8,
                                                            begin_step=0,
                                                            end_step=end_step)
    # wrap the whole model with pruning wrappers
    model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(base_model, pruning_schedule)
    print(model_for_pruning.summary())
    logdir = tempfile.mktemp()
    callbacks = [
        tfmot.sparsity.keras.UpdatePruningStep(),
        tfmot.sparsity.keras.PruningSummaries(log_dir=logdir)
    ]
    model_for_pruning.compile(optimizer='adam', loss=losses.SparseCategoricalCrossentropy(from_logits=True),
                              metrics=['accuracy'])
    model_for_pruning.fit(data_train, epochs=4, validation_data=data_test, callbacks=callbacks)

Run results

Epoch 1/4
938/938 [==============================] - 3s 3ms/step - loss: 0.3580 - accuracy: 0.9039 - val_loss: 0.1550 - val_accuracy: 0.9564
Epoch 2/4
938/938 [==============================] - 3s 3ms/step - loss: 0.1305 - accuracy: 0.9639 - val_loss: 0.0962 - val_accuracy: 0.9720
Epoch 3/4
938/938 [==============================] - 3s 3ms/step - loss: 0.0915 - accuracy: 0.9743 - val_loss: 0.0736 - val_accuracy: 0.9773
Epoch 4/4
938/938 [==============================] - 3s 3ms/step - loss: 0.0744 - accuracy: 0.9788 - val_loss: 0.0721 - val_accuracy: 0.9767
./tmp_r_zgrss.h5
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 26, 26, 12)        120       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 12)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 2028)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                20290     
=================================================================
Total params: 20,410
Trainable params: 20,410
Non-trainable params: 0
_________________________________________________________________
None
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
prune_low_magnitude_conv2d_1 (None, 26, 26, 12)        230       
_________________________________________________________________
prune_low_magnitude_max_pool (None, 13, 13, 12)        1         
_________________________________________________________________
prune_low_magnitude_flatten_ (None, 2028)              1         
_________________________________________________________________
prune_low_magnitude_dense_1  (None, 10)                40572     
=================================================================
Total params: 40,804
Trainable params: 20,410
Non-trainable params: 20,394
_________________________________________________________________
None
Epoch 1/4
  1/938 [..............................] - ETA: 18:51 - loss: 0.0755 - accuracy: 0.9688
938/938 [==============================] - 5s 4ms/step - loss: 0.0772 - accuracy: 0.9792 - val_loss: 0.0705 - val_accuracy: 0.9788
Epoch 2/4
938/938 [==============================] - 3s 4ms/step - loss: 0.0764 - accuracy: 0.9791 - val_loss: 0.0761 - val_accuracy: 0.9764
Epoch 3/4
938/938 [==============================] - 3s 4ms/step - loss: 0.0770 - accuracy: 0.9781 - val_loss: 0.0763 - val_accuracy: 0.9746
Epoch 4/4
938/938 [==============================] - 3s 4ms/step - loss: 0.0705 - accuracy: 0.9800 - val_loss: 0.0706 - val_accuracy: 0.9766

Comparing the baseline model with the pruning-wrapped model, the latter has an extra 20,394 parameters (Non-trainable params: 20,394). These are the non-trainable masks that tensorflow-model-optimization adds for every weight in the network; each mask value is 0 or 1 and marks whether that weight is pruned. Once pruning is finished, we use strip_pruning to remove these temporarily added non-trainable params.

# strip the temporary non-trainable pruning masks
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
print(model_for_export.summary())
pruned_file = tempfile.mktemp('.h5', dir='./')
models.save_model(model_for_export, pruned_file, include_optimizer=False)

print(get_gzipped_model_size(pretrained_model))
print(get_gzipped_model_size(pruned_file))

Run results

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 26, 26, 12)        120       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 12)        0         
_________________________________________________________________
flatten (Flatten)            (None, 2028)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                20290     
=================================================================
Total params: 20,410
Trainable params: 20,410
Non-trainable params: 0
_________________________________________________________________
None
78074
25791

The results show that, after pruning and stripping the non-trainable weights, the compressed model (25,791 bytes) is about one third the size of the compressed baseline model (78,074 bytes).
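
As a final sanity check, here is a short sketch (reusing pruned_file from the code above) that loads the exported model and confirms its kernels reached roughly the final_sparsity=0.8 set in the schedule:

import numpy as np
from tensorflow.keras import models

pruned_model = models.load_model(pruned_file)
for layer in pruned_model.layers:
    for w in layer.get_weights():
        if w.ndim > 1:                       # kernels only, skip biases
            sparsity = float(np.mean(w == 0))
            print(layer.name, w.shape, "sparsity:", round(sparsity, 3))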
