Introduction
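Deploying models on edge devices requires deep learning models that are smaller, lighter, and fast enough. A quantized model can be executed with integer arithmetic, which removes much of the cost of floating-point computation. PyTorch supports 8-bit quantization: compared with a 32-bit floating-point model, model size and memory requirements shrink by about 4x, and on hardware that supports 8-bit arithmetic, inference with the quantized model can be roughly 2 to 4 times faster. Quantization is a first-choice technique for model deployment and inference acceleration.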
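The 4x figure follows directly from the storage width: a float32 value takes 4 bytes and an 8-bit integer takes 1. The following is a minimal pure-Python sketch of affine (scale and zero-point) quantization, written only to illustrate the arithmetic. The helper names `quantize` and `dequantize` and the sample weights are made up for this example, and this is not how PyTorch implements quantization internally.

```python
import array

def quantize(values, num_bits=8):
    """Affine quantization of floats to unsigned integers in [0, 2**num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The represented range must contain 0.0 so that zero maps to an integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map the integers back to approximate float values."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.0, -0.5, 0.0, 0.25, 1.0]   # illustrative values only
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)

# Storage cost: 4 bytes per float32 value versus 1 byte per 8-bit value.
fp32_bytes = array.array("f", weights).itemsize * len(weights)
int8_bytes = array.array("B", q).itemsize * len(q)
print(fp32_bytes, int8_bytes)
```

Each restored value differs from the original by at most one quantization step (`scale`), which is the precision given up in exchange for the smaller storage.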
Quantization Support in PyTorch
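PyTorch supports quantization of deep learning models on several kinds of processors. In the most common case, a model is trained in FP32 and then converted to INT8. PyTorch also supports quantization-aware training, which uses fake-quantization modules during training and exports a quantized low-precision model at the end. Quantizing a model in PyTorch involves three elements: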
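Quantization configuration (qconfig): declares how weights and activations are quantized
Compute backend: the supported hardware platform
Quantization engine: declares which hardware platform the quantized model runs on; it must be consistent with the backend declared in the quantization configuration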
Natively supported quantization backends include:
x86 CPUs with AVX2 or higher, supported through FBGEMM:
https://github.com/pytorch/FBGEMM
ARM CPUs, supported through QNNPACK:
https://github.com/pytorch/QNNPACK
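Both backends support quantized execution directly, but GPUs do not. According to the official PyTorch documentation at the time of writing, quantized operators run only on CPU. On GPU, only the fake-quantization step of quantization-aware training is supported, so a model trained this way still has to be converted and executed as INT8 on CPU.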
Default settings for fbgemm:
# set the qconfig for PTQ
qconfig = torch.quantization.get_default_qconfig('fbgemm')
# or, set the qconfig for QAT
qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'fbgemm'
Default settings for qnnpack:
# set the qconfig for PTQ
qconfig = torch.quantization.get_default_qconfig('qnnpack')
# or, set the qconfig for QAT
qconfig = torch.quantization.get_default_qat_qconfig('qnnpack')
# set the qengine to control weight packing
torch.backends.quantized.engine = 'qnnpack'
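Since the qconfig and the engine have to name the same backend, one way to keep them consistent is to choose the backend once and derive both from it. The sketch below assumes a PyTorch build that ships at least one of the two backends. `pick_engine` is a hypothetical helper written for this example, not a PyTorch function.

```python
import torch

# Engines compiled into this PyTorch build; the exact list depends on the
# build and the platform.
supported = torch.backends.quantized.supported_engines

def pick_engine(preferred=("fbgemm", "qnnpack")):
    """Return the first preferred engine that this build supports (hypothetical helper)."""
    for name in preferred:
        if name in supported:
            return name
    raise RuntimeError(f"none of {preferred} is available in {supported}")

engine = pick_engine()

# Use the same name for both the engine and the qconfig so they stay consistent.
torch.backends.quantized.engine = engine
qconfig = torch.quantization.get_default_qconfig(engine)
print(engine, qconfig)
```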
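Eager Mode Quantization
Dynamic quantization
This is the simplest form of quantization. It suits models whose execution time is dominated by loading weights from memory rather than by the computation itself, with LSTM inference as the typical case. Comparing the model before and after quantization: the weights are converted to INT8 ahead of time, while activations stay in floating point between layers and are quantized on the fly during inference.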
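A minimal sketch of dynamic quantization on an LSTM-based model follows, using `torch.quantization.quantize_dynamic`. The `Tagger` class, its layer sizes, and the random input are hypothetical and only serve to give the call something to operate on.

```python
import torch
import torch.nn as nn

class Tagger(nn.Module):
    """Hypothetical toy model: an LSTM followed by a linear classifier."""
    def __init__(self, in_dim=16, hidden=32, classes=4):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, classes)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])  # classify from the last time step

model_fp32 = Tagger().eval()

# Replace LSTM and Linear layers with dynamically quantized versions.
# Weights become int8; activations are quantized on the fly at inference time.
# The call returns a new model and leaves model_fp32 untouched.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)

x = torch.randn(2, 5, 16)  # (batch, sequence length, features)
with torch.no_grad():
    y_fp32 = model_fp32(x)
    y_int8 = model_int8(x)

print(type(model_int8.fc))
print((y_fp32 - y_int8).abs().max())
```

No calibration data and no retraining are needed, which is why this is the easiest method to try first. Inputs and outputs of the quantized model remain ordinary float tensors.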
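Static quantization
This is the well-known PTQ (Post Training Quantization), that is, quantization applied after training, and it mainly targets CNNs. Comparing the model before and after quantization: both weights and activations are stored and computed in INT8.
The comparison shows that the two methods differ mainly in how activations are handled: dynamic quantization computes the activation quantization parameters on the fly, while static quantization fixes them ahead of time through calibration.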
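The static workflow (fuse, prepare, calibrate, convert) can be sketched as follows. `SmallCNN` and the random calibration batches are stand-ins made up for this example; real calibration should use representative input data. The qconfig is taken from the engine active in the current build, which is an assumption made so that the snippet runs on both x86 and ARM machines.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Hypothetical one-layer CNN used only to demonstrate the workflow."""
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # float -> quantized
        self.conv = nn.Conv2d(1, 1, 1)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # quantized -> float

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

# Static quantization operates on a model in eval mode.
model_fp32 = SmallCNN().eval()

# Assumption: use the qconfig that matches the currently active engine.
engine = torch.backends.quantized.engine
model_fp32.qconfig = torch.quantization.get_default_qconfig(engine)

# Fuse conv + relu, then insert observers that record activation ranges.
fused = torch.quantization.fuse_modules(model_fp32, [["conv", "relu"]])
prepared = torch.quantization.prepare(fused)

# Calibration: run sample data so the observers can determine the
# activation scales. Random tensors stand in for real data here.
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(4, 1, 4, 4))

# Replace the observed modules with int8 implementations.
model_int8 = torch.quantization.convert(prepared)
out = model_int8(torch.randn(4, 1, 4, 4))
print(type(model_int8.conv), out.shape)
```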
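Quantization-aware training
Models obtained with quantization-aware training (QAT) are more accurate than those produced by the other methods, and their accuracy drop relative to the original floating-point model is smaller than with PTQ. Comparing the model before and after quantization: during training, fake-quantization modules simulate INT8 rounding while all computation stays in floating point; after conversion, the model runs in INT8 like a statically quantized one.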
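To make the idea of fake quantization concrete, the sketch below applies `torch.fake_quantize_per_tensor_affine` to a small tensor. The scale, zero point, and input values are arbitrary choices for illustration; during real QAT these parameters are maintained by the observers that `prepare_qat` inserts.

```python
import torch

# Arbitrary illustrative values, all well inside the representable range.
x = torch.tensor([-0.9, -0.3, 0.0, 0.4, 0.9], requires_grad=True)
scale, zero_point = 2.0 / 255, 128   # assumed parameters for a [-1, 1] range

# Quantize to the integer grid [0, 255] and immediately dequantize.
# The result is still a float tensor, but it only takes values that an
# 8-bit quantized tensor with this scale and zero point could represent.
x_fq = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, 0, 255)

# Gradients pass through unchanged for in-range values (straight-through
# estimator), so the network can still be trained with backpropagation.
x_fq.sum().backward()

print(x_fq)
print(x.grad)
```

Because the forward pass already contains the rounding error, training can adapt the weights to it, which is why QAT usually loses less accuracy than quantizing after training.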
API demonstration:
import torch

# define a floating point model where some layers could benefit from QAT
class M(torch.nn.Module):
    def __init__(self):
        super(M, self).__init__()
        # QuantStub converts tensors from floating point to quantized
        self.quant = torch.quantization.QuantStub()
        self.conv = torch.nn.Conv2d(1, 1, 1)
        self.bn = torch.nn.BatchNorm2d(1)
        self.relu = torch.nn.ReLU()
        # DeQuantStub converts tensors from quantized to floating point
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.dequant(x)
        return x

# create a model instance
model_fp32 = M()

# put the model in eval mode for fusion; recent PyTorch releases reject
# fuse_modules on a model in train mode
model_fp32.eval()

# attach a global qconfig, which contains information about what kind
# of observers to attach. Use 'fbgemm' for server inference and
# 'qnnpack' for mobile inference. Other quantization configurations such
# as selecting symmetric or asymmetric quantization and MinMax or L2Norm
# calibration techniques can be specified here.
model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

# fuse the activations to preceding layers, where applicable
# this needs to be done manually depending on the model architecture
model_fp32_fused = torch.quantization.fuse_modules(model_fp32,
    [['conv', 'bn', 'relu']])

# Prepare the model for QAT. This inserts observers and fake_quants in
# the model that will observe weight and activation tensors during calibration.
# The model must be in train mode for the QAT logic to work.
model_fp32_prepared = torch.quantization.prepare_qat(model_fp32_fused.train())

# run the training loop (the real loop is not shown; this placeholder only
# pushes a few random batches through the model so the observers see data)
def training_loop(model, steps=10):
    for _ in range(steps):
        model(torch.randn(4, 1, 4, 4))

training_loop(model_fp32_prepared)

# Convert the observed model to a quantized model. This does several things:
# quantizes the weights, computes and stores the scale and bias value to be
# used with each activation tensor, fuses modules where appropriate,
# and replaces key operators with quantized implementations.
model_fp32_prepared.eval()
model_int8 = torch.quantization.convert(model_fp32_prepared)

# run the model, relevant calculations will happen in int8
# (a random tensor stands in for real input data)
input_fp32 = torch.randn(4, 1, 4, 4)
res = model_int8(input_fp32)
Coming up: the next article will walk through a complete example.
References:
https://pytorch.org/docs/stable/quantization.html
https://arxiv.org/pdf/1506.02025.pdf