TinyML-5: The Mechanisms Behind TFLite Quantization

2021-01-17 16:01:05

Introduction

The previous article explained why int8 quantization is sufficient for running inference and why quantization matters so much for TinyML, but it did not go into how quantization is actually implemented. This post starts from TFLite example code and uses the converter's Optimize options as the entry point to discuss the PTQ and QAT techniques behind TFLite quantization.

TF-Lite example: Optimize Options

(Figure: tflite example)

As is well known, quantizing a TF model with the TFLite converter shrinks the weights and speeds up inference, but what is the mechanism behind it? The figure above shows the typical usage: the TFLite converter transforms a saved model, and the converter's optimizations field accepts tf.lite.Optimize, which has three options (DEFAULT, OPTIMIZE_FOR_SIZE, OPTIMIZE_FOR_LATENCY). How do they differ? Taking the names at face value, "FOR_SIZE" should focus on shrinking the model, while "FOR_LATENCY" should focus on speeding up inference. So the question becomes: both are quantization, so how do the two directions differ in their implementation?
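A minimal sketch of the conversion call shown in the figure above, assuming an already trained SavedModel at a placeholder path saved_model_dir:

```python
import tensorflow as tf

# Load a trained SavedModel (placeholder path) into the TFLite converter.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Choose one of the three tf.lite.Optimize options discussed above.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
# converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_LATENCY]

# Convert and write out the quantized flatbuffer.
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```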

Categories of Quantization Techniques

At a high level, TFLite's quantization techniques fall into two broad categories:

  • Post-training Quantization (PTQ):
    • Quantized Weight Compression (for size)
    • Quantized Inference Calculation (for latency)
  • Quantization-aware Training (QAT):

QAT quantizes the weights during training, so even the gradients are computed with respect to the quantized weights. Put simply, training quantizes each layer's output so that the network gets used to the reduced precision, which ultimately yields a smaller accuracy drop at inference and deployment time. This post focuses on PTQ; QAT will be covered in a future article.
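As a rough illustration only (QAT is not the focus of this post), quantization-aware training for Keras models is exposed through the TensorFlow Model Optimization toolkit; the sketch below assumes the tensorflow_model_optimization package and uses a tiny stand-in Keras model:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# A tiny stand-in Keras model, just so the sketch is self-contained.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(2),
])

# Insert fake-quantization ops; gradients then flow through quantized weights.
q_aware_model = tfmot.quantization.keras.quantize_model(model)

q_aware_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
# Train as usual; the network learns to tolerate the reduced precision:
# q_aware_model.fit(x_train, y_train, epochs=1)
```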

Post-training Quantization (PTQ)

What PTQ does is convert the TF model's float32 weights into suitable int8 values and store them in the tflite model, mapping them back to floating point at run time. The two variants differ, however, in the strategies they use for compression and for that runtime conversion.

Quantized Weight Compression (for size)

(Figure: quantized weight compression for size)

Decompression takes the int8 weights stored in the model, converts them back to float32 and scales the range back to its original values, then performs standard floating-point multiplication. The benefit is a compressed network: the model file is smaller.
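A small sketch of what this decompression amounts to, using the common affine mapping r = scale * (q - zero_point); the scale and zero-point values below are illustrative, not taken from a real model:

```python
import numpy as np

def dequantize(q_weights: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map stored int8 weights back to float32: r = scale * (q - zero_point)."""
    return scale * (q_weights.astype(np.float32) - zero_point)

# Illustrative int8 weights with a per-tensor scale of 0.02 and zero point 0.
q = np.array([-128, 0, 64, 127], dtype=np.int8)
w = dequantize(q, scale=0.02, zero_point=0)
print(w)  # [-2.56  0.    1.28  2.54] -- ready for standard float multiplication
```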

Quantized Inference Calculation (for latency)

The other option quantization offers is to do away with floating-point computation entirely in order to speed up inference; concretely, the floating-point arithmetic that produces each output is converted into integer multiplication. Before going further, a brief digression into some related background: floating point vs. fixed point.

Floating point vs Fixed Point

A floating-point number represents a real value with a mantissa and an exponent, and both can vary. The exponent allows a wide range of numbers to be represented, while the mantissa provides the precision. The decimal point can "float", i.e. appear at any position relative to the digits.
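For instance, in IEEE 754 single precision the 32 bits split into a sign bit, an 8-bit exponent and a 23-bit mantissa; the small check below (plain Python, using the struct module) unpacks those fields:

```python
import struct

def float32_fields(x: float):
    """Unpack a float32 into its sign, (unbiased) exponent and mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127   # 8-bit exponent, bias of 127
    mantissa = bits & 0x7FFFFF               # 23-bit mantissa (implicit leading 1)
    return sign, exponent, mantissa

print(float32_fields(0.15625))  # (0, -3, 2097152): 1.25 * 2**-3 = 0.15625
```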

(Figure: floating point vs fixed point)

If the exponent is replaced with a fixed scale factor, values can be represented as integers relative to this constant (i.e. as integer multiples of the constant). The position of the decimal point is now "fixed" by the scale factor. Returning to the number-line analogy: the value of the scale factor determines the minimum distance between two ticks on the line, and the number of such ticks depends on how many bits we use for the integer (for 8-bit fixed point, 256, i.e. 2^8). These two knobs let us trade range against precision. Any value that is not an integer multiple of the constant is rounded to the nearest tick.
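A small numeric sketch of that range/precision trade-off, assuming a signed 8-bit representation and a hypothetical scale factor of 0.1:

```python
import numpy as np

SCALE = 0.1             # hypothetical fixed scale factor (distance between ticks)
QMIN, QMAX = -128, 127  # signed 8-bit fixed point: 2**8 = 256 representable ticks

def to_fixed_point(x: float) -> int:
    """Round a real value to the nearest integer multiple of SCALE, then clip."""
    return int(np.clip(round(x / SCALE), QMIN, QMAX))

print(to_fixed_point(3.14159))  # 31   -> represents 3.1 (nearest tick)
print(to_fixed_point(-20.0))    # -128 -> clipped: outside the range [-12.8, 12.7]
# A larger scale factor widens the representable range but coarsens the precision.
```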

Pseudocode

(Figure: quantized inference calculation for latency)

For example, if we deliberately reduce the precision of each input to the dot product, we no longer need the full range of 32-bit floating-point values; the entire inference can then be implemented with integers (fixed-point arithmetic), i.e. with integer multiplication.
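A minimal sketch of such an integer-only dot product, under the simplifying assumption of symmetric quantization (zero point 0); the scales and function names here are illustrative, not the actual TFLite kernels:

```python
import numpy as np

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Symmetric quantization to int8: q = round(x / scale)."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def quantized_dot(qa: np.ndarray, qb: np.ndarray,
                  scale_a: float, scale_b: float) -> float:
    """Accumulate the product in int32, then rescale once at the end."""
    acc = np.dot(qa.astype(np.int32), qb.astype(np.int32))  # integer-only math
    return float(acc) * (scale_a * scale_b)                  # back to a real value

a = np.array([0.5, -1.2, 2.0], dtype=np.float32)
b = np.array([1.0,  0.3, -0.7], dtype=np.float32)
qa, qb = quantize(a, 0.02), quantize(b, 0.01)
print(quantized_dot(qa, qb, 0.02, 0.01))  # close to np.dot(a, b) = -1.26
```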

Putting it together

The figure below (taken from another blog post) puts the PTQ-related techniques together and summarizes them well.

(Figure: post-training quantization)

Further Reading

  • TinyML in Practice 1: What & Why TinyML?
  • TinyML in Practice 2: How TinyML Works?
  • TinyML in Practice 3: Putting Cattle Motion Posture Recognition into Production
  • TinyML-4: (Quantization) Why int8 Is Enough for ML
  • Edx HarvardX TinyML2-1.4: Machine Learning on Mobile and Edge IoT Devices - Part 2
  • How to accelerate and compress neural networks with quantization
  • 8-Bit Quantization and TensorFlow Lite: Speeding up mobile inference with low precision

Quantization in TF-Lite:

Jacob, Benoit, et al. “Quantization and training of neural networks for efficient integer-arithmetic-only inference.” arXiv preprint arXiv:1712.05877 (2017).

Quantized training

  • Gupta, Suyog, et al. “Deep learning with limited numerical precision.” International Conference on Machine Learning. 2015.
  • Courbariaux, Matthieu, Yoshua Bengio, and Jean-Pierre David. “Training deep neural networks with low precision multiplications.” arXiv preprint arXiv:1412.7024 (2014).
  • Wu, Shuang, et al. “Training and inference with integers in deep neural networks.” arXiv preprint arXiv:1802.04680 (2018).

Extremely low-bit quantization

  • Zhu, Chenzhuo, et al. “Trained ternary quantization.” arXiv preprint arXiv:1612.01064 (2016).
  • Courbariaux, Matthieu, et al. “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1.” arXiv preprint arXiv:1602.02830 (2016).
  • Rastegari, Mohammad, et al. “Xnor-net: Imagenet classification using binary convolutional neural networks.” European Conference on Computer Vision. Springer, Cham, 2016.

Quantization for compression

Han, Song, Huizi Mao, and William J. Dally. “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding.” arXiv preprint arXiv:1510.00149 (2015).
