PyTorch Mixed Precision: An Overview

https://www.zhihu.com/people/superjie13

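This article tests mixed precision in PyTorch. It has two parts: the first is an overview of how to use mixed precision, and the second is a practical test. Reference: Automatic Mixed Precision, on the official PyTorch site.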

Overview of using mixed precision

Typically, automatic mixed precision training uses torch.cuda.amp.autocast together with torch.cuda.amp.GradScaler.

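1.1 First, instantiate torch.cuda.amp.autocast(enabled=True) as a context manager or decorator so that the script runs in mixed precision. Note: autocast should normally wrap only the forward pass (including the loss computation), not the backward pass. Backward ops run in the same data type that was used for the corresponding forward ops.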

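1.2 Use gradient scaling to keep gradients from underflowing to zero during the backward pass, since float16 cannot represent very small magnitudes. torch.cuda.amp.GradScaler() performs gradient scaling automatically. Note: because GradScaler() scales the gradients, each parameter's gradient must be unscaled before the optimizer updates the parameters, so that the learning rate is not affected.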
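
To make the underflow problem concrete, here is a minimal sketch that uses only the Python standard library (the struct module's half-precision format) rather than PyTorch. The gradient value 1e-8 and the helper to_float16 are hypothetical, chosen for illustration. The scale factor 2**16 matches GradScaler's default initial scale. In real AMP training the loss is scaled and the gradients inherit the factor through the chain rule; here the gradient is scaled directly to keep the example short.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest float16 value (returned as a float)."""
    return struct.unpack("e", struct.pack("e", x))[0]


grad = 1e-8          # hypothetical tiny gradient, below float16's smallest subnormal
scale = 2.0 ** 16    # GradScaler's default initial scale factor

unscaled_fp16 = to_float16(grad)          # stored directly in float16: underflows
scaled_fp16 = to_float16(grad * scale)    # scaled first: representable in float16
recovered = scaled_fp16 / scale           # unscaled again in higher precision

print("float16 without scaling:", unscaled_fp16)
print("float16 with scaling:   ", scaled_fp16)
print("after unscaling:        ", recovered)
```

Without scaling the value is lost entirely, while the scaled value survives the float16 round trip and unscaling recovers it to within float16's rounding error.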


import torchvision
import torch
import torch.cuda.amp
import gc
import time

# Timing utilities
start_time = None

def start_timer():
    global start_time
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.synchronize()  # synchronize first so the measured time reflects actual GPU execution
    start_time = time.time()

def end_timer_and_print(local_msg):
    torch.cuda.synchronize()
    end_time = time.time()
    print("\n" + local_msg)
    print("Total execution time = {:.3f} sec".format(end_time - start_time))
    print("Max memory used by tensors = {} bytes".format(torch.cuda.max_memory_allocated()))

num_batches = 50
batch_size = 70
epochs = 3

# Create random training data
data = [torch.randn(batch_size, 3, 224, 224, device="cuda") for _ in range(num_batches)]
targets = [torch.randint(0, 1000, size=(batch_size, ), device='cuda') for _ in range(num_batches)]
# Create a model
net = torchvision.models.resnext50_32x4d().cuda()
# Define the loss function
loss_fn = torch.nn.CrossEntropyLoss().cuda()
# Define the optimizer
opt = torch.optim.SGD(net.parameters(), lr=0.001)

# Whether to train with mixed precision
use_amp = True

# Constructs scaler once, at the beginning of the convergence run, using default args.
# If your network fails to converge with default GradScaler args, please file an issue.
# The same GradScaler instance should be used for the entire convergence run.
# If you perform multiple convergence runs in the same script, each run should use
# a dedicated fresh GradScaler instance.  GradScaler instances are lightweight.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast(enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        # Scale the loss, then call backward() on the scaled loss to create scaled gradients.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(opt)

        # Updates the scale for next iteration.
        scaler.update()
        opt.zero_grad(set_to_none=True) # set_to_none=True here can modestly improve performance
end_timer_and_print("Mixed precision:")
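
The loop above lets scaler.step() unscale the gradients implicitly. When the gradients need to be inspected or modified before the update (gradient clipping is the usual case), they can be unscaled explicitly with scaler.unscale_(opt) first. The following is a minimal sketch of that pattern, not part of the benchmark above: the toy nn.Linear model, the tensor shapes, and the max_norm threshold are all illustrative assumptions, and AMP is disabled when no GPU is available so that the sketch still runs.

```python
import torch

# Assumption: fall back to CPU with AMP disabled when no GPU is present.
use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

torch.manual_seed(0)
model = torch.nn.Linear(16, 4).to(device)  # hypothetical toy model
opt = torch.optim.SGD(model.parameters(), lr=0.001)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(8, 16, device=device)
targets = torch.randint(0, 4, size=(8,), device=device)
max_norm = 1.0  # illustrative clipping threshold

with torch.cuda.amp.autocast(enabled=use_amp):
    loss = loss_fn(model(inputs), targets)

# Backward pass on the scaled loss produces scaled gradients.
scaler.scale(loss).backward()

# Unscale in place so that clipping sees the true gradient magnitudes.
scaler.unscale_(opt)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# step() detects that unscale_ was already called and does not unscale again;
# it still skips the update if the gradients contain infs or NaNs.
scaler.step(opt)
scaler.update()
opt.zero_grad(set_to_none=True)
```

Clipping before unscaling would compare max_norm against gradients that are still multiplied by the scale factor, so the threshold would no longer mean what it says.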


Mixed precision tests

Test environment: Ubuntu 18.04, PyTorch 1.7.1, Python 3.7, RTX 2080 (8 GB)

2.1 use_amp = False

batch size = 40

2.2 use_amp = True

batch size = 40

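Experiments 2.1 and 2.2 show that, at batch size 40, training without mixed precision uses 7011 MB of GPU memory and takes 47.55 s, whereas training with mixed precision uses 4997 MB and takes 27.006 s. In this configuration, mixed precision reduces memory use by about 28.73% and run time by about 43.21%. This means we can use a larger batch size to improve throughput.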
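
As a quick check of the arithmetic, the percentages can be recomputed from the four measurements reported for experiments 2.1 and 2.2. The helper name percent_saving is hypothetical.

```python
def percent_saving(baseline, improved):
    """Relative reduction from baseline to improved, as a percentage."""
    return (baseline - improved) / baseline * 100.0


# Measurements reported for batch size 40
memory_saving = percent_saving(7011, 4997)     # GPU memory in MB
time_saving = percent_saving(47.55, 27.006)    # run time in seconds

print(f"memory saved: {memory_saving:.2f}%")
print(f"time saved:   {time_saving:.2f}%")
```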

2.3 use_amp = True

batch size = 70
