Beating AITemplate and matching TensorRT: stable-fast, an acceleration framework for the whole Stable Diffusion model family, makes its debut

2023-12-13 12:51:25

Author | Ceeeee

Source | https://zhuanlan.zhihu.com/p/669610362

Editor | GiantPandaCV


GitHub - chengzeyi/stable-fast: An ultra lightweight inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.

Let me start with a comparison (see the performance table below) and the GitHub link above. Full disclosure: I am the author of stable-fast. All comparison benchmarks were run on the latest versions of the frameworks involved, under identical and fair conditions.

What is stable-fast?

stable-fast is an ultra-lightweight inference optimization framework for Hugging Face Diffusers on NVIDIA GPUs. It relies on several key techniques and features to deliver very fast inference:

  • CUDNN convolution fusion: stable-fast implements a complete and compatible set of fused CUDNN convolution operators covering all Conv + Bias + Add + Act computation patterns.
  • Low-precision & fused GEMM: stable-fast implements a series of fused GEMM operators that compute in fp16, which is faster than PyTorch's default behavior (reading and writing fp16 but computing in fp32).
  • NHWC & fused GroupNorm: stable-fast implements an efficient fused NHWC GroupNorm + GELU operator via OpenAI's Triton, eliminating unnecessary memory permutations under the channels-last memory format.
  • Fully traced model: stable-fast improves the torch.jit.trace interface so it can handle complex models; most parts of a StableDiffusionPipeline can be traced and converted to TorchScript. It is more stable than torch.compile, has lower CPU overhead, and supports ControlNet and LoRA. (A minimal sketch of the channels-last + tracing idea follows this list.)
  • CUDA Graph: stable-fast can capture the UNet into a CUDA Graph, which reduces CPU overhead when the batch size is small.
  • Fused multi-head attention: stable-fast simply uses xformers and makes it TorchScript-compatible.
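
To make the channels-last and tracing ideas above concrete, here is a minimal, self-contained sketch. It is not stable-fast's internal code, just an illustration of the general technique: switch a small convolutional module to the NHWC memory format and freeze its forward pass with torch.jit.trace.

```python
import torch

# Minimal illustration (not stable-fast internals): run a small conv block in
# channels-last (NHWC) memory format and trace it into TorchScript.
conv = torch.nn.Sequential(
    torch.nn.Conv2d(4, 8, kernel_size=3, padding=1),
    torch.nn.SiLU(),
).half().cuda().to(memory_format=torch.channels_last)

example = torch.randn(
    1, 4, 64, 64, device='cuda', dtype=torch.float16
).to(memory_format=torch.channels_last)

with torch.no_grad():
    traced = torch.jit.trace(conv, example)  # freeze the graph into TorchScript
    out = traced(example)
print(out.shape, out.is_contiguous(memory_format=torch.channels_last))
```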

At the same time, stable-fast has the fastest model compilation of all these frameworks: unlike AITemplate and TensorRT, which take minutes to compile a model, stable-fast finishes in about 10 seconds. This is a significant advantage!

More importantly, stable-fast is fully compatible with SD 1.5, SD 2.1, SDXL, LCM, and even the latest SDXL-Turbo!
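
As a hedged sketch (not from the original post) of what this looks like in practice, the same compile() API shown in the usage example later in this article should apply to an SDXL pipeline as well; the model id below is the standard Hugging Face Hub id and is only an example:

```python
import torch
from diffusers import StableDiffusionXLPipeline
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile, CompilationConfig)

# Example only: standard SDXL base checkpoint from the Hugging Face Hub.
pipe = StableDiffusionXLPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0', torch_dtype=torch.float16)
pipe.to('cuda')

config = CompilationConfig.Default()
config.enable_xformers = True   # assumes xformers is installed
config.enable_triton = True     # assumes Triton is installed
config.enable_cuda_graph = True
pipe = compile(pipe, config)

image = pipe('a photo of an astronaut riding a horse',
             height=1024, width=1024, num_inference_steps=30).images[0]
```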

Also, if you are an SD Next user, the official dev branch already supports stable-fast: https://github.com/vladmandic/automatic/tree/dev

If you are a ComfyUI user, there is a plugin available (it will be slower than SD Next): GitHub - gameltb/ComfyUI_stable_fast

Performance

RTX 4080 (512x512, batch size 1, fp16, tcmalloc enabled, in WSL2)

| Framework                              | SD 1.5    | SD 2.1    | SD XL (1024x1024) |
| -------------------------------------- | --------- | ----------| ----------------- |
| Vanilla PyTorch (2.1.0 cu118)          | 29.5 it/s | 32.4 it/s | 4.6 it/s          |
| torch.compile (2.1.0 cu118, NHWC UNet) | 40.0 it/s | 44.0 it/s | 6.1 it/s          |
| AITemplate                             | 44.2 it/s | untested  | untested          |
| OneFlow                                | 50.3 it/s | untested  | untested          |
| AUTO1111 WebUI                         | 17.2 it/s | 15.2 it/s | 3.6 it/s          |
| AUTO1111 WebUI (with SDPA)             | 24.5 it/s | 26.1 it/s | 4.3 it/s          |
| TensorRT (AUTO1111 WebUI)              | 40.8 it/s | untested  | untested          |
| TensorRT Official Demo                 | 52.6 it/s | untested  | untested          |
| Stable Fast (with xformers & Triton)   | 50.5 it/s | 53.3 it/s | 8.3 it/s          |

The comparison speaks for itself: stable-fast is considerably faster than AITemplate, which advertises a very advanced architecture, and is almost as fast as TensorRT. On top of that, stable-fast is fully open source and still under active optimization. I believe beating TensorRT is only a matter of time.
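
For readers who want to reproduce an it/s-style number themselves, here is a minimal timing sketch (this is not necessarily the exact methodology behind the table above): warm the pipeline up, then time a fixed number of denoising steps.

```python
import time
import torch

def benchmark_its(pipe, steps=30, warmup=2, **kwargs):
    # Warm-up calls trigger compilation / autotuning and are excluded from timing.
    for _ in range(warmup):
        pipe(num_inference_steps=steps, **kwargs)
    torch.cuda.synchronize()
    begin = time.time()
    pipe(num_inference_steps=steps, **kwargs)
    torch.cuda.synchronize()
    return steps / (time.time() - begin)

# e.g. print(benchmark_its(model, prompt='a photo of a cat', height=512, width=512))
```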

Installation

See the project's GitHub page for details. You can install the prebuilt Linux/Windows wheels directly (download them from the project's Releases page) or build from source yourself; either way it is simple and fast, and works out of the box.

```bash
# Make sure you have CUDNN/CUBLAS installed.
# https://developer.nvidia.com/cudnn
# https://developer.nvidia.com/cublas

# Install PyTorch with CUDA and the other packages first
pip3 install 'torch>=1.12.0' 'diffusers>=0.19.3' 'xformers>=0.0.20' 'triton>=2.1.0'
# Windows users: Triton might not be available, you can skip it.

# (Optional) Makes the build much faster
pip3 install ninja

# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types.
# You can also install the latest stable release from PyPI:
# pip3 install -v -U stable-fast
pip3 install -v -U 'git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast'
# (this can take dozens of minutes)
```
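
After installation, a quick smoke test (a minimal check I added here, assuming the package is importable as sfast, which the usage example below also relies on):

```python
import torch
import sfast  # should import without errors after a successful install

print(torch.__version__, torch.cuda.is_available())
```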

Usage

Again, see the instructions on GitHub for details; here is a minimal, most basic example:

```python
import time
import torch
from diffusers import (StableDiffusionPipeline,
                       EulerAncestralDiscreteScheduler)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile, CompilationConfig)

def load_model():
    model = StableDiffusionPipeline.from_pretrained(
        'runwayml/stable-diffusion-v1-5',
        torch_dtype=torch.float16)
    model.scheduler = EulerAncestralDiscreteScheduler.from_config(
        model.scheduler.config)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    return model

model = load_model()

config = CompilationConfig.Default()
# xformers and Triton are suggested for achieving best performance.
try:
    import xformers
    config.enable_xformers = True
except ImportError:
    print('xformers not installed, skip')
try:
    import triton
    config.enable_triton = True
except ImportError:
    print('Triton not installed, skip')
# CUDA Graph is suggested for small batch sizes and small resolutions
# to reduce CPU overhead.
config.enable_cuda_graph = True
model = compile(model, config)

kwarg_inputs = dict(
    prompt=
    '(masterpiece:1,2), best quality, masterpiece, best detail face, a beautiful girl',
    height=512,
    width=512,
    num_inference_steps=30,
    num_images_per_prompt=1,
)

# NOTE: Warm it up.
# The initial calls will trigger compilation and might be very slow.
# After that, it should be very fast.
for _ in range(3):
    output_image = model(**kwarg_inputs).images[0]

# Let's see it!
# Note: Progress bar might work incorrectly due to the async nature of CUDA.
begin = time.time()
output_image = model(**kwarg_inputs).images[0]
print(f'Inference time: {time.time() - begin:.3f}s')

# Let's view it in the terminal!
from sfast.utils.term_image import print_image
print_image(output_image, max_width=80)
```
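
As a small follow-up (not part of the original example): the compiled model behaves like a regular Diffusers pipeline, so you can save the result and keep calling it; later calls stay fast because compilation already happened during warm-up.

```python
# Save the last generated image and reuse the already-compiled pipeline.
output_image.save('output.png')

for prompt in ['a corgi wearing sunglasses',
               'a watercolor painting of a lighthouse']:
    image = model(**{**kwarg_inputs, 'prompt': prompt}).images[0]
```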

- The End -

GiantPandaCV

Welcome to follow us and grow together!
