Author: Ceeeee
Source: https://zhuanlan.zhihu.com/p/669610362
Editor: GiantPandaCV
GitHub - chengzeyi/stable-fast: An ultra lightweight inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
To start, here are a comparison chart and the GitHub link. Full disclosure: I am the author of stable-fast. All comparison tests were run on the latest versions of these frameworks, fairly and impartially.
What is stable-fast?
stable-fast is an ultra lightweight inference optimization framework for Hugging Face Diffusers on NVIDIA GPUs. It leverages several key techniques and features to deliver ultra-fast inference:
- CUDNN convolution fusion: stable-fast implements a complete and compatible set of CUDNN convolution fusion operators covering all Conv + Bias + Add + Act computation patterns.
- Low-precision & fused GEMM: stable-fast implements a series of fused GEMM operators that compute in fp16 precision, which is faster than PyTorch's default behavior (read & write in fp16, compute in fp32).
- NHWC & fused GroupNorm: stable-fast implements an efficient fused NHWC GroupNorm + GELU operator with OpenAI's Triton, eliminating the unnecessary memory permutations under the channels-last memory format.
- Fully traced model: stable-fast improves the torch.jit.trace interface to handle tracing of complex models, so nearly every part of StableDiffusionPipeline can be traced and converted to TorchScript. It is more stable than torch.compile, has lower CPU overhead, and supports ControlNet and LoRA.
- CUDA Graph: stable-fast can capture the UNet into CUDA Graph format, which reduces CPU overhead at small batch sizes (see the sketch after this list).
- Fused multi-head attention: stable-fast simply uses xformers and makes it compatible with TorchScript.
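To make the CUDA Graph item concrete, here is a minimal sketch using PyTorch's public torch.cuda.CUDAGraph API on a toy convolutional block standing in for the UNet. This is my own illustration of the idea, not stable-fast's actual capture code:

```python
import torch

# A stand-in for the UNet: any module with static input shapes works.
net = torch.nn.Sequential(
    torch.nn.Conv2d(4, 64, 3, padding=1),
    torch.nn.SiLU(),
    torch.nn.Conv2d(64, 4, 3, padding=1),
).half().cuda().eval()

static_input = torch.randn(1, 4, 64, 64, device='cuda', dtype=torch.float16)

with torch.no_grad():
    # Warm up on a side stream before capture, as the PyTorch docs recommend.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            net(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass into the graph; kernels are recorded, not run.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = net(static_input)

# Replaying re-launches every captured kernel with a single CPU-side call,
# which is where the CPU-overhead savings at small batch sizes come from.
static_input.copy_(torch.randn_like(static_input))
graph.replay()
print(static_output.shape)
```

Replay requires static input/output buffers, which is why new data is copied into `static_input` instead of being passed in directly.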
Meanwhile, stable-fast has the fastest model compilation speed of all these frameworks. Unlike AITemplate and TensorRT, which take minutes to compile a model, stable-fast finishes within 10 seconds. This is a significant advantage!
More importantly, stable-fast is fully compatible with SD 1.5, SD 2.1, SDXL, LCM, and even the latest SDXL-Turbo!
Also, if you are an SD Next user, stable-fast is already supported officially on the dev branch: https://github.com/vladmandic/automatic/tree/dev
If you are a ComfyUI user, there is a plugin available (it will be slower than SD Next): GitHub - gameltb/ComfyUI_stable_fast
Performance
RTX 4080 (512x512, batch size 1, fp16, tcmalloc enabled, in WSL2)
| Framework | SD 1.5 | SD 2.1 | SD XL (1024x1024) |
| -------------------------------------- | --------- | ----------| ----------------- |
| Vanilla PyTorch (2.1.0 cu118) | 29.5 it/s | 32.4 it/s | 4.6 it/s |
| torch.compile (2.1.0 cu118, NHWC UNet) | 40.0 it/s | 44.0 it/s | 6.1 it/s |
| AITemplate | 44.2 it/s | untested | untested |
| OneFlow | 50.3 it/s | untested | untested |
| AUTO1111 WebUI | 17.2 it/s | 15.2 it/s | 3.6 it/s |
| AUTO1111 WebUI (with SDPA) | 24.5 it/s | 26.1 it/s | 4.3 it/s |
| TensorRT (AUTO1111 WebUI) | 40.8 it/s | untested | untested |
| TensorRT Official Demo | 52.6 it/s | untested | untested |
| Stable Fast (with xformers & Triton) | 50.5 it/s | 53.3 it/s | 8.3 it/s |
The comparison speaks for itself: stable-fast is much faster than AITemplate, which claims a highly advanced architecture, and comes close to matching TensorRT. On top of that, stable-fast is fully open source and under active optimization. I believe beating TensorRT is only a matter of time.
Installation
See the project's GitHub page for details. You can install prebuilt wheels for Linux and Windows directly (downloaded from the project's Releases page), or build from source yourself. Either way it is simple and fast, and works out of the box.
```bash
# Make sure you have CUDNN/CUBLAS installed.
# https://developer.nvidia.com/cudnn
# https://developer.nvidia.com/cublas

# Install PyTorch with CUDA and other packages first
pip3 install 'torch>=1.12.0' 'diffusers>=0.19.3' 'xformers>=0.0.20' 'triton>=2.1.0'
# Windows users: Triton might not be available, you can skip it.

# (Optional) Makes the build much faster
pip3 install ninja

# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
# You can also install the latest stable release from PyPI
# pip3 install -v -U stable-fast
pip3 install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast
# (this can take dozens of minutes)
```
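A note on the TORCH_CUDA_ARCH_LIST comment above: if you build on one machine and run on GPUs of a different generation, you can pin the target compute capabilities at build time. The values below are illustrative (RTX 30-series is 8.6, RTX 40-series is 8.9); match them to your own deployment GPUs:

```bash
# Hypothetical example: build a wheel covering both sm_86 and sm_89 GPUs.
TORCH_CUDA_ARCH_LIST="8.6;8.9" pip3 install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast
```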
Usage
Again, see the README on GitHub for details; here is a minimal example:
```python
import time

import torch
from diffusers import (StableDiffusionPipeline,
                       EulerAncestralDiscreteScheduler)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile, CompilationConfig)


def load_model():
    model = StableDiffusionPipeline.from_pretrained(
        'runwayml/stable-diffusion-v1-5',
        torch_dtype=torch.float16)
    model.scheduler = EulerAncestralDiscreteScheduler.from_config(
        model.scheduler.config)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    return model


model = load_model()

config = CompilationConfig.Default()
# xformers and Triton are suggested for achieving best performance.
try:
    import xformers
    config.enable_xformers = True
except ImportError:
    print('xformers not installed, skip')
try:
    import triton
    config.enable_triton = True
except ImportError:
    print('Triton not installed, skip')
# CUDA Graph is suggested for small batch sizes and small resolutions
# to reduce CPU overhead.
config.enable_cuda_graph = True

model = compile(model, config)

kwarg_inputs = dict(
    prompt=
    '(masterpiece:1,2), best quality, masterpiece, best detail face, a beautiful girl',
    height=512,
    width=512,
    num_inference_steps=30,
    num_images_per_prompt=1,
)

# NOTE: Warm it up.
# The initial calls will trigger compilation and might be very slow.
# After that, it should be very fast.
for _ in range(3):
    output_image = model(**kwarg_inputs).images[0]

# Let's see it!
# Note: Progress bar might work incorrectly due to the async nature of CUDA.
begin = time.time()
output_image = model(**kwarg_inputs).images[0]
print(f'Inference time: {time.time() - begin:.3f}s')

# Let's view it in terminal!
from sfast.utils.term_image import print_image
print_image(output_image, max_width=80)
```
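To reproduce the it/s metric from the performance table above, you can time a few full runs of the warmed-up pipeline and divide total denoising steps by elapsed time. A rough sketch; the run count and averaging choices here are mine, not from the project:

```python
import time

import torch

# Assumes `model` and `kwarg_inputs` from the example above, already warmed up.
runs = 5
steps = kwarg_inputs['num_inference_steps']

torch.cuda.synchronize()  # make sure no pending GPU work skews the timer
begin = time.time()
for _ in range(runs):
    model(**kwarg_inputs)
torch.cuda.synchronize()
elapsed = time.time() - begin

# "it/s" here means denoising steps per second, as in the table above.
print(f'{runs * steps / elapsed:.1f} it/s')
```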
- The End -