Author: Ceeeee
Source: https://zhuanlan.zhihu.com/p/669610362
Editor: GiantPandaCV
GitHub - chengzeyi/stable-fast: An ultra lightweight inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
To start, here are a comparison chart and the GitHub link. Full disclosure: I am the author of stable-fast. All comparison tests were run on the latest versions of these frameworks, fairly and impartially.
What is stable-fast?
stable-fast is an ultra lightweight inference optimization framework for Hugging Face Diffusers on NVIDIA GPUs. It leverages several key techniques and features to deliver ultra-fast inference:
- CUDNN convolution fusion: stable-fast implements a complete and compatible set of CUDNN convolution fusion operators covering all Conv + Bias + Add + Act computation patterns.
- Low-precision & fused GEMM: stable-fast implements a series of fused GEMM operators that compute in fp16 precision, which is faster than PyTorch's default behavior (read & write in fp16, compute in fp32).
- NHWC & fused GroupNorm: stable-fast implements an efficient fused NHWC GroupNorm + GELU operator with OpenAI's Triton, eliminating the unnecessary memory permutations under the channels-last memory format.
- Fully traced model: stable-fast improves the torch.jit.trace interface to handle tracing of complex models, so nearly every part of StableDiffusionPipeline can be traced and converted to TorchScript. It is more stable than torch.compile, has lower CPU overhead, and supports ControlNet and LoRA.
- CUDA Graph: stable-fast can capture the UNet into CUDA Graph format, which reduces CPU overhead at small batch sizes (see the sketch after this list).
- Fused multi-head attention: stable-fast simply uses xformers and makes it compatible with TorchScript.
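To make the CUDA Graph item concrete, here is a minimal sketch using PyTorch's public torch.cuda.CUDAGraph API on a toy convolutional block standing in for the UNet. This is my own illustration of the idea, not stable-fast's actual capture code:

```python
import torch

# A stand-in for the UNet: any module with static input shapes works.
net = torch.nn.Sequential(
    torch.nn.Conv2d(4, 64, 3, padding=1),
    torch.nn.SiLU(),
    torch.nn.Conv2d(64, 4, 3, padding=1),
).half().cuda().eval()

static_input = torch.randn(1, 4, 64, 64, device='cuda', dtype=torch.float16)

with torch.no_grad():
    # Warm up on a side stream before capture, as the PyTorch docs recommend.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            net(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one forward pass into the graph; kernels are recorded, not run.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = net(static_input)

# Replaying re-launches every captured kernel with a single CPU-side call,
# which is where the CPU-overhead savings at small batch sizes come from.
static_input.copy_(torch.randn_like(static_input))
graph.replay()
print(static_output.shape)
```

Replay requires static input/output buffers, which is why new data is copied into `static_input` instead of being passed in directly.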
Meanwhile, stable-fast has the fastest model compilation speed of all these frameworks. Unlike AITemplate and TensorRT, which take minutes to compile a model, stable-fast finishes within 10 seconds. This is a significant advantage!
More importantly, stable-fast is fully compatible with SD 1.5, SD 2.1, SDXL, LCM, and even the latest SDXL-Turbo!
Also, if you are an SD Next user, stable-fast is already supported officially on the dev branch: https://github.com/vladmandic/automatic/tree/dev
If you are a ComfyUI user, there is a plugin available (it will be slower than SD Next): GitHub - gameltb/ComfyUI_stable_fast
Performance
RTX 4080 (512x512, batch size 1, fp16, tcmalloc enabled, in WSL2)
| Framework | SD 1.5 | SD 2.1 | SD XL (1024x1024) |
| -------------------------------------- | --------- | ----------| ----------------- |
| Vanilla PyTorch (2.1.0 cu118) | 29.5 it/s | 32.4 it/s | 4.6 it/s |
| torch.compile (2.1.0 cu118, NHWC UNet) | 40.0 it/s | 44.0 it/s | 6.1 it/s |
| AITemplate | 44.2 it/s | untested | untested |
| OneFlow | 50.3 it/s | untested | untested |
| AUTO1111 WebUI | 17.2 it/s | 15.2 it/s | 3.6 it/s |
| AUTO1111 WebUI (with SDPA) | 24.5 it/s | 26.1 it/s | 4.3 it/s |
| TensorRT (AUTO1111 WebUI) | 40.8 it/s | untested | untested |
| TensorRT Official Demo | 52.6 it/s | untested | untested |
| Stable Fast (with xformers & Triton) | 50.5 it/s | 53.3 it/s | 8.3 it/s |
The comparison speaks for itself: stable-fast is much faster than AITemplate, which claims a highly advanced architecture, and comes close to matching TensorRT. On top of that, stable-fast is fully open source and under active optimization. I believe beating TensorRT is only a matter of time.
Installation
See the project's GitHub page for details. You can install prebuilt wheels for Linux and Windows directly (downloaded from the project's Releases page), or build from source yourself. Either way it is simple and fast, and works out of the box.
```bash
# Make sure you have CUDNN/CUBLAS installed.
# https://developer.nvidia.com/cudnn
# https://developer.nvidia.com/cublas

# Install PyTorch with CUDA and other packages first
pip3 install 'torch>=1.12.0' 'diffusers>=0.19.3' 'xformers>=0.0.20' 'triton>=2.1.0'
# Windows users: Triton might not be available, you can skip it.

# (Optional) Makes the build much faster
pip3 install ninja

# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
# You can also install the latest stable release from PyPI
# pip3 install -v -U stable-fast
pip3 install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast
# (this can take dozens of minutes)
```
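A note on the TORCH_CUDA_ARCH_LIST comment above: if you build on one machine and run on GPUs of a different generation, you can pin the target compute capabilities at build time. The values below are illustrative (RTX 30-series is 8.6, RTX 40-series is 8.9); match them to your own deployment GPUs:

```bash
# Hypothetical example: build a wheel covering both sm_86 and sm_89 GPUs.
TORCH_CUDA_ARCH_LIST="8.6;8.9" pip3 install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast
```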
Usage
Again, see the README on GitHub for details; here is a minimal example:
```python
import time

import torch
from diffusers import (StableDiffusionPipeline,
                       EulerAncestralDiscreteScheduler)
from sfast.compilers.stable_diffusion_pipeline_compiler import (
    compile, CompilationConfig)


def load_model():
    model = StableDiffusionPipeline.from_pretrained(
        'runwayml/stable-diffusion-v1-5',
        torch_dtype=torch.float16)
    model.scheduler = EulerAncestralDiscreteScheduler.from_config(
        model.scheduler.config)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    return model


model = load_model()

config = CompilationConfig.Default()
# xformers and Triton are suggested for achieving best performance.
try:
    import xformers
    config.enable_xformers = True
except ImportError:
    print('xformers not installed, skip')
try:
    import triton
    config.enable_triton = True
except ImportError:
    print('Triton not installed, skip')
# CUDA Graph is suggested for small batch sizes and small resolutions
# to reduce CPU overhead.
config.enable_cuda_graph = True

model = compile(model, config)

kwarg_inputs = dict(
    prompt=
    '(masterpiece:1,2), best quality, masterpiece, best detail face, a beautiful girl',
    height=512,
    width=512,
    num_inference_steps=30,
    num_images_per_prompt=1,
)

# NOTE: Warm it up.
# The initial calls will trigger compilation and might be very slow.
# After that, it should be very fast.
for _ in range(3):
    output_image = model(**kwarg_inputs).images[0]

# Let's see it!
# Note: Progress bar might work incorrectly due to the async nature of CUDA.
begin = time.time()
output_image = model(**kwarg_inputs).images[0]
print(f'Inference time: {time.time() - begin:.3f}s')

# Let's view it in terminal!
from sfast.utils.term_image import print_image
print_image(output_image, max_width=80)
```
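To reproduce the it/s metric from the performance table above, you can time a few full runs of the warmed-up pipeline and divide total denoising steps by elapsed time. A rough sketch; the run count and averaging choices here are mine, not from the project:

```python
import time

import torch

# Assumes `model` and `kwarg_inputs` from the example above, already warmed up.
runs = 5
steps = kwarg_inputs['num_inference_steps']

torch.cuda.synchronize()  # make sure no pending GPU work skews the timer
begin = time.time()
for _ in range(runs):
    model(**kwarg_inputs)
torch.cuda.synchronize()
elapsed = time.time() - begin

# "it/s" here means denoising steps per second, as in the table above.
print(f'{runs * steps / elapsed:.1f} it/s')
```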
- The End -