TensorRT-LLM vs. OpenPPL LLM

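Supported Models and Features

PPL LLM supports only three model families: baichuan, chatglm, and llama. TensorRT-LLM supports almost all large models.

Model Import

TensorRT-LLM imports the original Hugging Face model directly and converts it into its own structures in memory.

PPL LLM first requires converting the parameters of the original Hugging Face model. For LLaMA, the model also has to be split and then merged. Finally, the model must be exported to ONNX, with support for sharding according to the available GPU resources.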

ppl.pmx/model_zoo/llama/huggingface at master · openppl-public/ppl.pmx (github.com)

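Summary: PPL LLM requires running several steps, which is cumbersome. TensorRT-LLM is more convenient to use.

Model Quantization

TensorRT-LLM uses offline quantization and supports more quantization methods, such as SmoothQuant, weight-only, and AWQ.

PPL LLM uses real-time quantization (i8i8). It supports quantizing the whole network together as well as quantizing a single shard.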
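
The difference between offline and real-time quantization can be illustrated with a minimal sketch. It is not code from either framework: quantize_symmetric and int8_dot are hypothetical helper names, and plain symmetric int8 scaling stands in for the more elaborate methods (SmoothQuant, AWQ) named above. The weight scale is computed once ahead of time, while the activation scale is computed from each incoming input at inference time, which is roughly what an i8i8 mode implies.

```python
def quantize_symmetric(values, bits=8):
    """Symmetric quantization: return (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


# Offline step: the weights are quantized once, before serving starts.
weight_row = [0.50, -1.27, 0.03, 0.90]
w_codes, w_scale = quantize_symmetric(weight_row)


def int8_dot(activations):
    """Real-time step: the activation scale comes from the live input."""
    a_codes, a_scale = quantize_symmetric(activations)
    acc = sum(a * w for a, w in zip(a_codes, w_codes))  # integer accumulate
    return acc * a_scale * w_scale                      # rescale to float


x = [1.0, 2.0, -0.5, 0.25]
exact = sum(a * w for a, w in zip(x, weight_row))
approx = int8_dot(x)
print("float result:", exact)
print("int8 result: ", approx)
```

The two printed values should agree closely; the small gap is the quantization error that calibration-based offline methods try to reduce.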

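Model Deploy

Once quantization finishes, TensorRT-LLM does not need to deploy an intermediate model; it goes straight into the compiler. Some models support ONNX visualization.

PPL LLM needs neither a deploy step nor compilation; it invokes operators directly from the ONNX model.

Summary: for TensorRT-LLM, other visualization options need to be considered, or ONNX visualization support needs to be added for some models.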

User Workflow

PPL LLM

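W8A16/W16A16: original model --> model conversion --> export ONNX with ppl.pmx (weight int8 quantization optional) --> deploy the cloud service

W8A8: original model --> model conversion --> export ONNX with ppl.pmx (weight int8 quantization optional) --> deploy the cloud service (real-time quantization, select i8i8 mode)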

ppl.llm.serving/docs/llama_guide.md at master · openppl-public/ppl.llm.serving (github.com)

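TensorRT-LLM

Original model --> quantization --> compilation --> build and export the engine (similar to our shmodel, containing the various quantization settings) --> run the engine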

NVIDIA/TensorRT-LLM: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. (github.com)

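Multi-GPU Parallelism

TensorRT-LLM uses multiple GPUs through parameters: --gpus_per_node is the number of GPUs per machine (8 by default), and --world_size is the number of parallel processes (one process per GPU, so it is effectively the number of GPUs used on the machine). It does not split the model; the build process uses the compute capability of multiple GPUs.

PPL LLM uses the MP and tensor_parallel_size parameters (they are used in different stages and must be set to the same value) and directly splits the model into MP parts.

Both frameworks use tensor parallelism.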
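
As a rough illustration of what tensor parallelism means here, the sketch below splits one weight matrix column-wise into mp shards, lets each simulated rank compute its slice, and concatenates the partial results. It is a toy in pure Python, not how either framework implements it: split_columns and tensor_parallel_matvec are hypothetical names, and the list concatenation stands in for an all-gather across GPUs.

```python
def matvec(x, w):
    """Multiply a length-k vector x by a k x n matrix w (list of rows)."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]


def split_columns(w, mp):
    """Split w column-wise into mp equal shards, one per rank."""
    n = len(w[0])
    if n % mp != 0:
        raise ValueError("column count must be divisible by mp")
    step = n // mp
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(mp)]


def tensor_parallel_matvec(x, w, mp):
    """Each rank multiplies by its own shard; results are concatenated."""
    shards = split_columns(w, mp)
    partials = [matvec(x, shard) for shard in shards]  # one per simulated rank
    return [v for part in partials for v in part]      # stand-in for all-gather


x = [1, 2]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
single = matvec(x, w)
sharded = tensor_parallel_matvec(x, w, mp=2)
print(single == sharded)
```

Each shard holds only 1/mp of the weights, which is why the number of parts has to match the number of ranks launched at serving time.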

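Framework Dependencies

TensorRT-LLM depends on TensorRT, but mainly for single operators (convolution, activation functions, GEMM, and so on); the fused operators ship with TensorRT-LLM itself.

PPL LLM has no dependencies.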
