
TensorRT-LLM in Seven Days: Day 3

news 2026/2/9 6:37:51

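Today's goal is to get more familiar with TensorRT-LLM by connecting it with the underlying theory.

As the slides below show, TensorRT-LLM is a further layer of encapsulation on top of TensorRT that provides inference acceleration techniques such as request batching and quantization.

The figure below gives a clearer view of the TensorRT-LLM workflow, covering weight conversion, engine building, inference, and evaluation. In short, it comes down to three steps: weight conversion, engine building, and inference.

If you would rather not read the figure, see the AI-generated summary in the appendix.

The next figure also illustrates the full TRT-LLM inference pipeline well.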

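Multi-GPU Parallelism

Notably, TRT-LLM is designed with multi-GPU deployment in mind: the tp-size parameter controls the degree of tensor parallelism, and pp-size controls the degree of pipeline parallelism.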
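To make the two parameters concrete, here is a minimal sketch of how tp-size and pp-size together determine the number of GPUs, and how a global rank could be split into a pipeline stage and a tensor-parallel shard. The function name `rank_layout` and the contiguous tensor-parallel layout are assumptions made for illustration, not TensorRT-LLM's API.

```python
def rank_layout(tp_size: int, pp_size: int) -> list[tuple[int, int, int]]:
    """Return (rank, pp_rank, tp_rank) for every GPU in a tp x pp grid.

    Assumption: ranks within one tensor-parallel group are contiguous,
    so consecutive ranks belong to the same pipeline stage.
    """
    world_size = tp_size * pp_size  # total number of GPUs required
    return [(rank, rank // tp_size, rank % tp_size) for rank in range(world_size)]


if __name__ == "__main__":
    for rank, pp_rank, tp_rank in rank_layout(tp_size=2, pp_size=2):
        print(f"rank {rank}: pipeline stage {pp_rank}, tensor shard {tp_rank}")
```

With tp-size 2 and pp-size 2 this needs four GPUs: ranks 0 and 1 share the first pipeline stage, and ranks 2 and 3 share the second.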

Pipeline Parallelism
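Pipeline parallelism places consecutive groups of layers on different GPUs, so each device holds only a slice of the model depth. The sketch below shows one simple way to assign layers to stages; `split_layers` is a hypothetical helper, and it assumes the layer count divides evenly by the number of stages.

```python
def split_layers(num_layers: int, pp_size: int) -> list[range]:
    """Assign a contiguous block of layers to each pipeline stage."""
    if num_layers % pp_size != 0:
        # Simplifying assumption of this sketch, not a general requirement.
        raise ValueError("this sketch assumes num_layers is divisible by pp_size")
    per_stage = num_layers // pp_size
    return [range(s * per_stage, (s + 1) * per_stage) for s in range(pp_size)]


if __name__ == "__main__":
    for stage, layers in enumerate(split_layers(num_layers=32, pp_size=4)):
        print(f"stage {stage}: layers {layers.start}..{layers.stop - 1}")
```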

Quantization

Weight & Activation Quantization
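As a rough illustration of what quantizing a tensor means, the sketch below applies symmetric per-tensor INT8 quantization to a few values and then dequantizes them. This is a generic textbook scheme chosen for simplicity; it is not taken from the slides and does not claim to match the specific recipes TensorRT-LLM implements.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization to the int8 range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # one scale for the whole tensor
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Map int8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.42, -1.0, 0.07, 0.9]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    for w, r in zip(weights, restored):
        print(f"{w:+.4f} -> {r:+.4f} (error {abs(w - r):.5f})")
```

The rounding error per element is at most half the scale, which is why a tensor with a few large outliers loses more precision than one with a narrow value range.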

KV Cache Quantization
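The motivation for quantizing the KV cache is easiest to see with a quick size estimate: the cache stores one key and one value vector per layer, per KV head, per token, per sequence. The sketch below computes that footprint; the model dimensions are hypothetical round numbers chosen for the arithmetic, not figures from the slides.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int) -> int:
    """Total KV cache size in bytes; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 32 KV heads, head dimension 128.
    dims = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=2048, batch_size=8)
    fp16 = kv_cache_bytes(**dims, bytes_per_elem=2)
    int8 = kv_cache_bytes(**dims, bytes_per_elem=1)
    print(f"16-bit cache: {fp16 / 2**30:.1f} GiB")  # 8.0 GiB
    print(f" 8-bit cache: {int8 / 2**30:.1f} GiB")  # 4.0 GiB
```

Halving the bytes per element halves the cache, which leaves room for larger batches or longer sequences on the same GPU.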

Impact of Quantization on Accuracy

As the figure below shows, quantizing with FP8 retains relatively high accuracy.

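Performance Tuning

For performance tuning, TRT-LLM also uses a strategy similar to vLLM's continuous batching, which TensorRT-LLM calls in-flight batching.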
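The benefit of continuous batching can be seen in a toy simulation. The sketch below compares the number of decoding steps needed when a batch must finish entirely before the next one starts (static batching) against refilling a slot as soon as a request completes. It is a simplified model under stated assumptions: each step produces one token per active request, prefill cost is ignored, and both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths: list[int], max_batch: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps


def continuous_batching_steps(lengths: list[int], max_batch: int) -> int:
    """A finished request frees its slot for the next queued request."""
    queue = deque(lengths)
    active: list[int] = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One decoding step: every active request emits one token.
        active = [remaining - 1 for remaining in active]
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    output_lengths = [2, 8, 3, 7]  # tokens to generate per request
    print(static_batching_steps(output_lengths, max_batch=2))      # 15
    print(continuous_batching_steps(output_lengths, max_batch=2))  # 12
```

In the static case the short requests leave their slots idle while the long ones finish; the continuous scheduler reuses those slots immediately, so the same work completes in fewer steps.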

Appendix

The image describes an overview of the TensorRT-LLM (Large Language Model) workflow. Here's a summary of the key steps and elements involved:

1. Input Models:
- Various external models from frameworks like **HuggingFace**, **NeMo**, **AMMO**, and **Jax** can be used as inputs.

2. TRT-LLM Checkpoint:
- These external models are converted into a format defined by TRT-LLM using scripts like **convert_checkpoint.py** or **quantize.py**.
- This conversion fixes several key parameters, including:
  - Quantization method
  - Parallelization method
  - And more...

3. TRT-LLM Engines:
- After converting to the checkpoint format, the **trtllm-build** command is used to further convert and optimize the checkpoint into **TensorRT Engines**.
- During this step, important inference parameters are set, such as:
  - Max batch size
  - Max input length
  - Max output length
  - Max beam width
  - Plugin configuration
  - And others...
- Most of the automatic optimizations occur at this stage.
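Steps 2 and 3 can be sketched as two command lines assembled in Python. The flag spellings below (`--model_dir`, `--tp_size`, `--checkpoint_dir`, `--max_batch_size`, and so on) are written as I recall them from the example scripts and may differ between TensorRT-LLM versions, so treat them as assumptions and confirm with each tool's `--help`. The directory paths and the helper name `build_commands` are hypothetical.

```python
import shlex


def build_commands(model_dir: str, ckpt_dir: str, engine_dir: str,
                   tp_size: int = 1, pp_size: int = 1,
                   max_batch_size: int = 8, max_input_len: int = 1024,
                   max_beam_width: int = 1) -> tuple[list[str], list[str]]:
    """Return (convert, build) argument lists; nothing is executed here."""
    # Step 2: external model -> TRT-LLM checkpoint (parallelism fixed here).
    convert = [
        "python", "convert_checkpoint.py",
        "--model_dir", model_dir,
        "--output_dir", ckpt_dir,
        "--tp_size", str(tp_size),
        "--pp_size", str(pp_size),
    ]
    # Step 3: checkpoint -> TensorRT engine (inference limits fixed here).
    build = [
        "trtllm-build",
        "--checkpoint_dir", ckpt_dir,
        "--output_dir", engine_dir,
        "--max_batch_size", str(max_batch_size),
        "--max_input_len", str(max_input_len),
        "--max_beam_width", str(max_beam_width),
    ]
    return convert, build


if __name__ == "__main__":
    for cmd in build_commands("./hf_model", "./trt_ckpt", "./trt_engine", tp_size=2):
        print(shlex.join(cmd))
```

Keeping the two commands separate mirrors the workflow: the checkpoint can be converted once and then rebuilt into engines with different batch or length limits.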

4. Application Development:
- Using C++/Python APIs, developers can build applications with these optimized engines.
- TensorRT-LLM comes with several built-in tools to help with secondary development:
  - **summarize.py** for text summarization
  - **mmlu.py** for accuracy testing
  - **run.py** for a dry run to verify the model
  - **benchmark** for benchmarking
- The runtime options include:
  - **Temperature** (for sampling)
  - **Top K** (for top K sampling)
  - **Top P** (for nucleus sampling)

This workflow outlines how to integrate and optimize models for efficient inference with TensorRT-LLM and leverage its tools for application development and performance testing.

NVIDIA AI 加速精讲堂-TensorRT-LLM 应用与部署_哔哩哔哩_bilibili
