当前位置：首页 > news >正文

【Triton教程】向量相加

news 2026/6/1 5:43:37

Triton 是一种用于并行编程的语言和编译器。它旨在提供一个基于 Python 的编程环境，以高效编写自定义 DNN 计算内核，并能够在现代 GPU 硬件上以最大吞吐量运行。

更多 Triton 中文文档可访问 →https://triton.hyper.ai/

在本教程中，你将使用 Triton 编写一个简单的向量相加 (vector addition) 程序。

你将了解：

Triton 的基本编程模型
用于定义 Triton 内核的 triton.jit 装饰器 (decorator)
验证和基准测试自定义算子与原生参考实现的最佳实践

计算内核

import torch
import triton
import triton.language as tl@triton.jit
def add_kernel(x_ptr,  # *Pointer* to first input vector. 指向第一个输入向量的指针。y_ptr,  # *Pointer* to second input vector. 指向第二个输入向量的指针。output_ptr,  # *Pointer* to output vector. 指向输出向量的指针。n_elements,  # Size of the vector. 向量的大小。BLOCK_SIZE: tl.constexpr,  # Number of elements each program should process. 每个程序应处理的元素数量。# NOTE: `constexpr` so it can be used as a shape value. 注意：`constexpr` 因此它可以用作形状值。):# There are multiple 'programs' processing different data. We identify which program# 有多个“程序”处理不同的数据。需要确定是哪一个程序：pid = tl.program_id(axis=0)  # We use a 1D launch grid so axis is 0. 使用 1D 启动网格，因此轴为 0。# This program will process inputs that are offset from the initial data.# 该程序将处理相对初始数据偏移的输入。# For instance, if you had a vector of length 256 and block_size of 64, the programs would each access the elements [0:64, 64:128, 128:192, 192:256].# 例如，如果有一个长度为 256, 块大小为 64 的向量，程序将各自访问 [0:64, 64:128, 128:192, 192:256] 的元素。# Note that offsets is a list of pointers:# 注意 offsets 是指针列表：block_start = pid * BLOCK_SIZEoffsets = block_start + tl.arange(0, BLOCK_SIZE)# Create a mask to guard memory operations against out-of-bounds accesses.# 创建掩码以防止内存操作超出边界访问。mask = offsets < n_elements# Load x and y from DRAM, masking out any extra elements in case the input is not a multiple of the block size.# 从 DRAM 加载 x 和 y，如果输入不是块大小的整数倍，则屏蔽掉任何多余的元素。x = tl.load(x_ptr + offsets, mask=mask)y = tl.load(y_ptr + offsets, mask=mask)output = x + y# Write x + y back to DRAM.# 将 x + y 写回 DRAM。tl.store(output_ptr + offsets, output, mask=mask)

创建一个辅助函数从而： (1) 生成 z 张量，(2) 用适当的 grid/block sizes 将上述内核加入队列：

def add(x: torch.Tensor, y: torch.Tensor):# We need to preallocate the output.# 需要预分配输出。output = torch.empty_like(x)assert x.is_cuda and y.is_cuda and output.is_cudan_elements = output.numel()# The SPMD launch grid denotes the number of kernel instances that run in parallel.# SPMD 启动网格表示并行运行的内核实例的数量。# It is analogous to CUDA launch grids. It can be either Tuple[int], or Callable(metaparameters) -> Tuple[int].# 它类似于 CUDA 启动网格。它可以是 Tuple[int]，也可以是 Callable(metaparameters) -> Tuple[int]。# In this case, we use a 1D grid where the size is the number of blocks:# 在这种情况下，使用 1D 网格，其中大小是块的数量：grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )# NOTE:# 注意：#  - Each torch.tensor object is implicitly converted into a pointer to its first element.#  - 每个 torch.tensor 对象都会隐式转换为其第一个元素的指针。#  - `triton.jit`'ed functions can be indexed with a launch grid to obtain a callable GPU kernel.#  - `triton.jit` 函数可以通过启动网格索引来获得可调用的 GPU 内核。#  - Don't forget to pass meta-parameters as keywords arguments.#  - 不要忘记以关键字参数传递元参数。add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)# We return a handle to z but, since `torch.cuda.synchronize()` hasn't been called, the kernel is still running asynchronously at this point.# 返回 z 的句柄，但由于 `torch.cuda.synchronize()` 尚未被调用，此时内核仍在异步运行。return output

使用上述函数计算两个 torch.tensor 对象的 element-wise sum，并测试其正确性：

torch.manual_seed(0)
size = 98432
x = torch.rand(size, device='cuda')
y = torch.rand(size, device='cuda')
output_torch = x + y
output_triton = add(x, y)
print(output_torch)
print(output_triton)
print(f'The maximum difference between torch and triton is 'f'{torch.max(torch.abs(output_torch - output_triton))}')

Out:

tensor([1.3713, 1.3076, 0.4940,  ..., 0.6724, 1.2141, 0.9733], device='cuda:0')tensor([1.3713, 1.3076, 0.4940,  ..., 0.6724, 1.2141, 0.9733], device='cuda:0')The maximum difference between torch and triton is 0.0

现在准备就绪。

基准测试

在 size 持续增长的向量上对自定义算子进行基准测试，从而比较其与 PyTorch 的性能差异。为了方便操作，Triton 提供了一系列内置工具，允许开发者简洁地绘制自定义算子在不同问题规模 (problem sizes) 下的的性能图。

@triton.testing.perf_report(triton.testing.Benchmark(x_names=['size'],  # Argument names to use as an x-axis for the plot. 用作绘图 x 轴的参数名称。x_vals=[2**i for i in range(12, 28, 1)],  # Different possible values for `x_name`. `x_name` 的不同可能值。x_log=True,  # x axis is logarithmic. x 轴为对数。line_arg='provider',  # Argument name whose value corresponds to a different line in the plot. 参数名称，其值对应于绘图中的不同线条。line_vals=['triton', 'torch'],  # Possible values for `line_arg`. `line_arg` 的可能值。line_names=['Triton', 'Torch'],  # Label name for the lines. 线条的标签名称。styles=[('blue', '-'), ('green', '-')],  # Line styles. 线条样式。ylabel='GB/s',  # Label name for the y-axis. y 轴标签名称。plot_name='vector-add-performance',  # Name for the plot. Used also as a file name for saving the plot. 绘图名称。也用作保存绘图的文件名。args={},  # Values for function arguments not in `x_names` and `y_name`. 不在 `x_names` 和 `y_name` 中的函数参数值。))
def benchmark(size, provider):x = torch.rand(size, device='cuda', dtype=torch.float32)y = torch.rand(size, device='cuda', dtype=torch.float32)quantiles = [0.5, 0.2, 0.8]if provider == 'torch':ms, min_ms, max_ms = triton.testing.do_bench(lambda: x + y, quantiles=quantiles)if provider == 'triton':ms, min_ms, max_ms = triton.testing.do_bench(lambda: add(x, y), quantiles=quantiles)gbps = lambda ms: 3 * x.numel() * x.element_size() / ms * 1e-6return gbps(ms), gbps(max_ms), gbps(min_ms)

运行上述装饰函数 (decorated function)。输入查看性能数据，输入 show_plots=True 绘制结果，以及/或者输入 save_path='/path/to/results/' 将其与原始 CSV 数据一起保存到磁盘：

benchmark.run(print_data=True, show_plots=True)

在这里插入图片描述

out:

在这里插入图片描述

Download Jupyter notebook: 01-vector-add.ipynb
Download Python source code: 01-vector-add.py
Download zipped: 01-vector-add.zip

【Triton教程】向量相加

计算内核

基准测试

相关文章：

【Triton教程】向量相加

关于CSS中毛玻璃和滤镜使用总结

陷入产出危机的我聊聊近况

HarmonyOS 开发知识总结

[WPF初学到大神] 1. 什么是WPF, MVVM框架, XAML?

matlab怎样自动搜索文件夹中的所有txt文件，并将每个txt文件中的数据存放到一个cell数组中——MATLAB批量处理数据

LabVIEW智能可变温循环PCT测试系统

SparkSQL整合Hive

Vue 3 和 Vue 2区别

React.memo和useMemo

Android中实现网络请求的方式有哪些？

安卓13usb触摸唤醒系统 android13触摸唤醒

c++常用库函数

CSS 网格布局

python实现屏幕录制，录音录制工具

elementui 的 table 组件回显已选数据时候使用toggleRowSelection 方法的坑点

MATLAB基础应用精讲-【数模应用】负二项回归（附R语言和python代码实现）

20240803 芯动科技笔试

如何将 ECharts 图表插入 HTML Canvas

突破干扰，无人机自动驾驶技术详解

新手也能懂的SSRF漏洞实战：用iwebsec靶场复现文件读取与内网探测

古戏台构件声学特性的时域有限差分方法【附模型】

金融合规审核为何人力堆积却仍漏洞百出？2026年RegTech演进与Agent全链路闭环解决方案

HarmonyOS ArkTS DateUtil 日期增减与日历计算完整指南

Owl-Alpha 新手快速上手指南

轻量化部署，异地机房快速接入，多机房管理不用再大动干戈

如何快速集成 react-native-bottom-sheet-behavior：5 分钟搞定 Android 底部弹窗

AWS DevOps Agent 完全指南

Hermes Agent工具如何自定义接入Taotoken提供商

5步完美解决Windows 10 PL2303驱动兼容性问题：完整实施方案指南