
CANN/cann-bench Quantized Matrix Multiplication Operator

article 2026/5/10 2:56:51

# QuantMatmul Operator API Description

cann-bench evaluates how well AI handles code tasks in the CANN domain, covering operator generation, operator optimization, and related areas. It supports model selection and training-effect evaluation, unifies quantitative evaluation standards, identifies gaps in agent capabilities, and builds a CANN-domain evaluation platform to drive the continued evolution of AI capabilities in this domain.

Project: https://gitcode.com/cann/cann-bench

## 1. Operator overview

Quantized matrix multiplication operator. It performs a quantized matmul with int8/int4 inputs and float16/bfloat16/int8/int32 outputs.

Main use cases:

- Linear layers in W8A8/W4A4 inference for large language models
- Quantized GEMM in KV cache quantization pipelines
- Static per-channel or dynamic per-token quantization schemes

Operator characteristics:

- Difficulty level: L3 (Contraction)
- Input: 2–6 dimensions, int8 or int4 data
- Output: float16/bfloat16/int8/int32
- Multi-level quantization: per-tensor / per-channel / per-group / per-token

## 2. Operator definition

### Formulas

No bias:

$$ out = x1 \mathbin{@} x2 * scale + offset $$

bias is int32:

$$ out = (x1 \mathbin{@} x2 + bias) * scale + offset $$

bias is bfloat16/float32 (no offset):

$$ out = x1 \mathbin{@} x2 * scale + bias $$

With pertoken_scale:

$$ out = (x1 \mathbin{@} x2 * scale + offset) * pertoken_scale $$

### Step-by-step description

1. Matrix multiply: `mm[..., m, n] = x1[..., m, k] @ x2[..., k, n]`. On hardware, int8 products are accumulated into int32.
2. int32 bias (pre-scale): if bias is int32, it is added to the accumulator first.
3. Dequantization scale: broadcast according to the shape of scale; per-tensor `[1]` and per-channel `[n]` are supported.
4. offset: offset adjustment applied after dequantization.
5. pertoken_scale (optional): per-token scaling broadcast along the m dimension.
6. Floating-point bias (post-scale): if bias is bf16/fp32 and there is no offset, it is added after dequantization.
7. cast: the result is cast to output_dtype.

## 3. Interface specification

### Operator prototype

```
cann_bench.quant_matmul(
    Tensor x1, Tensor x2, Tensor scale, *,
    Tensor? offset=None, Tensor? pertoken_scale=None, Tensor? bias=None,
    str? output_dtype=None, int[]? group_sizes=None,
) -> Tensor out
```

### Input parameters

| Parameter | Type | Shape | dtype | Description |
|----|----|----|----|----|
| x1 | Tensor (required) | [..., m, k], 2–6 dimensions | int8 / int32 | Left matrix. int32 denotes int4 data, with each int32 holding 8 int4 values |
| x2 | Tensor (required) | [..., k, n], 2–6 dimensions, last dimension ≤ 65535 | int8 / int32 | Right matrix; dtype must match x1 |
| scale | Tensor (required) | [t] (t = 1 or n), or 2D [ceil(k/group_k), n] | float32 / int64 / bfloat16 | Quantization scale factor |
| offset | Tensor (optional) | [t] (t = 1 or n), or 2D (same as scale) | float32 / float16 | Dequantization offset. Required when scale is 2D |
| pertoken_scale | Tensor (optional) | [m] | float32 | Per-token scale factor |
| bias | Tensor (optional) | [n] / [1, n] / [batch, 1, n] | int32 / bfloat16 / float16 / float32 | Bias term |
| output_dtype | str (optional) | - | - | Output dtype: int8 / float16 / bfloat16 / int32; defaults to int8 |
| group_sizes | int[] (optional) | - | - | Group quantization granularity [group_m, group_n, group_k] |

### Output

| Parameter | Shape | dtype | Description |
|----|----|----|----|
| out | [..., m, n] | Determined by output_dtype | Computation result |

### Data type combinations

Atlas inference series accelerator cards:

| x1 | x2 | scale | offset | bias | pertoken_scale | output_dtype |
|----|----|-------|--------|------|---------------|--------------|
| int8 | int8 | int64/float32 | None | int32/None | None | float16 |
| int8 | int8 | int64/float32 | float32/None | int32/None | None | int8 |

Atlas A2/A3 series:

| x1 | x2 | scale | offset | bias | pertoken_scale | output_dtype |
|----|----|-------|--------|------|---------------|--------------|
| int8 | int8 | int64/float32 | None | int32/None | None | float16 |
| int8 | int8 | int64/float32 | float32/None | int32/None | None | int8 |
| int8 | int8 | float32/bfloat16 | None | int32/bfloat16/float32/None | float32/None | bfloat16 |
| int8 | int8 | float32 | None | int32/bfloat16/float32/None | float32 | float16 |
| int32 | int32 | int64/float32 | None | int32/None | None | float16 |
| int32 | int32 | float32 | float16 | None | float32 | bfloat16/float16 |
| int8 | int8 | float32/bfloat16 | None | int32/None | None | int32 |

### Rules and constraints

- `x1.shape[-1] == x2.shape[-2] == k`
- `x2.shape[-1] ≤ 65535`
- When scale is 2D, offset is required and its shape must match scale
- bias: must be 1D when the output has 2/4/5/6 dimensions; may be 1D or 3D when the output has 3 dimensions
- int4 case: x1/x2 are int32, each int32 holds 8 int4 values, and the last dimension of the shape shrinks by a factor of 8
- group_sizes values must lie in [0, 65535]

## 4. Accuracy requirements

Verification follows the ecosystem operator accuracy standard.

### Error metrics

Mean relative error (MERE): the average relative error over the sampled points.

$$ \text{MERE} = \text{avg}(\frac{\text{abs}(actual - golden)}{\text{abs}(golden)+\text{1e-7}}) $$

Maximum relative error (MARE): the largest relative error over the sampled points.

$$ \text{MARE} = \max(\frac{\text{abs}(actual - golden)}{\text{abs}(golden)+\text{1e-7}}) $$

### Pass criteria

| Data type | FLOAT16 | BFLOAT16 | FLOAT32 | HiFLOAT32 | FLOAT8 E4M3 | FLOAT8 E5M2 |
|----|----|----|----|----|----|----|
| Threshold | 2^-10 | 2^-7 | 2^-13 | 2^-11 | 2^-3 | 2^-2 |

A result passes when MERE < Threshold and MARE < 10 * Threshold.

## 5. Reference golden code

```python
import torch
from typing import Optional, List


def quant_matmul(
    x1: torch.Tensor,
    x2: torch.Tensor,
    scale: torch.Tensor,
    offset: Optional[torch.Tensor] = None,
    pertoken_scale: Optional[torch.Tensor] = None,
    bias: Optional[torch.Tensor] = None,
    output_dtype: Optional[str] = None,
    group_sizes: Optional[List[int]] = None,
) -> torch.Tensor:
    """
    Quantized matrix multiplication, counterpart of torch_npu.npu_quant_matmul.

    Args:
        x1: [..., m, k] int8/int32 left matrix
        x2: [..., k, n] int8/int32 right matrix
        scale: [t] or 2D, dequantization scale
        offset: [t] or 2D, dequantization offset
        pertoken_scale: [m] per-token scale
        bias: [n] or [batch, 1, n] bias
        output_dtype: output type, defaults to int8
        group_sizes: group quantization granularity

    Returns:
        out: [..., m, n]
    """
    # Matrix multiply: int8 is computed in float32 as an equivalent
    mm = torch.matmul(x1.float(), x2.float())

    # int32 bias is accumulated before dequantization
    if bias is not None and bias.dtype == torch.int32:
        mm = mm + bias.float()

    # Dequantization scale
    y = mm * scale.float()

    # offset
    if offset is not None:
        y = y + offset.float()

    # pertoken_scale: broadcast along the m dimension
    if pertoken_scale is not None:
        y = y * pertoken_scale.float().unsqueeze(-1)

    # Floating-point bias (only when there is no offset)
    if bias is not None and bias.dtype != torch.int32 and offset is None:
        y = y + bias.float()

    # Output dtype
    if output_dtype is None or output_dtype == "int8":
        out_dtype = torch.int8
    elif output_dtype == "float16":
        out_dtype = torch.float16
    elif output_dtype == "bfloat16":
        out_dtype = torch.bfloat16
    elif output_dtype == "int32":
        out_dtype = torch.int32
    else:
        raise ValueError(f"unsupported output_dtype: {output_dtype}")

    return y.to(out_dtype)
```

## 6. Additional information

### Operator call example

```python
import torch
import cann_bench

# int8 input, float16 output, per-channel scale
x1 = torch.randint(-128, 127, (1024, 3584), dtype=torch.int8, device="npu")
x2 = torch.randint(-128, 127, (3584, 3584), dtype=torch.int8, device="npu")
scale = torch.rand(3584, dtype=torch.float32, device="npu") * 0.01
out = cann_bench.quant_matmul(x1, x2, scale, output_dtype="float16")

# With offset + int32 bias + pertoken_scale, bfloat16 output
x1 = torch.randint(-128, 127, (1024, 4096), dtype=torch.int8, device="npu")
x2 = torch.randint(-128, 127, (4096, 14336), dtype=torch.int8, device="npu")
scale = torch.rand(14336, dtype=torch.float32, device="npu") * 0.01
offset = torch.rand(14336, dtype=torch.float32, device="npu")
bias = torch.randint(-100, 100, (14336,), dtype=torch.int32, device="npu")
pertoken = torch.rand(1024, dtype=torch.float32, device="npu")
out = cann_bench.quant_matmul(x1, x2, scale, offset=offset,
                              pertoken_scale=pertoken, bias=bias,
                              output_dtype="bfloat16")
```

### Underlying CANN implementation

- aclnnQuantMatmulV4: basic quantized matrix multiplication
- aclnnQuantMatmulV5: A8W4 / A4W4 group quantization
- aclnnQuantMatmulWeightNz: weight NZ format optimization

Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.
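The pass criterion in section 4 (mean relative error below the threshold, maximum relative error below ten times the threshold) can be sketched in dependency-free Python as follows. This is only an illustrative sketch, not the checker that cann-bench itself ships: the helper names `mere_mare` and `passes` and the lowercase dtype keys are hypothetical, and the functions score every element they are given rather than a sampled subset.

```python
from typing import Dict, Sequence, Tuple

# Thresholds copied from the pass-criteria table in section 4.
# The lowercase keys are hypothetical labels chosen for this sketch.
THRESHOLDS: Dict[str, float] = {
    "float16": 2.0 ** -10,
    "bfloat16": 2.0 ** -7,
    "float32": 2.0 ** -13,
    "hifloat32": 2.0 ** -11,
    "float8_e4m3": 2.0 ** -3,
    "float8_e5m2": 2.0 ** -2,
}

EPS = 1e-7  # added to abs(golden) so a zero golden value does not divide by zero


def mere_mare(actual: Sequence[float], golden: Sequence[float]) -> Tuple[float, float]:
    """Return (MERE, MARE) over the given points."""
    if len(actual) != len(golden) or len(golden) == 0:
        raise ValueError("actual and golden must be non-empty and equal in length")
    rel = [abs(a - g) / (abs(g) + EPS) for a, g in zip(actual, golden)]
    return sum(rel) / len(rel), max(rel)


def passes(actual: Sequence[float], golden: Sequence[float], dtype: str) -> bool:
    """Apply the rule MERE < Threshold and MARE < 10 * Threshold."""
    threshold = THRESHOLDS[dtype]
    mere, mare = mere_mare(actual, golden)
    return mere < threshold and mare < 10 * threshold


if __name__ == "__main__":
    golden = [1.0, 2.0, 4.0]
    actual = [1.0005, 2.0, 4.0]
    print(passes(actual, golden, "float16"))  # True
    print(passes(actual, golden, "float32"))  # False
```

In the usage at the bottom, one element is off by a relative 5e-4, which gives a MERE of roughly 1.7e-4. That stays under the FLOAT16 threshold of 2^-10 (about 9.8e-4) but exceeds the FLOAT32 threshold of 2^-13 (about 1.2e-4), so the same data passes for one dtype and fails for the other.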
