
CANN / cann-learning-hub: Ascend C Operator Engineering Development Guide

article 2026/5/9 18:06:54

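cann-learning-hub is the CANN learning-center repository. It supports interactive online execution and learn-by-doing, and provides tutorials, examples, and optimization recipes to help Ascend developers get started quickly. Project address: https://gitcode.com/cann/cann-learning-hub

name: ascendc-ops-project
description: Ascend C operator engineering development skill. Covers the complete flow from project creation, build and packaging, and installation and deployment to aclnn API testing, including Tiling template programming, attributes, TBuf, and Workspace usage. Triggers: creating an operator project, operator packaging and installation, engineering development, aclnn API testing, Tiling template programming, attribute usage.

Ascend C Operator Engineering Development Guide

Trigger scenarios

- Creating a standard operator project from scratch
- Building, packaging, installing, and deploying an operator
- Guidance on the complete development flow (design → implementation → test → deployment)
- Test verification with the two-phase aclnn interface
- Tiling template programming, attributes, TBuf, and Workspace usage

⚠️ Iron rule: develop strictly according to the operator prototype

This is the most important rule. Violating it means the operator cannot be integrated into a network correctly.

What is an operator prototype

The operator prototype defines the operator's:

- Inputs: name, type, count
- Outputs: name, type, count
- Attributes: name, type, default value
- API signature

Rules that must be followed

| Rule | Explanation | Wrong example | Correct approach |
|---|---|---|---|
| An input must not become an attribute | Inputs are runtime data; attributes are compile-time constants | Turning the axis input into an attribute | Keep axis as an input and read it in the Tiling stage via ValueDepend |
| An attribute must not become an input | Attribute values are fixed at compile time and cannot change dynamically | Turning the dim attribute into an input tensor | Keep dim as an attribute and read it via GetAttrs() |
| Inputs and outputs must not be added or removed | They must match the prototype exactly | Deleting an input | Keep every input, even if it is unused |
| Data types must not be changed | dtype must match the prototype | Prototype says float16, implementation uses float | Implement the data types the prototype supports |
| Parameter order must not be changed | API parameter order is determined by the prototype | Swapping the order of inputs | Keep the order defined by the prototype |

Why strict compliance is required

- The API signature is determined by the prototype: the parameter order and types of aclnnOpGetWorkspaceSize are generated automatically from it.
- Network integration depends on the prototype: the framework passes arguments according to the prototype when it calls the operator.
- Test code depends on the prototype: test cases are written against the prototype API.

⚠️ Iron rule: understand the reference operator in depth

You must fully understand every behavior of the reference operator, including its special semantics.

Why this matters

Tasks usually name a reference operator (such as torch.histc or torch.nn.functional.softmax) and require its behavior to be reproduced exactly. An incomplete understanding of the reference operator leads to:

- Failing test cases
- Incorrectly handled edge cases
- Wrong default-value semantics

Key steps

1. Consult the official documentation. Do not rely on intuition for the meaning of a parameter, pay particular attention to special default-value semantics, and understand every edge case.
2. Understand what a default value really means. A default may be a flag rather than a valid value. Common patterns: value=0 means "compute automatically", value=-1 means "use the default behavior", value=None means an optional parameter.
3. Test the edge cases. Where possible, run the reference operator with special parameter values to check that your understanding is correct.

Worked cases

Case 1: default-value semantics of torch.histc

Task: implement a Histogram operator with torch.histc as the reference.

Wrong understanding:

    // ❌ Assumes min=0.0, max=0.0 is a valid range
    // and falls back to the default range [0.0, 1.0]

Correct understanding (from the PyTorch documentation):

    # torch.histc(input, bins=100, min=0, max=0)
    # If min == 0 and max == 0, the min and max of input are used automatically

Correct implementation:

    // ✅ Detect the special flag
    if (min_val == 0.0f && max_val == 0.0f) {
        // Compute the actual range of the data dynamically
        need_compute_range = true;
    }

    // In the Kernel
    if (need_compute_range) {
        // Traverse the data to compute min and max
        ComputeDataRange();
    }

Case 2: special meaning of axis=-1

Common pattern: axis=-1 means the last dimension.

    // ✅ Handle negative values correctly
    // GetInt returns a pointer to the attribute value, so dereference it
    int32_t axis = static_cast<int32_t>(*attrs->GetInt(0));
    int32_t rank = shape.GetDimNum();
    if (axis < 0) {
        axis = axis + rank;  // convert to a non-negative index
    }

Checklist

- Consult the official documentation of the reference operator
- Understand the meaning and default value of every parameter
- Identify special default-value semantics (is the default a flag?)
- Understand edge cases and special-value handling
- Verify your understanding against the reference operator (where possible)

Common error case

    // ❌ Wrong: treating the axis input as an attribute
    this->Attr("axis").Int(0);  // wrong

    // ✅ Correct: axis is an input; read it in the Tiling stage via ValueDepend
    // (every dtype entry is int64, as required by the restriction in section 2.1)
    this->Input("axis")
        .ParamType(REQUIRED)
        .DataType({ge::DT_INT64, ge::DT_INT64})
        .ValueDepend(REQUIRED, DependScope::TILING);  // correct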
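As a rough host-side illustration of Cases 1 and 2 above, the sketch below normalizes a negative axis and resolves the torch.histc range flag in plain C++. The helper names `NormalizeAxis` and `ResolveHistRange` are hypothetical (they are not part of CANN or PyTorch), and the range rule is only the documented torch.histc behavior: when min and max are both zero, the data's own minimum and maximum are used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: map an axis in [-rank, rank) to [0, rank).
inline int64_t NormalizeAxis(int64_t axis, int64_t rank) {
    assert(rank > 0 && axis >= -rank && axis < rank);
    return axis < 0 ? axis + rank : axis;
}

// Hypothetical helper: treat min == max == 0 as the "use the data range" flag,
// mirroring the documented default behavior of torch.histc.
inline std::pair<float, float> ResolveHistRange(const std::vector<float>& data,
                                                float min_val, float max_val) {
    if (min_val == 0.0f && max_val == 0.0f && !data.empty()) {
        auto mm = std::minmax_element(data.begin(), data.end());
        return {*mm.first, *mm.second};
    }
    return {min_val, max_val};  // an explicit range is used unchanged
}
```

Running the same checks on the host reference implementation and in the Tiling function keeps the test oracle and the operator in agreement on these special values.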
How to obtain the operator prototype

- Task description: take it from the task statement or requirements document
- Framework reference: use the prototype of the corresponding PyTorch/TensorFlow operator
- Existing implementation: use the prototype definition of a built-in CANN operator

Part 1: Quick start - the complete development flow

Development flow

Requirements analysis → prototype definition → project generation → Tiling design → Host implementation → Kernel implementation → build and install → test verification

Key steps at a glance

| Step | File | Key point |
|---|---|---|
| 1. Prototype definition | op.json | Every input and output lists the same number of dtypes |
| 2. Project generation | msopgen | Specify the target chip |
| 3. Tiling structure | *_tiling.h | Contains every parameter that has to be passed |
| 4. Host implementation | op_host/*.cpp | Tiling computation, attribute reading, ValueDepend |
| 5. Kernel implementation | op_kernel/*.cpp | Template class, dtype dispatch, multi-core split |
| 6. Build and install | build.sh | Check for build errors |
| 7. Test verification | test.cpp | Multiple shapes, dtypes, and attribute combinations |

Part 2: Common problems and solutions (important)

2.1 ValueDepend - reading an input tensor in the Tiling stage

Problem: when a parameter is an input tensor (such as axis) rather than an attribute, the Tiling stage cannot read its value unless ValueDepend is configured.

⚠️ Important restriction: every dtype entry of a ValueDepend input must be the same, and it must be one of float, int64, or bool.

Error message: ValueDepend input dtype must be the same and must be float, int64, or bool.

Solution:

    // ❌ Wrong: int32 is not supported by ValueDepend
    this->Input("axis")
        .DataType({ge::DT_INT32, ge::DT_INT32, ge::DT_INT32, ge::DT_INT32})
        .ValueDepend(REQUIRED, DependScope::TILING);  // build error

    // ❌ Wrong: dtype entries differ
    this->Input("axis")
        .DataType({ge::DT_INT32, ge::DT_INT64, ge::DT_INT32, ge::DT_INT64})
        .ValueDepend(REQUIRED, DependScope::TILING);  // build error

    // ✅ Correct: int64 everywhere
    this->Input("axis")
        .DataType({ge::DT_INT64, ge::DT_INT64, ge::DT_INT64, ge::DT_INT64})
        .ValueDepend(REQUIRED, DependScope::TILING);  // correct

Complete example:

    // Host-side operator registration
    this->Input("axis")
        .ParamType(REQUIRED)
        .DataType({ge::DT_INT64, ge::DT_INT64, ge::DT_INT64, ge::DT_INT64})
        .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
        .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
        .ValueDepend(REQUIRED, DependScope::TILING);

    // Reading the value inside the Tiling function
    auto axis_tensor = context->GetInputTensor(1);
    int32_t axis = 0;
    if (axis_tensor != nullptr) {
        const int64_t* data = axis_tensor->GetData<int64_t>();
        if (data != nullptr) {
            axis = static_cast<int32_t>(data[0]);
        }
    }

Note: configuring ValueDepend changes the API signature.

- Before: aclnnOpGetWorkspaceSize(x, axis_tensor, ...)
- After: aclnnOpGetWorkspaceSize(x, axis_intarray, ...)

2.2 Supporting multiple data types - template implementation

Problem: the Kernel does not support RTTI (virtual functions, inheritance), so runtime polymorphism is unavailable.

Solution: use a template plus a dtype parameter.

    // Add dtype to the Tiling data
    struct TilingData {
        uint32_t totalLength;
        int32_t dtype;  // data type
    };

    // Host side: set dtype
    auto input_dtype = context->GetInputDesc(0)->GetDataType();
    tiling->dtype = static_cast<int32_t>(input_dtype);

    // Kernel side: template class
    template<typename T>
    class KernelOp {
        // implemented in terms of T
    };

    // dtype values for reference:
    // DT_FLOAT = 0, DT_FLOAT16 = 1, DT_INT8 = 2, DT_INT32 = 3

    // Kernel entry function
    extern "C" __global__ __aicore__ void op_kernel(...) {
        int32_t dtype = tilingData.dtype;
        if (dtype == 0) {  // DT_FLOAT
            KernelOp<float> op;
            op.Init(...);
            op.Process();
        } else if (dtype == 1) {  // DT_FLOAT16
            KernelOp<half> op;
            op.Init(...);
            op.Process();
        }
    }

2.3 Float16 precision problem

Problem: half precision operation is not allowed in aicore function

Solution: convert to float32 for the computation.

    template<>
    __aicore__ inline void KernelOp<half>::ProcessForward(...) {
        float sum = 0.0f;  // accumulate in float
        for (...) {
            half value_half = xGm.GetValue(offset);
            float value = (float)value_half;  // convert
            sum = sum + value;
            yGm.SetValue(offset, (half)sum);  // write back as half
        }
    }

2.4 Multi-core split problem

Problem: results are wrong, or cores race, when the operator runs on multiple cores.

Solution:

    // Option 1: single-core mode (simple scenarios)
    uint32_t blockDim = 1;
    context->SetBlockDim(blockDim);

    // Option 2: a correct split (needs careful design)
    // Pass blockDim through the Tiling data
    tiling->blockDim = blockDim;

    // Use it in the Kernel
    uint32_t blockIdx_ = AscendC::GetBlockIdx();
    uint32_t outerPerCore = (outerLength + blockDim - 1) / blockDim;  // ceiling division
    uint32_t outerStart = blockIdx_ * outerPerCore;
    uint32_t outerEnd = (outerStart + outerPerCore < outerLength) ? (outerStart + outerPerCore) : outerLength;

2.5 Number of dtype entries in the prototype definition

Problem: msopgen requires every input to list the same number of type entries.

Solution (axis repeats int32 four times so that its count matches x):

    {
        "input_desc": [
            {"name": "x", "type": ["float16", "float", "int32", "int8"]},
            {"name": "axis", "type": ["int32", "int32", "int32", "int32"]}
        ]
    }

Part 3: Creating the operator project

Step 1: Create the operator prototype definition file

Basic example:

    [{
        "op": "AddCustom",
        "input_desc": [
            {"name": "x", "param_type": "required", "format": ["ND", "ND"], "type": ["float16", "float"]},
            {"name": "y", "param_type": "required", "format": ["ND", "ND"], "type": ["float16", "float"]}
        ],
        "output_desc": [
            {"name": "z", "param_type": "required", "format": ["ND", "ND"], "type": ["float16", "float"]}
        ]
    }]

Example with attributes:

    [{
        "op": "Clamp",
        "input_desc": [{"name": "x", "param_type": "required", "format": ["ND"], "type": ["float"]}],
        "output_desc": [{"name": "y", "param_type": "required", "format": ["ND"], "type": ["float"]}],
        "attr": [
            {"name": "min", "type": "float", "param_type": "optional", "default_value": 0},
            {"name": "max", "type": "float", "param_type": "optional", "default_value": 1}
        ]
    }]

Step 2: Create the operator project

    msopgen gen -i op.json -c ai_core-ascend910b -lan cpp -out custom_op

Step 3: Project directory structure

    custom_op/
    ├── op_host/
    │   ├── op.cpp            # Host-side implementation (Tiling, shape inference)
    │   └── CMakeLists.txt
    ├── op_kernel/
    │   ├── op_tiling.h       # Tiling data structure
    │   ├── op.cpp            # Kernel-side implementation
    │   └── CMakeLists.txt
    ├── CMakeLists.txt
    ├── CMakePresets.json
    └── build.sh

Part 4: Host-side implementation template

4.1 Complete Host code template

    #include "../op_kernel/op_tiling.h"
    #include "register/op_def_registry.h"
    #include "tiling/platform/platform_ascendc.h"

    namespace optiling {
    static ge::graphStatus TilingFunc(gert::TilingContext* context) {
        auto platform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
        TilingData *tiling = context->GetTilingData<TilingData>();

        // 1. Get the input shape
        const gert::StorageShape* x_shape = context->GetInputShape(0);
        auto shape_dims = x_shape->GetStorageShape();
        int32_t rank = shape_dims.GetDimNum();

        // 2. Compute the total number of elements
        uint32_t totalLength = 1;
        for (int i = 0; i < rank; i++) {
            totalLength *= shape_dims.GetDim(i);
        }

        // 3. Read attributes (if any)
        auto attrs = context->GetAttrs();
        bool attr1 = false;
        if (attrs != nullptr && attrs->GetBool(0) != nullptr) {
            attr1 = *attrs->GetBool(0);
        }

        // 4. Read input tensor data (only if ValueDepend is configured;
        //    the input is int64, see section 2.1)
        auto input_tensor = context->GetInputTensor(1);
        int32_t input_value = 0;
        if (input_tensor != nullptr) {
            const int64_t* data = input_tensor->GetData<int64_t>();
            if (data != nullptr) input_value = static_cast<int32_t>(data[0]);
        }

        // 5. Get the data type
        auto input_dtype = context->GetInputDesc(0)->GetDataType();

        // 6. Fill in the Tiling data
        tiling->totalLength = totalLength;
        tiling->dtype = static_cast<int32_t>(input_dtype);
        // ... other parameters

        // 7. Set the number of cores
        uint32_t coreNum = platform.GetCoreNum();
        uint32_t blockDim = 1;  // or derive it from the workload and coreNum
        context->SetBlockDim(blockDim);
        tiling->blockDim = blockDim;

        // 8. Set the workspace
        size_t *currentWorkspace = context->GetWorkspaceSizes(1);
        currentWorkspace[0] = 0;  // or the size that is needed

        return ge::GRAPH_SUCCESS;
    }
    }

    namespace ge {
    static ge::graphStatus InferShape(gert::InferShapeContext* context) {
        const gert::Shape* x_shape = context->GetInputShape(0);
        gert::Shape* y_shape = context->GetOutputShape(0);
        *y_shape = *x_shape;
        return GRAPH_SUCCESS;
    }

    static ge::graphStatus InferDataType(gert::InferDataTypeContext *context) {
        const auto inputDataType = context->GetInputDataType(0);
        context->SetOutputDataType(0, inputDataType);
        return ge::GRAPH_SUCCESS;
    }
    }

    namespace ops {
    class Op : public OpDef {
    public:
        explicit Op(const char* name) : OpDef(name) {
            this->Input("x")
                .ParamType(REQUIRED)
                .DataType({ge::DT_FLOAT16, ge::DT_FLOAT, ge::DT_INT32, ge::DT_INT8})
                .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
                .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});
            this->Output("y")
                .ParamType(REQUIRED)
                .DataType({ge::DT_FLOAT16, ge::DT_FLOAT, ge::DT_INT32, ge::DT_INT8})
                .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
                .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});

            // Optional attribute
            this->Attr("attr1").AttrType(OPTIONAL).Bool(false);

            this->SetInferShape(ge::InferShape).SetInferDataType(ge::InferDataType);
            this->AICore().SetTiling(optiling::TilingFunc);
            this->AICore().AddConfig("ascend910b");
        }
    };
    OP_ADD(Op);
    }

Part 5: Kernel-side implementation template

5.1 Complete Kernel code template

The template uses GetValue/SetValue on GlobalTensor to stay short. Part 12 lists these calls as APIs to avoid, so a production kernel should move data with DataCopyPad instead.

    #include "kernel_operator.h"
    #include "op_tiling.h"

    template<typename T>
    class KernelOp {
    public:
        __aicore__ inline KernelOp() {}

        __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, uint32_t totalLength, uint32_t blockDim) {
            this->totalLength = totalLength;
            this->blockDim = blockDim;
            blockIdx_ = AscendC::GetBlockIdx();
            xGm.SetGlobalBuffer((__gm__ T *)x, totalLength);
            yGm.SetGlobalBuffer((__gm__ T *)y, totalLength);
        }

        __aicore__ inline void Process() {
            uint32_t lengthPerCore = (totalLength + blockDim - 1) / blockDim;
            uint32_t start = blockIdx_ * lengthPerCore;
            uint32_t end = (start + lengthPerCore < totalLength) ? (start + lengthPerCore) : totalLength;
            for (uint32_t i = start; i < end; i++) {
                T value = xGm.GetValue(i);
                // computation logic
                yGm.SetValue(i, value);
            }
        }

    private:
        AscendC::GlobalTensor<T> xGm;
        AscendC::GlobalTensor<T> yGm;
        uint32_t totalLength;
        uint32_t blockDim;
        uint32_t blockIdx_;
    };

    // Float16 specialization (needs float32 intermediate computation)
    template<>
    __aicore__ inline void KernelOp<half>::Process() {
        uint32_t lengthPerCore = (totalLength + blockDim - 1) / blockDim;
        uint32_t start = blockIdx_ * lengthPerCore;
        uint32_t end = (start + lengthPerCore < totalLength) ? (start + lengthPerCore) : totalLength;
        for (uint32_t i = start; i < end; i++) {
            half value_half = xGm.GetValue(i);
            float value = (float)value_half;
            // compute in float
            yGm.SetValue(i, (half)value);
        }
    }

    extern "C" __global__ __aicore__ void op_kernel(GM_ADDR x, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling) {
        REGISTER_TILING_DEFAULT(TilingData);
        GET_TILING_DATA_WITH_STRUCT(TilingData, tilingData, tiling);

        int32_t dtype = tilingData.dtype;
        if (dtype == 0) {  // DT_FLOAT
            KernelOp<float> op;
            op.Init(x, y, tilingData.totalLength, tilingData.blockDim);
            op.Process();
        } else if (dtype == 1) {  // DT_FLOAT16
            KernelOp<half> op;
            op.Init(x, y, tilingData.totalLength, tilingData.blockDim);
            op.Process();
        } else if (dtype == 3) {  // DT_INT32
            KernelOp<int32_t> op;
            op.Init(x, y, tilingData.totalLength, tilingData.blockDim);
            op.Process();
        } else if (dtype == 2) {  // DT_INT8
            KernelOp<int8_t> op;
            op.Init(x, y, tilingData.totalLength, tilingData.blockDim);
            op.Process();
        }
    }

Part 6: Test code template

6.1 Complete test code skeleton

    #include <iostream>
    #include <string>
    #include <vector>
    #include <cmath>
    #include "acl/acl.h"
    #include "aclnn/acl_meta.h"
    #include "aclnn_op.h"

    // Evaluate the call once, then report and bail out on failure
    #define CHECK_ACL(ret) do { \
        aclError _ret = (ret); \
        if (_ret != ACL_SUCCESS) { \
            std::cerr << "ACL error: " << _ret << " at line " << __LINE__ << std::endl; \
            return false; \
        } \
    } while(0)

    // Reference implementation
    template<typename T>
    std::vector<T> ref_impl(const std::vector<T>& input, /* params */) {
        std::vector<T> output(input.size());
        // implement the reference algorithm
        return output;
    }

    // Result comparison
    template<typename T>
    bool compare_result(const std::vector<T>& output, const std::vector<T>& expected,
                        float rtol = 1e-3, float atol = 1e-3) {
        for (size_t i = 0; i < output.size(); i++) {
            float diff = std::abs((float)output[i] - (float)expected[i]);
            float tolerance = atol + rtol * std::abs((float)expected[i]);
            if (diff > tolerance) {
                std::cerr << "Mismatch at index " << i << ": got " << (float)output[i]
                          << ", expected " << (float)expected[i] << std::endl;
                return false;
            }
        }
        return true;
    }

    // Helper functions
    std::vector<int64_t> compute_stride(const std::vector<int64_t>& shape) {
        std::vector<int64_t> stride(shape.size());
        stride[shape.size() - 1] = 1;
        for (int i = shape.size() - 2; i >= 0; i--) {
            stride[i] = stride[i + 1] * shape[i + 1];
        }
        return stride;
    }

    int64_t compute_total(const std::vector<int64_t>& shape) {
        int64_t total = 1;
        for (auto d : shape) total *= d;
        return total;
    }

    // Test case
    bool test_case(const std::vector<int64_t>& shape, int64_t axis, bool exclusive, bool reverse,
                   const std::string& name) {
        int64_t total = compute_total(shape);

        // Prepare data
        std::vector<float> input(total);
        for (int64_t i = 0; i < total; i++) input[i] = (float)i;
        auto expected = ref_impl(input, axis, exclusive, reverse, shape);

        // Allocate device memory
        float* x_dev = nullptr;
        float* y_dev = nullptr;
        CHECK_ACL(aclrtMalloc((void**)&x_dev, total * sizeof(float), ACL_MEM_MALLOC_HUGE_FIRST));
        CHECK_ACL(aclrtMalloc((void**)&y_dev, total * sizeof(float), ACL_MEM_MALLOC_HUGE_FIRST));
        CHECK_ACL(aclrtMemcpy(x_dev, total * sizeof(float), input.data(), total * sizeof(float),
                              ACL_MEMCPY_HOST_TO_DEVICE));

        // Create tensors
        auto stride = compute_stride(shape);
        aclTensor* x_tensor = aclCreateTensor(shape.data(), shape.size(), ACL_FLOAT, stride.data(), 0,
                                              ACL_FORMAT_ND, shape.data(), shape.size(), x_dev);
        aclIntArray* axis_array = aclCreateIntArray(&axis, 1);
        aclTensor* y_tensor = aclCreateTensor(shape.data(), shape.size(), ACL_FLOAT, stride.data(), 0,
                                              ACL_FORMAT_ND, shape.data(), shape.size(), y_dev);

        // Call the operator (two-phase aclnn interface)
        uint64_t workspaceSize = 0;
        aclOpExecutor* executor = nullptr;
        CHECK_ACL(aclnnOpGetWorkspaceSize(x_tensor, axis_array, exclusive, reverse, y_tensor,
                                          &workspaceSize, &executor));
        void* workspace = nullptr;
        if (workspaceSize > 0) {
            CHECK_ACL(aclrtMalloc(&workspace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST));
        }
        aclrtStream stream = nullptr;
        CHECK_ACL(aclrtCreateStream(&stream));
        CHECK_ACL(aclnnOp(workspace, workspaceSize, executor, stream));
        CHECK_ACL(aclrtSynchronizeStream(stream));

        // Fetch the result
        std::vector<float> output(total);
        CHECK_ACL(aclrtMemcpy(output.data(), total * sizeof(float), y_dev, total * sizeof(float),
                              ACL_MEMCPY_DEVICE_TO_HOST));

        // Verify
        bool pass = compare_result(output, expected);
        std::cout << name << ": " << (pass ? "✓" : "✗") << std::endl;

        // Clean up
        if (workspace) CHECK_ACL(aclrtFree(workspace));
        CHECK_ACL(aclrtDestroyStream(stream));
        CHECK_ACL(aclrtFree(x_dev));
        CHECK_ACL(aclrtFree(y_dev));
        return pass;
    }

    int main() {
        CHECK_ACL(aclInit(nullptr));
        CHECK_ACL(aclrtSetDevice(0));
        aclrtContext context;
        CHECK_ACL(aclrtCreateContext(&context, 0));
        CHECK_ACL(aclrtSetCurrentContext(context));

        int passed = 0, failed = 0;

        // Different shapes
        if (test_case({2, 3, 4}, 1, false, false, "basic")) passed++; else failed++;
        if (test_case({128, 256}, 1, false, false, "large_2d")) passed++; else failed++;
        if (test_case({16, 32, 64}, 1, false, false, "large_3d")) passed++; else failed++;

        // Different axis values
        if (test_case({2, 3, 4}, 0, false, false, "axis0")) passed++; else failed++;
        if (test_case({2, 3, 4}, -1, false, false, "axis_neg")) passed++; else failed++;

        // Different attributes
        if (test_case({2, 3, 4}, 1, true, false, "exclusive")) passed++; else failed++;
        if (test_case({2, 3, 4}, 1, false, true, "reverse")) passed++; else failed++;

        CHECK_ACL(aclrtDestroyContext(context));
        CHECK_ACL(aclrtResetDevice(0));
        CHECK_ACL(aclFinalize());

        std::cout << "Passed: " << passed << ", failed: " << failed << std::endl;
        return failed == 0 ? 0 : -1;
    }

Part 7: Build, package, install, and test

7.1 Build and install commands

    # Build
    cd custom_op
    bash build.sh

    # Install
    ./build_out/custom_opp_ubuntu_aarch64.run --install-path=${HOME} --quiet

    # Set up the environment
    source ${HOME}/vendors/customize/bin/set_env.bash

7.2 Test build commands

    source ${HOME}/vendors/customize/bin/set_env.bash

    g++ -o test_op test_op.cpp \
        -I${ASCEND_HOME_PATH}/include \
        -I${ASCEND_HOME_PATH}/include/acl \
        -I${HOME}/vendors/customize/op_api/include \
        -L${ASCEND_HOME_PATH}/aarch64-linux/lib64 \
        -L${HOME}/vendors/customize/op_api/lib \
        -L${HOME}/vendors/customize/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64 \
        -lascendcl -lnnopbase -lcust_opapi -lcust_opmaster_rt2.0 \
        -std=c++17 \
        -Wl,-rpath,${ASCEND_HOME_PATH}/aarch64-linux/lib64 \
        -Wl,-rpath,${HOME}/vendors/customize/op_api/lib \
        -Wl,-rpath,${HOME}/vendors/customize/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64

    ./test_op

Part 8: Tiling template programming (advanced)

8.1 What is a TilingKey

A TilingKey distinguishes the implementation branches of a kernel. At build time, each TilingKey produces its own operator binary (om file).

8.2 Tiling template file example

op_kernel/tiling_key_op.h:

    #ifndef TILING_KEY_OP_H
    #define TILING_KEY_OP_H

    #include "ascendc/host_api/tiling/template_argument.h"

    ASCENDC_TPL_ARGS_DECL(Op,
        ASCENDC_TPL_DATATYPE_DECL(D_T_X, C_DT_FLOAT, C_DT_FLOAT16, ASCENDC_TPL_INPUT(0)),
        ASCENDC_TPL_UINT_DECL(TILE_NUM, ASCENDC_TPL_8_BW, ASCENDC_TPL_UI_MIX, 1, 2, 4, 8),
    );

    ASCENDC_TPL_SEL(
        ASCENDC_TPL_ARGS_SEL(ASCENDC_TPL_DATATYPE_SEL(D_T_X, C_DT_FLOAT)),
        ASCENDC_TPL_ARGS_SEL(ASCENDC_TPL_DATATYPE_SEL(D_T_X, C_DT_FLOAT16)),
    );

    #endif

Part 9: Workspace and TBuf

9.1 Setting the Workspace

    // Host side
    auto platform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
    size_t userWorkspaceSize = 256 * 4;
    size_t systemWorkspaceSize = platform.GetLibApiWorkSpaceSize();
    size_t *currentWorkspace = context->GetWorkspaceSizes(1);
    currentWorkspace[0] = userWorkspaceSize + systemWorkspaceSize;

9.2 Using TBuf

    class KernelOp {
    public:
        __aicore__ inline void Init() {
            pipe.InitBuffer(tmpBuf, 256 * sizeof(float));
        }

        __aicore__ inline void Compute() {
            AscendC::LocalTensor<float> tmp = tmpBuf.Get<float>();
            // use tmp; do not call EnQue/DeQue/FreeTensor on it
        }

    private:
        AscendC::TPipe pipe;
        AscendC::TBuf<AscendC::TPosition::VECCALC> tmpBuf;
    };

Part 10: Data type enum values

| Enum name | Value | C type |
|---|---|---|
| DT_FLOAT | 0 | float |
| DT_FLOAT16 | 1 | half |
| DT_INT8 | 2 | int8_t |
| DT_INT32 | 3 | int32_t |
| DT_UINT8 | 4 | uint8_t |
| DT_INT64 | 9 | int64_t |
| DT_BOOL | 12 | bool |

Part 11: Checklist

⚠️ Operator prototype compliance (most important)

- Number of inputs/outputs matches the prototype
- Names of inputs/outputs match the prototype
- Data types match the prototype
- Attribute names, types, and default values match the prototype
- No input has been turned into an attribute
- No attribute has been turned into an input
- API call parameter order is correct

Project creation

- The operator prototype json file is correct
- The number of dtype entries is consistent
- The project directory structure is complete

Host side

- The Tiling data structure contains every parameter
- ValueDepend is configured correctly (if needed)
- Attribute indices are correct
- blockDim is set sensibly

Kernel side

- Templates are used to support multiple data types
- Float16 uses float32 intermediate computation
- The multi-core split logic is correct
- No virtual functions or inheritance

Test verification

- Multiple shapes tested
- Multiple dtypes tested
- Multiple attribute combinations tested
- Edge cases tested

Part 12: API best practices at a glance

See references/api_best_practices.md for details.

Key points

- API blacklist: GlobalTensor::SetValue() and GetValue() are forbidden
- DataCopy vs DataCopyPad: prefer DataCopyPad, which handles unaligned data automatically
- TBuf vs TQue: use TQue for MTE data movement and TBuf for pure computation
- Double Buffer: set it through the num parameter of InitBuffer; it is unrelated to the template depth
- repeatTime limit: the type is uint8_t, so the maximum is 255; larger counts must be processed in batches

Part 13: Tiling design at a glance

See references/tiling_design.md for details.

Operator categories

| Category | Characteristics | Typical operators | Tiling complexity |
|---|---|---|---|
| Elementwise | Same shape, each element independent | Sin, Cos, Abs | ⭐ Simple |
| Reduction | Reduces along an axis | Softmax, LayerNorm | ⭐⭐⭐ Complex |
| Broadcast | Different shapes, needs broadcasting | Add, Mul | ⭐⭐ Medium |

General design elements (required for every operator)

- Multi-core split: distribute the work across multiple AI Cores
- UB split: the amount of data processed per pass (A2/A3: 192KB)
- Buffer planning: input, output, and intermediate buffers
- Branch coverage: dtype, shape, alignment, boundaries

Part 14: Precision verification standard at a glance

See references/precision_standard.md for details.

Choosing a precision standard

- Floating-point computation: MERE/MARE + Threshold (most common)
- Integer computation: exact match
- Non-computational operators: exact match

Floating-point precision thresholds

| Data type | Threshold | Value |
|---|---|---|
| FLOAT16 | 2^-10 | 0.000977 |
| FLOAT32 | 2^-13 | 0.000122 |
| BFLOAT16 | 2^-7 | 0.00781 |

Pass criteria: MERE < Threshold and MARE < 10 * Threshold.

Reference resources

Internal documents

- API best practices quick reference - API usage rules, blacklist, common errors
- Tiling design quick reference - operator categories, design elements, templates
- Precision verification standard quick reference - precision standards, verification methods, test design
- CMake configuration guide - CMakeLists.txt and CMakePresets.json configuration

Complete examples

- AddCustom example - simple elementwise operator
- Clamp example - operator with attributes

Code templates

- templates/op_host_template.cpp - Host-side implementation template
- templates/op_kernel_template.cpp - Kernel-side implementation template
- templates/tiling_header_template.h - Tiling data structure template
- templates/test_template.cpp - test code template
- templates/build.sh.template - build script template
- templates/CMakeLists.txt.template - CMakeLists.txt template
- templates/CMakePresets.json.template - CMakePresets.json template

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
