CANN/asc-devkit HCCL Algorithm Analyzer Guide

article 2026/5/22 9:34:50

Algorithm Analyzer User Guide

asc-devkit: This project is the operator programming language that CANN provides for Ascend AI processors. It natively supports the C and C++ standards, consists mainly of a class library and a language extension layer, and offers multi-level APIs for operator development across different scenarios. Project address: https://gitcode.com/cann/asc-devkit

Tool Introduction

The HCCL algorithm analyzer simulates HCCL algorithm execution in an offline environment. It verifies algorithm logic and memory operations without real hardware, so developers can run test tasks quickly.

Principle Introduction

Key Points:

- The algorithm analyzer stubs the dependencies (hcomm and runtime interfaces) of the HCCL single operator execution flow. During algorithm execution, it captures the Task sequences from all ranks.
- It organizes the Task information from all ranks into a directed acyclic graph.
- It runs validations based on graph algorithms, namely memory read-write conflict validation and semantic validation.
- Memory conflict validation uses the synchronization relationships in the graph to determine whether potential read-write conflicts exist.
- Semantic validation simulates Task graph execution and records data transfer information. After the simulation completes, it checks whether the data transfer information in UserOutput memory meets the operator requirements.

Environment Preparation

Follow the environment preparation, source code download, compilation, and installation steps in Source Code Build to prepare for algorithm analyzer compilation.

Test Case Writing

LLT Test Case Overview

An algorithm checker test case consists of 5 steps:

1. Simulation model initialization
2. Operator parameter settings
3. Operator execution flow
4. Result graph validation
5. Resource cleanup

The following sections describe how to write each step to accommodate different operator requirements, and then explain how to use the checker tool for issue diagnosis.

LLT Test Case Step Details

Simulation Model Initialization

TopoMeta Structure Introduction

The checker uses TopoMeta to represent a topology.
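TopoMeta is described in this guide as a three-layer vector: cards inside a server, servers inside a super node, and super nodes inside the cluster. The sketch below models that nesting with plain `std::vector` aliases. The underlying integer type and the helpers `MakeSymmetricTopo` and `CountCards` are assumptions made for illustration; the real `GenTopoMeta` signature is not shown in this guide, and the device numbering used here (restarting at 0 in every server) is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed aliases: the guide only states that TopoMeta is a three-layer vector.
using PhyDeviceId  = std::uint32_t;              // physical ID of one NPU
using ServerMeta   = std::vector<PhyDeviceId>;   // cards in one server
using SuperPodMeta = std::vector<ServerMeta>;    // servers in one super node
using TopoMeta     = std::vector<SuperPodMeta>;  // super nodes in the cluster

// Hypothetical stand-in for GenTopoMeta: builds a symmetric topology.
// Assumption: device IDs restart at 0 inside every server.
TopoMeta MakeSymmetricTopo(std::size_t superPods, std::size_t serversPerPod,
                           std::size_t cardsPerServer)
{
    ServerMeta server;
    for (std::size_t card = 0; card < cardsPerServer; ++card) {
        server.push_back(static_cast<PhyDeviceId>(card));
    }
    return TopoMeta(superPods, SuperPodMeta(serversPerPod, server));
}

// Total number of cards in the topology; rankSize has to match this value.
std::size_t CountCards(const TopoMeta &topo)
{
    std::size_t total = 0;
    for (const SuperPodMeta &pod : topo) {
        for (const ServerMeta &server : pod) {
            total += server.size();
        }
    }
    return total;
}
```

An asymmetric topology can be written out by hand in the same representation, for example `TopoMeta asym = {SuperPodMeta{ServerMeta{0, 1, 2, 3}, ServerMeta{0, 1}}};` for one super node with a 4-card server and a 2-card server.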
TopoMeta is a three-layer vector structure:

- PhyDeviceId represents the physical ID of an NPU.
- ServerMeta consists of PhyDeviceIds and represents the cards in a server and their PhyDeviceIds.
- SuperPodMeta consists of ServerMetas and represents the servers that form a super node.
- TopoMeta consists of SuperPodMetas and represents the overall topology of the cluster.

TopoMeta Generation Methods

There are two ways to generate TopoMeta:

- Specify the number of super nodes, servers, and cards per server, then call the provided GenTopoMeta function. This applies to symmetric topology scenarios.
- Fully customize the super nodes, servers, and card counts. This applies to both symmetric and asymmetric topology scenarios.

Model Initialization

Pass in the generated TopoMeta and specify the device type for simulation.

Operator Parameter Settings

Operator Execution Parameters

Using Scatter as an example, set the input parameters needed to execute and validate the HcclScatter operator:

- root: The root node. The Scatter operation distributes data from the root node evenly to the Ranks in the communication domain.
- rankSize: The number of Ranks participating in collective communication in this communication domain. It must match the number of cards in topoMeta.
- recvCount: The number of data elements each Rank receives from the root node.
- dataType: The data type of the elements counted by recvCount.

For other operators or custom operator scenarios, set parameters according to the operator requirements.

Set Environment Variables

Environment variables affect judgment logic in the code.
Use the setenv function to set the required conditions before test case execution.

Important Notes

- Supported operators: Currently only the scatter operator is supported.
- Supported modes: Currently only OPBASE single operator mode is supported.
- Supported device types: Currently only DEV_TYPE_910B and DEV_TYPE_91093 (which represents DEV_TYPE_910C) are supported.

Operator Execution Flow

Run the single operator flow in a multi-threaded manner, with each thread simulating one Rank:

1. Construct operator input parameters. Construct the parameters required for single operator execution, including:
   - SetDevice: Binds a thread to a Rank so that each thread simulates a corresponding Rank.
   - Main stream resource creation: Call the aclrtCreateStream interface. A stub implementation simulates stream resource creation.
   - Communication domain initialization: Call HcclCommInitClusterInfo. A stub implementation simulates communication domain creation.
   - Input/output memory allocation: Call aclrtMalloc. A stub implementation simulates memory creation and marks memory types. Users must calculate the required memory in bytes from the operator type, element count, and data type.
2. Operator dispatch. Call the HcclScatter operator and pass in the parameters constructed above. For custom operator scenarios, replace this call with the custom operator API and adjust the operator parameters to match the custom operator requirements.
3. Communication domain destruction. Call the HcclCommDestroy interface to destroy the communication domain.

Result Graph Validation

Get the Task queue from all Ranks and call the corresponding operator validation function. For the Scatter operator, call CheckScatter and pass in the Task queue and the parameters required for Scatter operator validation.
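The memory allocation step and the Scatter validation both rest on the same size arithmetic. The following is a minimal sketch of that arithmetic, not the CheckScatter implementation. The `ScatterParams` struct and the three helper functions are hypothetical, and the layout is an assumption: the root's input holds one slice per Rank (including the root itself), stored in Rank order.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter bundle; field names follow the parameter list in this guide.
struct ScatterParams {
    std::uint32_t root;       // root Rank
    std::uint32_t rankSize;   // number of Ranks in the communication domain
    std::uint64_t recvCount;  // elements each Rank receives
    std::uint64_t elemBytes;  // bytes per element of dataType, e.g. 4 for a 32-bit type
};

// Bytes each Rank needs in its output buffer.
std::uint64_t RecvBytes(const ScatterParams &p)
{
    return p.recvCount * p.elemBytes;
}

// Bytes the root needs in its input buffer.
// Assumption: one slice per Rank, the root's own slice included.
std::uint64_t RootInputBytes(const ScatterParams &p)
{
    return RecvBytes(p) * p.rankSize;
}

// Offset inside the root's input that should end up in `rank`'s output.
// Assumption: slices are laid out in Rank order.
std::uint64_t ExpectedSrcOffset(const ScatterParams &p, std::uint32_t rank)
{
    return RecvBytes(p) * rank;
}
```

With 4 Ranks, `recvCount = 800`, and 4-byte elements, each Rank receives 3200 bytes, the root input needs 12800 bytes, and Rank 3 expects the data that starts at offset 9600 of the root input.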
The gtest framework reports the test as passed or failed based on the return value of the validation function.

Resource Cleanup

The final step of a test case is to clean up the simulation model resources so that they do not interfere with the next test case.

Test Case Filtering and Debugging

When there are many test cases and you only need to execute one, modify the test case name in main.cc.

Test Case Compilation and Execution

Compile and execute algorithm analyzer test cases:

    # Enter algorithm analyzer directory /hccl/test/st/algorithm
    cd ./hccl/test/st/algorithm
    # Compile test cases and automatically execute
    bash build.sh

Result Example

(Figure: test case execution results.)

The meaning of each field:

- [run]: The test case being executed for validation.
- [OK]: Execution succeeded and validation passed.
- [FAIL]: Execution failed. Analyze the specific reason based on the console logs.

Issue Diagnosis

Memory Conflict Validation Diagnosis Method

Issue Phenomenon

A memory conflict occurs when a memory region between two synchronization signals is written concurrently by multiple tasks, or is written while being read. In actual runtime environments, this typically shows up as randomly occurring precision issues.

Under the current Mesh structure, false positives may occur if a Reduce operator exists. Under a Mesh structure, a memory block may be written by several other cards at the same time within one synchronization interval. Hardware ensures the atomicity of Reduce operations, so no precision issue occurs at runtime. From the checker's perspective, however, multiple write operations hit the same memory between two synchronizations, so it flags an error.

Except for the above scenario, the following error indicates a memory conflict risk in task scheduling:

    [1]there is memory use confilict in two SliceMemoryStatus
    [2]one is startAddr is 0, size is 3200, status is WRITE.
    [3]another is startAddr is 0, size is 3200, status is WRITE.
    [4]failed to check memory BufferType::OUTPUT_CCL
    [5]memory conflict between node [rankId:1, queueId:0, index:1] and node [rankId:2, queueId:0, index:1]
    [6]check rank memory conflict failed for rank 0

- Lines 2 and 3 give the start address (startAddr), size, and read/write status (status) of the two conflicting memory blocks. status has two values: READ means the memory block is being read, and WRITE means it is being written. READ and WRITE describe how the memory is accessed, not the task type: memory blocks that may be in READ status include the src of a localcopy task, a read task, or a write task, and memory blocks that may be in WRITE status include the dst of a localcopy task, a read task, or a write task.
- Line 4 gives the type of the conflicting memory block.
- Line 5 identifies the two tasks that caused the memory conflict.
- Line 6 gives the rank on which the memory conflict occurred.

The above error log indicates that two tasks are simultaneously writing to the range 0-3200 of OUTPUT_CCL memory.

Diagnosis Method

Based on the error log, find the two tasks that caused the memory conflict and investigate the synchronization scheduling before and after them. For the log in Issue Phenomenon, these are the tasks at [rankId:1, queueId:0, index:1] and [rankId:2, queueId:0, index:1], which both write to the range 0-3200 of OUTPUT_CCL memory on rank 0.

Semantic Validation Failure Diagnosis Method

Semantic Validation Basic Concepts

The algorithm analyzer uses relative addresses to represent memory. An address consists of three fields (memory type, offset address, and size) and is represented by the DataSlice class:

    class DataSlice {
    public:
        // Some method functions
    private:
        BufferType type;
        u64 offset;
        u64 size;
    };

Supported memory types include Input, Output, and CCL.

Collective communication algorithms involve complex data transfer and reduction operations during execution.
The algorithm analyzer uses BufferSemantic to record data transfer relationships. A BufferSemantic contains one destination memory expression and one or more source memory expressions. The destination memory is represented by the member variables startAddr and size. The source memory is represented by the SrcBufDes struct. Both are defined as follows:

    struct SrcBufDes {
        RankId rankId;        // Source rankId
        BufferType bufType;   // Source memory type
        mutable u64 srcAddr;  // Offset address relative to source memory type
    };

    struct BufferSemantic {
        u64 startAddr;                        // Destination offset address
        mutable u64 size;                     // Size, source and destination memory share the same size
        mutable bool isReduce;                // Whether reduction is performed, true when srcBufs has multiple entries
        mutable HcclReduceOp reduceType;      // Type of reduction operation
        mutable std::set<SrcBufDes> srcBufs;  // Which rank(s) this data comes from
    };

Semantic Calculation Example

The following example explains what semantic calculation is.

Initial state: There are two Ranks, Rank0 and Rank1, each with two memory types, Input and Output.

State one action: Transfer the data block at offset address 20 with size 30 in rank0's Input to offset address 35 in rank0's Output. Result: A semantic block covering Output offsets 35 to 65 is generated on rank0's Output, recording this transfer.

State two action: Transfer the data block at offset address 70 with size 15 in rank1's Input to offset address 50 in rank0's Output. Result: The destination range (offsets 50 to 65) overlaps the existing semantic block, so the existing block is split. Two semantic blocks remain: offsets 35 to 50 still come from rank0's Input at offset 20, and offsets 50 to 65 now come from rank1's Input at offset 70.

Result Validation

Semantic analysis generates many semantic blocks during execution, each recording a data transfer relationship. After execution completes, the analyzer validates whether the semantic blocks in Output memory meet expectations.

The following example uses 2-rank AllGather to illustrate normal and abnormal scenarios for semantic blocks in Rank0's Output memory.
Assume the input data size is 100 bytes.

Correct scenario: Rank0's Output holds 200 bytes, where offsets 0 to 100 come from Rank0's Input and offsets 100 to 200 come from Rank1's Input.

Error scenario: The semantic blocks in Rank0's Output deviate from this layout in one of the ways listed under Diagnosis Approach.

Diagnosis Approach

The semantic validation phase can detect two types of errors:

- Missing data.
- Incorrect data source.

Reduction scenarios have analogous issues, such as a rank missing from the reduction or inconsistent offset addresses among the data being reduced. When a semantic error occurs, the analyzer normally prints hints. Combine these hints with the task sequence printed by the algorithm analyzer to locate the cause.
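The semantic calculation and result validation described in this guide can be sketched in a few lines. This is a simplified model under stated assumptions, not the analyzer's implementation: `Block`, `SemanticMap`, `RecordCopy`, and `CheckAllGather` are hypothetical names, only one Output buffer is tracked, every source is assumed to be an Input buffer, and reduction is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Simplified stand-in for one semantic block in a single Output buffer.
struct Block {
    std::uint64_t size;
    int srcRank;              // rank whose Input the data came from
    std::uint64_t srcOffset;  // offset inside that Input
};

// Destination start offset -> block. Blocks never overlap.
using SemanticMap = std::map<std::uint64_t, Block>;

// Record a copy into [dst, dst + size), trimming or splitting overlapped blocks.
void RecordCopy(SemanticMap &blocks, std::uint64_t dst, std::uint64_t size,
                int srcRank, std::uint64_t srcOffset)
{
    const std::uint64_t end = dst + size;
    SemanticMap kept;
    for (const auto &entry : blocks) {
        const std::uint64_t start = entry.first;
        const Block &b = entry.second;
        const std::uint64_t bEnd = start + b.size;
        if (bEnd <= dst || start >= end) {  // no overlap: keep the whole block
            kept[start] = b;
            continue;
        }
        if (start < dst) {                  // part left of the new range survives
            kept[start] = Block{dst - start, b.srcRank, b.srcOffset};
        }
        if (bEnd > end) {                   // part right of the new range survives
            kept[end] = Block{bEnd - end, b.srcRank, b.srcOffset + (end - start)};
        }
    }
    kept[dst] = Block{size, srcRank, srcOffset};
    blocks = std::move(kept);
}

// AllGather expectation: Output is [0, rankSize * bytes) with no gaps, and the
// byte at offset o comes from rank (o / bytes), Input offset (o % bytes).
bool CheckAllGather(const SemanticMap &blocks, int rankSize, std::uint64_t bytes)
{
    std::uint64_t cursor = 0;
    for (const auto &entry : blocks) {
        const std::uint64_t start = entry.first;
        const Block &b = entry.second;
        if (start != cursor) return false;  // gap: missing data
        const std::uint64_t inSlice = start % bytes;
        if (b.srcRank != static_cast<int>(start / bytes)) return false;        // wrong source rank
        if (b.srcOffset != inSlice || inSlice + b.size > bytes) return false;  // wrong source offset
        cursor = start + b.size;
    }
    return cursor == static_cast<std::uint64_t>(rankSize) * bytes;
}
```

Replaying the two actions from the semantic calculation example, `RecordCopy(out, 35, 30, 0, 20)` followed by `RecordCopy(out, 50, 15, 1, 70)`, leaves two blocks at Output offsets 35 and 50. For the 2-rank AllGather with 100-byte inputs, `CheckAllGather` accepts the layout with Rank0's data at offset 0 and Rank1's data at offset 100, and rejects an Output with the second half missing or filled from the wrong rank.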
