
CANN/asc-devkit HCCL Algorithm Analyzer Guide

Algorithm Analyzer User Guide

asc-devkit is the operator programming language that CANN provides for Ascend AI processors. It natively supports the C and C++ standards, consists mainly of a class library and a language extension layer, and offers multi-level APIs for operator development across varied scenarios. Project address: https://gitcode.com/cann/asc-devkit

Tool Introduction

The HCCL algorithm analyzer simulates HCCL algorithm execution in an offline environment. It verifies algorithm logic and memory operations, and executes test tasks efficiently to meet developer requirements.

Principle Introduction

Key points:

- The algorithm analyzer stubs the dependencies (hcomm and runtime interfaces) of the HCCL single-operator execution flow. During algorithm execution, it captures the Task sequences from all ranks.
- It organizes the Task information from all ranks into a directed acyclic graph.
- It performs validations based on graph algorithms, such as memory read-write conflict validation and semantic validation.
  - Memory conflict validation uses the synchronization recorded in the graph to analyze whether potential read-write conflicts exist.
  - Semantic validation simulates Task graph execution and records data transfer information. After the simulation completes, it checks whether the data transfer information in UserOutput memory meets the operator requirements.

Environment Preparation

Follow the environment preparation, source code download, compilation, and installation steps in Source Code Build to prepare for algorithm analyzer compilation.

Test Case Writing

LLT Test Case Overview

An algorithm analyzer test case consists of five steps: simulation model initialization, operator parameter settings, operator execution, result graph validation, and resource cleanup. The following sections describe how to write each step to accommodate different operator requirements, and then explain how to use the checker tool for issue diagnosis.

LLT Test Case Step Details

Simulation Model Initialization

TopoMeta Structure Introduction

The checker uses TopoMeta to represent a topology.
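As a rough picture of that representation, the sketch below models the nesting covered in this section with plain C++ containers. The alias definitions are assumptions (the guide only states that TopoMeta is a three-layer vector), and MakeSymmetricTopo and CountCards are hypothetical helpers written for illustration; they are not the provided GenTopoMeta function, whose signature is not shown in this guide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed definitions: the guide describes TopoMeta only as a three-layer vector.
using PhyDeviceId  = std::uint32_t;              // physical ID of one NPU
using ServerMeta   = std::vector<PhyDeviceId>;   // cards in one server
using SuperPodMeta = std::vector<ServerMeta>;    // servers in one super node
using TopoMeta     = std::vector<SuperPodMeta>;  // the whole cluster

// Hypothetical helper (not the real GenTopoMeta): builds a symmetric topology.
// Device IDs restart at 0 inside every server, which is an assumption.
TopoMeta MakeSymmetricTopo(std::size_t superPods, std::size_t serversPerPod,
                           std::size_t cardsPerServer) {
    ServerMeta server;
    for (std::size_t i = 0; i < cardsPerServer; ++i) {
        server.push_back(static_cast<PhyDeviceId>(i));
    }
    // Every super node gets the same servers; every server gets the same cards.
    return TopoMeta(superPods, SuperPodMeta(serversPerPod, server));
}

// Hypothetical helper: total number of cards across the topology.
std::size_t CountCards(const TopoMeta& topo) {
    std::size_t count = 0;
    for (const SuperPodMeta& pod : topo) {
        for (const ServerMeta& server : pod) {
            count += server.size();
        }
    }
    return count;
}
```

With these aliases, an asymmetric topology is simply a hand-written nested vector, for example one super node whose two servers hold two and four cards. A card count computed this way is also a convenient value to compare against rankSize.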
TopoMeta is a three-layer vector structure:

- PhyDeviceId represents the physical ID of an NPU.
- ServerMeta consists of PhyDeviceIds and represents the cards in a server and their corresponding PhyDeviceIds.
- SuperPodMeta consists of ServerMetas and represents the servers that form a super node.
- TopoMeta represents the overall topology of the cluster.

TopoMeta Generation Methods

There are two ways to generate TopoMeta:

- Specify the number of super nodes, servers, and cards per server, then use the provided GenTopoMeta function to generate it. This applies to symmetric topology scenarios.
- Fully customize the super nodes, servers, and card counts. This applies to both symmetric and asymmetric topology scenarios.

Model Initialization

Pass in the generated TopoMeta and specify the device type to simulate.

Operator Parameter Settings

Operator Execution Parameters

Using Scatter as an example, set the input parameters needed to execute and validate the HcclScatter operator:

- root: The root node. The Scatter operation distributes data from the root node evenly to the other Ranks in the communication domain.
- rankSize: The number of Ranks participating in collective communication in this communication domain (must be consistent with the number of cards in topoMeta).
- recvCount: The amount of data each Rank receives from the root node.
- dataType: The data type corresponding to recvCount.

For other operators or custom operator scenarios, set the parameters according to the operator requirements.

Set Environment Variables

Environment variables affect judgment logic in the code.
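One way to keep such settings from leaking between test cases is a small scope guard around the POSIX setenv and unsetenv calls. The sketch below is only an illustration: ScopedEnv is a hypothetical helper that is not part of the analyzer, and any variable name passed to it here is a placeholder rather than a variable the analyzer is known to read.

```cpp
#include <cassert>
#include <stdlib.h>  // POSIX setenv / unsetenv
#include <string>
#include <utility>

// Hypothetical helper: sets an environment variable for the lifetime of the
// guard and restores the previous state when the guard goes out of scope.
class ScopedEnv {
public:
    ScopedEnv(std::string name, const std::string& value) : name_(std::move(name)) {
        if (const char* old = ::getenv(name_.c_str())) {
            hadPrevious_ = true;
            previous_ = old;  // remember the old value so it can be restored
        }
        ::setenv(name_.c_str(), value.c_str(), /*overwrite=*/1);
    }

    ~ScopedEnv() {
        if (hadPrevious_) {
            ::setenv(name_.c_str(), previous_.c_str(), /*overwrite=*/1);
        } else {
            ::unsetenv(name_.c_str());
        }
    }

    ScopedEnv(const ScopedEnv&) = delete;
    ScopedEnv& operator=(const ScopedEnv&) = delete;

private:
    std::string name_;
    bool hadPrevious_ = false;
    std::string previous_;
};
```

Declaring a ScopedEnv at the top of a test body sets the variable before the operator runs and undoes the change when the test returns, whether it passes or fails.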
Use the setenv function to set the required conditions before the test case runs.

Important Notes

- Supported operators: Currently only the scatter operator is supported.
- Supported modes: Currently only OPBASE single-operator mode is supported.
- Supported device types: Currently only DEV_TYPE_910B and DEV_TYPE_91093 (which represents DEV_TYPE_910C) are supported.

Operator Execution Flow

Run the single-operator flow in a multi-threaded manner:

1. Construct operator input parameters. Construct the parameters required for single-operator execution, including:
   - SetDevice: Binds a thread to a Rank so that each thread simulates a corresponding Rank.
   - Main stream resource creation: Call the aclrtCreateStream interface; a stub implementation simulates stream resource creation.
   - Communication domain initialization: Call HcclCommInitClusterInfo; a stub implementation simulates communication domain creation.
   - Input/output memory allocation: Call aclrtMalloc; a stub implementation simulates memory creation and marks memory types. Users must calculate the required memory in bytes based on operator type, quantity, and data type.
2. Operator dispatch. Call the HcclScatter operator and pass in the parameters constructed above. For custom operator scenarios, replace this call with the custom operator API and modify the operator parameters to match the custom operator requirements.
3. Communication domain destruction. Call the HcclCommDestroy interface to destroy the communication domain.

Result Graph Validation

Get the Task queue from all Ranks and call the corresponding operator validation function. For the Scatter operator, call CheckScatter and pass in the Task queue and the parameters required for Scatter operator validation.
The gtest framework prints the outcome based on the return value of the validation function.

Resource Cleanup

The final step of a test case is to clean up the simulation model resources so that they do not interfere with the next test case.

Test Case Filtering and Debugging

When there are many test cases and you only need to execute one, modify the test case name in main.cc.

Test Case Compilation and Execution

Compile and execute the algorithm analyzer test cases:

    # Enter the algorithm analyzer directory /hccl/test/st/algorithm
    cd ./hccl/test/st/algorithm
    # Compile the test cases and execute them automatically
    bash build.sh

Result Example

The test case execution output uses the following fields:

- [run]: The test case is being executed for validation.
- [OK]: Execution succeeded and validation passed.
- [FAIL]: Execution failed. Analyze the specific reason based on the console logs.

Issue Diagnosis

Memory Conflict Validation Diagnosis Method

Issue Phenomenon

Memory conflicts occur when a memory region between two synchronization signals is written concurrently by multiple tasks, or is written while being read. In actual runtime environments, this typically manifests as randomly occurring precision issues.

Under the current Mesh structure, false positives may occur if a Reduce operator exists. Under the Mesh structure, a memory block may be written by several other cards within one synchronization interval. Hardware ensures the atomicity of Reduce operations, so no precision issues occur in actual runtime. From the checker's perspective, however, multiple read-write operations on the same memory between two synchronizations are detected, so the case is flagged as an error.

Except for the above scenario, the following error indicates a memory conflict risk in task scheduling:

    [1]there is memory use confilict in two SliceMemoryStatus
    [2]one is startAddr is 0, size is 3200, status is WRITE.
    [3]another is startAddr is 0, size is 3200, status is WRITE.
    [4]failed to check memory BufferType::OUTPUT_CCL
    [5]memory conflict between node [rankId:1, queueId:0, index:1] and node [rankId:2, queueId:0, index:1]
    [6]check rank memory conflict failed for rank 0

- Lines 2 and 3 give the start address (startAddr), size, and read/write status (status) of the two conflicting memory blocks. status has two states: READ means the memory block is being read, and WRITE means it is being written. READ and WRITE describe abstract memory access semantics, not merely whether the task is a read task or a write task. Memory blocks that may be in READ status include: localcopy task src, read task src, write task src. Memory blocks that may be in WRITE status include: localcopy task dst, read task dst, write task dst.
- Line 4 gives the type of the conflicting memory block.
- Line 5 identifies the two tasks that caused the memory conflict.
- Line 6 gives the rank on which the memory conflict occurred.

The above error log therefore indicates that two tasks are simultaneously writing to the 3200 bytes starting at address 0 of the OUTPUT_CCL memory.

Diagnosis Method

Based on the error log, find the two tasks that caused the memory conflict and investigate the synchronization scheduling before and after them. For the log in Issue Phenomenon, the tasks to inspect are node [rankId:1, queueId:0, index:1] and node [rankId:2, queueId:0, index:1], which both write to the same OUTPUT_CCL range on rank 0.

Semantic Validation Failure Diagnosis Method

Semantic Validation Basic Concepts

The algorithm analyzer uses relative addresses to represent memory. A relative address is composed of three fields (memory type, offset address, and size) and is represented by the DataSlice class:

    class DataSlice {
    public:
        // Some method functions
    private:
        BufferType type;
        u64 offset;
        u64 size;
    };

Memory supports types such as Input, Output, and CCL.

Collective communication algorithms involve complex data transfer and reduction operations during execution.
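Returning briefly to the memory conflict rule from Issue Phenomenon, it can be expressed compactly over address ranges of this kind. The sketch below is an illustration under stated assumptions, not the analyzer's implementation: SliceAccess is a hypothetical record modelled on the fields in the error log, the BufferType enumerators other than OUTPUT_CCL are placeholders, and the function assumes both accesses target the same rank's memory and are not ordered by any synchronization.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder subset of buffer types; only OUTPUT_CCL appears in the log above.
enum class BufferType { INPUT, OUTPUT, INPUT_CCL, OUTPUT_CCL };
enum class Status { READ, WRITE };

// Hypothetical record of one memory access, modelled on the log fields.
struct SliceAccess {
    BufferType type;
    std::uint64_t startAddr;
    std::uint64_t size;
    Status status;
};

// Returns true when two unsynchronized accesses conflict: they touch the same
// buffer type, their byte ranges overlap, and at least one of them is a WRITE.
bool Conflicts(const SliceAccess& a, const SliceAccess& b) {
    if (a.type != b.type) {
        return false;
    }
    // Half-open ranges [start, start + size) overlap if each starts before the other ends.
    const bool overlap = a.startAddr < b.startAddr + b.size &&
                         b.startAddr < a.startAddr + a.size;
    const bool anyWrite = a.status == Status::WRITE || b.status == Status::WRITE;
    return overlap && anyWrite;
}
```

Applied to the two blocks in the example log (both OUTPUT_CCL, startAddr 0, size 3200, WRITE), the function reports a conflict, while two overlapping READ accesses do not conflict.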
The algorithm analyzer uses BufferSemantic to record data transfer relationships. A BufferSemantic includes one destination memory expression and multiple source memory expressions. The destination memory is represented by the member variables startAddr and size. Each source memory is represented by a SrcBufDes struct. The two structs are defined as follows:

    struct SrcBufDes {
        RankId rankId;        // Source rankId
        BufferType bufType;   // Source memory type
        mutable u64 srcAddr;  // Offset address relative to source memory type
    };

    struct BufferSemantic {
        u64 startAddr;
        mutable u64 size;                     // Size; source and destination memory share the same size
        mutable bool isReduce;                // Whether reduction is performed; true when srcBufs has multiple entries
        mutable HcclReduceOp reduceType;      // Type of reduction operation
        mutable std::set<SrcBufDes> srcBufs;  // Which rank(s) this data comes from
    };

Semantic Calculation Example

The following example explains what semantic calculation is.

- Initial state: There are two Ranks, Rank0 and Rank1, each with two memory types, Input and Output.
- State one. Action: Transfer the data block at offset 20 with size 30 in rank0's Input to offset 35 in rank0's Output. Result: A semantic block is generated on rank0's Output, recording this transfer information.
- State two. Action: Transfer the data block at offset 70 with size 15 in rank1's Input to offset 50 in rank0's Output. Result: The destination memory overlaps the existing semantic block, so the existing block must be split. Two semantic blocks result: Output offsets 35 to 50 still come from rank0's Input at offset 20, and Output offsets 50 to 65 now come from rank1's Input at offset 70.

Result Validation

During semantic analysis, many semantic blocks are generated (recording many data transfer relationships). After execution completes, validate whether the semantic blocks in Output memory meet expectations.

The following example uses a 2-rank AllGather to illustrate normal and abnormal scenarios for the semantic blocks in Rank0's Output memory.
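Before turning to the scenarios, here is a rough sketch of how such an expectation could be checked. It is not the analyzer's code: OutputBlock is an assumed simplification of a semantic block (one source, no reduction), CheckAllGatherOutput is a hypothetical function, and the rule it encodes is the standard AllGather layout in which the r-th segment of every rank's Output holds rank r's Input.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Assumed simplification of a semantic block: exactly one source, no reduction.
struct OutputBlock {
    std::uint64_t startAddr;  // offset inside this rank's Output
    std::uint64_t size;       // length in bytes
    int srcRank;              // rank whose Input supplied the data
    std::uint64_t srcAddr;    // offset inside the source rank's Input
};

// Hypothetical checker. The blocks must be sorted by startAddr. Output bytes
// [r * inputSize, (r + 1) * inputSize) must come from rank r's Input at the
// same relative offset. Returns "" on success, otherwise the first problem.
std::string CheckAllGatherOutput(const std::vector<OutputBlock>& blocks,
                                 int rankSize, std::uint64_t inputSize) {
    assert(rankSize > 0 && inputSize > 0);
    const std::uint64_t ranks = static_cast<std::uint64_t>(rankSize);
    std::uint64_t covered = 0;  // Output bytes [0, covered) are verified so far
    for (const OutputBlock& b : blocks) {
        if (b.startAddr > covered) {
            return "missing data at offset " + std::to_string(covered);
        }
        if (b.startAddr < covered) {
            return "overlapping blocks at offset " + std::to_string(b.startAddr);
        }
        const std::uint64_t expectedRank = b.startAddr / inputSize;
        const std::uint64_t expectedSrc = b.startAddr % inputSize;
        const bool sourceOk = expectedRank < ranks &&
                              static_cast<std::uint64_t>(b.srcRank) == expectedRank &&
                              b.srcAddr == expectedSrc &&
                              expectedSrc + b.size <= inputSize;  // stays inside one rank's segment
        if (!sourceOk) {
            return "incorrect data source at offset " + std::to_string(b.startAddr);
        }
        covered += b.size;
    }
    if (covered != ranks * inputSize) {
        return "missing data at offset " + std::to_string(covered);
    }
    return "";
}
```

For two ranks with 100-byte inputs, a passing Rank0 Output consists of bytes 0 to 100 sourced from rank 0's Input and bytes 100 to 200 sourced from rank 1's Input. Dropping the second block yields a missing-data report, and sourcing it from rank 0 yields an incorrect-source report.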
In this example, the input data size is 100 bytes.

Correct Scenario: [figure]

Error Scenario: [figure]

Diagnosis Approach

The semantic validation phase can detect two types of errors:

- Missing data.
- Incorrect data source.

Reduction scenarios have analogous issues, such as ranks missing from a reduction or inconsistent offset addresses among the data being reduced. When a semantic error occurs, the analyzer normally provides hints. Combine these hints with the task sequence printed by the algorithm analyzer to analyze the specific cause.

Disclosure: Parts of this article were generated with AI assistance (AIGC) and are for reference only.
