
CANNBot Simulator V2 Reference Documentation

Simulator V2 Reference

[Free download link] cannbot-skills: CANNBot is a family of agents for CANN development that aims to improve development efficiency; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when the question is specifically about how simulator execution works now. Do not use it as a replacement for kernel-authoring or general architecture docs.

Goal

Capture the current simulator execution path so future work does not rely on removed or stale `easyasc/simulator/` assumptions.

1. Current default

The repository's simulator path is now the V2 runtime.

Current behavior:

- `OpExec(..., simulator=True)` enables simulator execution
- `OpExec(..., simulator="v2")` is an accepted spelling for the same path
- `OpExec(..., simulator="legacy")` is still accepted by `OpExec`, but it does not select a separate old runtime anymore; it still routes to V2
- `KernelBase.run_sim()` always calls `_run_sim_v2()`

Practical rule:

- do not document or debug a separate `easyasc/simulator/` runtime as if it were still active

2. How a kernel becomes a V2 program

The simulator build entry lives in `easyasc/kernelbase/kernelbase.py`.

The selection order is:

1. custom builder via `kernel._simulator_v2_program_builder`
2. prebuilt program via `kernel._simulator_v2_program`
3. auto analysis + auto bridge selection

Auto bridge selection:

- if the instruction stream contains control-flow, topology queries, `call_micro`, `VarList`, or cross-lane sync helpers, V2 uses `easyasc/simulator_v2/compat/control_flow_bridge.py`
- otherwise V2 uses the narrow linear bridge in `easyasc/simulator_v2/compat/kernel_bridge.py`

Important difference:

- `control_flow_bridge.py` preserves loops/conditionals and defers resolution to the runtime
- `kernel_bridge.py` only covers a narrower linear lowered-instruction subset

3. Runtime stack

The runtime is split across these layers:

- parent coordinator: `easyasc/simulator_v2/runtime/global_runtime.py`
- core process wrapper: `easyasc/simulator_v2/runtime/core_process.py`
- per-core runtime: `easyasc/simulator_v2/runtime/core_runtime.py`
- lane-level control interpreter: `easyasc/simulator_v2/runtime/control_actor.py`
- pipe worker threads: `easyasc/simulator_v2/runtime/pipe_worker.py`
- pipe executors: `easyasc/simulator_v2/ops/`

Execution shape:

- one parent `GlobalRuntime`
- one child `CoreProcess` per simulated core
- inside each active core, one `ControlActor` per active lane
- inside each lane, one threaded `PipeWorker` per logical pipe

Launch rule:

- start simulator repros from a real `.py` file, not from `stdin` entry points such as `python - <<'PY'` or piped scripts
- V2 uses multiprocessing during startup, and Python spawn must be able to re-import `__main__` from a real filesystem path; `stdin` entry points appear as `<stdin>` and break child startup
- when the launcher lives outside the repo root, include the repo root in `PYTHONPATH` so child processes can import local modules consistently
- safe pattern: `PYTHONPATH=/abs/path/to/repo python /tmp/repro.py`

Completion / shutdown facts:

- pipe workers already stop through mailbox sentinels; the thread layer does not need a special end instruction
- parent / child completion now uses a one-shot status channel that the parent polls while joining
- `GlobalRuntime.run()` uses one global execution deadline across all active cores, not a full timeout budget per core in sequence

4. Planning and activation

Core and lane activation are resolved by:

- `easyasc/simulator_v2/config.py`
- `easyasc/simulator_v2/runtime/execution_plan.py`
- `easyasc/simulator_v2/helpers.py`

Key facts:

- default core count follows the active device family (`950` -> 32, `b3` -> 20)
- V2 can skip inactive lanes when a program only uses a subset of cube/vec lanes
- collective ops (`allcube_*`, `allvec_*`) affect lane-activation planning

5. Memory and tensor state

Shared tensor setup lives in:

- `easyasc/simulator_v2/memory/shared_tensor.py`
- `easyasc/simulator_v2/memory/shared_tensor_store.py`
- `easyasc/simulator_v2/memory/tensor_view.py`
- `easyasc/simulator_v2/memory/workspace.py`
- `easyasc/simulator_v2/memory/local_memory.py`

Important facts:

- `OpExec` clones input tensors into `GMTensor.data`
- V2 copies that payload into the shared runtime tensor store before execution
- after execution, V2 copies runtime tensors back into the bound `GMTensor.data`
- workspaces and local buffers are represented as shared-tensor specs in program metadata
- child-core local tensors now go through a bank-aware allocator (`UB0` / `UB1` / `L1` / `L0A` / `L0B` / `L0C`); over-capacity local allocations fail before pipe execution starts
- runtime-created local slice snapshots must treat a root local tensor's `SharedTensorSpec.storage_offset` as allocator bookkeeping in bytes rather than as an extra in-storage element offset; only nested local views should re-apply a parent `storage_offset` when `control_actor.py` materializes a dynamic slice
- simulator-side GM `atomic_add` / `atomic_max` / `atomic_min` now serialize their read-modify-write sections through a shared store-wide atomic lock so cross-core atomic writebacks do not lose updates under contention

Regression note:

- `testcases/simulator/memory/test_simulator_v2_slice_tensor.py` covers the sliced-UB vec-mul case where several prefix UB allocations push the sliced root tensor onto a non-zero local bank offset before runtime snapshotting

6. Sync and control

The main sync/control pieces are:

- intra-core sync: `easyasc/simulator_v2/sync/intra_core_sync.py`
- collective sync: `easyasc/simulator_v2/sync/collective_sync.py`
- lane-local flags: `easyasc/simulator_v2/sync/local_flags.py`
- lane-local events: `easyasc/simulator_v2/sync/local_events.py`
- worker mailboxes: `easyasc/simulator_v2/sync/mailbox.py`

Important facts:

- collective sync state is process-shared at runtime; `GlobalRuntime` snapshots the parent `CollectiveSync` and each child core reloads that shared state instead of creating a private per-process coordinator
- lane-local `barrier(pipe=...)` currently has special runtime behavior only for `barrier(ALL)`; non-`ALL` barriers are preserved as control instructions but act as no-ops in the V2 runtime main loop
- practical consequence for kernel debugging: `bar_v()` / `bar_mte2()` / other single-pipe barriers do not serialize cross-pipe edges such as `V -> MTE2` on the simulator path; when a repro needs a simulator-visible local drain across pipe domains, use `bar_all()`
- `setflag` / `waitflag` still use the phase-based `LocalFlagTable`, but local `SEvent` / `DEvent` no longer do: V2 now models them with a per-lane flag bank keyed by `(src_pipe, dst_pipe, flag_id)` and a bool value per flag
- `create_sevent` allocates one `flag_id` from the lane-local pool for its `(src_pipe, dst_pipe)` pair; `create_devent` allocates two consecutive ids from that same pair-local pool
- `SEvent.set()` sets its single flag to `1` and errors if it is already `1`; `SEvent.wait()` blocks until that flag becomes `1`, then clears it back to `0`
- `DEvent` keeps two independent bool flags plus separate `set_count` / `wait_count` cursors: the producer-side `set` path alternates `flag0, flag1, flag0, ...`, and the consumer-side `wait` path alternates on its own cursor over the same two flags
- `event_setall` is modeled as repeated `set()` calls on the same event object rather than as a special bulk primitive; for `DEvent` that usually means setting both flags in rotation order, while `SEvent.setall()` will replay `set()` twice and therefore errors on the second call if the single flag is still set
- `event_release` is modeled as repeated `wait()` calls: `SEvent.release()` performs one wait, while `DEvent.release()` performs one wait and then performs a second wait only when a second outstanding token is already pending on the other rotated flag
- practical consequence for trace/timing work: local event blocking must now be reasoned about per real `flag_id`, not per `event_name`
- regression coverage: `testcases/simulator/bridge/test_simulator_v2_control_flow.py`

When debugging a hang:

- inspect the original failing lane error first
- then inspect the sync state / timeout diagnostic
- do not assume the timeout itself is the root cause

When a child core raises an exception:

- `GlobalRuntime.run()` now raises the combined per-core traceback text directly
- do not rely on a generic parent-side wrapper message; the actionable failure should already be in the thrown exception string
- pipe-worker instruction failures now print an immediate `stderr` log with `lane/pipe/opname/error`, control-side `wait_*` paths poll worker failures while waiting, and `CoreRuntime.join()` prefers surfacing the more actionable worker/task failure over a secondary sync-timeout symptom when multiple lane actors fail

7. Trace path

Trace recording lives in:

- `easyasc/simulator_v2/trace/recorder.py`
- `easyasc/simulator_v2/trace/merge.py`
- `easyasc/simulator_v2/trace/chrome.py`
- a5 cycle-model profile and estimators: `easyasc/simulator_v2/timing/`

Runtime flow:

- each core records its own events
- parent runtime merges them after execution
- `dump_chrome_trace(...)` exports Chrome/Perfetto-style JSON
- runtime event timestamps originate from `time.monotonic()`
- exported Chrome traces normalize those timestamps into a per-run relative axis instead of replacing them with event-order indices
- exported `dur` now reflects measured task/wait spans when the runtime recorded them; zero-duration control markers still use a tiny fallback width only to stay visible in viewers
- sync-heavy kernels may now emit explicit `sync` trace events for wait/ready phases in addition to pipe execution events
- on a5 (`device_type` 950), the runtime can now switch trace timing to a cycle-model domain driven by the JSON profile under `timing/`; in that mode `easyasc_time_domain="cycle"` is exported in the trace payload and task args include the modeling breakdown
- current a5 cycle-model defaults treat one ordinary V-pipe instruction as `2` cycles
- for `call_micro` / `vf()` timing, register <-> UB shuffle instructions are counted as `0` cycle: `micro_ub2reg`, `micro_reg2ub`, `micro_ub2regcont`, `micro_reg2ubcont`
- in cycle-model mode, direct control-side waits (`event_wait`, `wait_vec`, `wait_cube`, collective waits) now advance the control actor's cycle cursor, but `event_set` no longer acts as a lane-global block for later unrelated pipe dispatch; its ready time is derived from the completed source pipe, and unrelated pipes can start as soon as their own event dependencies are satisfied
- lane-local `event_wait` / `event_release` can now be lowered into the destination pipe worker queue, so the blocking happens on that pipe thread instead of only on the control actor; `event_set` / `event_setall` intentionally stay control-side because their position in the instruction stream still defines autosync lifetime boundaries
- trace export now consults `globvars.trace_event` (default `False`): when disabled, all sync-style trace markers are omitted from dispatch, pipe, and sync tracks, including lane-local `event_*`, local flag waits, intra-core handoff ops such as `wait_vec` / `cube_ready`, and collective `all*` sync ops; tests or debugging sessions that need those markers must enable the flag explicitly before running the simulator
- when optimizing from the trace view, keep `globvars.trace_event` at its default `False` unless the specific goal is to inspect sync/event behavior; turning it on adds sync markers that are useful for debugging but can distract from the steady-state scheduling picture you usually want for optimization work
- when optimizing cycle count from a trace, use the trace makespan as the objective: the cycle at which the last timed event finishes (`max(ts + dur)` over `ph == "X"` events). Do not optimize for the sum of all timed durations or total activated cycles; those overcount parallel overlap and can rank kernels differently from the real end-to-end completion time

8. Vec and micro execution

Key implementation files:

- vec runtime entry: `easyasc/simulator_v2/ops/vec/v.py`
- vec legacy-layout helper: `easyasc/simulator_v2/ops/vec/_legacy_vpipe.py`
- vec MTE2 path: `easyasc/simulator_v2/ops/vec/mte2.py`
- vec MTE3 path: `easyasc/simulator_v2/ops/vec/mte3.py`
- micro runtime: `easyasc/simulator_v2/ops/micro/runtime.py`
- pipe dispatch: `easyasc/simulator_v2/ops/dispatch.py`

Important facts:

- several vec operations still reuse the legacy layout executor through `ops/vec/_legacy_vpipe.py`, but they run inside the V2 runtime
- when `gm_to_ub_pad` or `l0c_to_gm_nz2nd` reports a source/destination view that is too small on an a2 workspace-mediated tail path, first inspect whether the workspace view was cropped in the column dimension; those bridge ops infer row-stride from the parent GM shape, so a cropped workspace column span can fail even when the logical tail math is correct
- all UB burst copy ops (`gm_to_ub_pad` in `ops/vec/mte2.py`, `ub_to_gm_pad` and `ub_to_l1_nz` in `ops/vec/mte3.py`) use `_linear_view_from_pointer` so that column-sliced UB views (`ub[:, 0:valid_n]` with `valid_n < buffer_cols`) round-trip through the underlying storage; any new burst-style op must mirror this pattern or it will falsely raise "view is too small" when the destination is non-contiguous
- regression coverage: `testcases/simulator/datamove/test_gm_to_ub_pad_column_slice.py`

Scalar-semantics reminder:

- `control_flow_bridge.py` preserves `Var` arithmetic as runtime scalar ops such as `var_add`, `var_mul`, and `var_div`
- `control_actor.py` and `ops/micro/runtime.py` must preserve float `Var` semantics for those ops; do not silently coerce float scalar expressions to int on the runtime path
- practical symptom of a broken float-scalar path: raw cube/UB data looks correct, but a later `vf()` stage that multiplies by a computed scale suddenly collapses to `0`

9. Best first files for simulator debugging

- `easyasc/kernelbase/kernelbase.py`
- `easyasc/simulator_v2/compat/control_flow_bridge.py`
- `easyasc/simulator_v2/compat/kernel_bridge.py`
- `easyasc/simulator_v2/runtime/control_actor.py`
- `easyasc/simulator_v2/runtime/task_memory_validator.py`
  - pre-dispatch memory-range checks now cover shared-tensor helpers, all current cube-pipe tensor ops, vec datamoves, V-pipe tensor ops including packed `compare` / `select`, repeat-layout vec instructions, `sort32`, `mergesort*`, `gather`, `scatter`, task-level micro shared-tensor ops, and `call_micro` dry-run validation
- `easyasc/simulator_v2/runtime/pipe_worker.py`
- `easyasc/simulator_v2/runtime/global_runtime.py`
- `testcases/simulator/`

Authoring declaration: parts of this article were generated with AI assistance (AIGC); for reference only.
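The local-event rules in the sync and control section (one flag per `SEvent`, two rotating flags with separate `set_count` / `wait_count` cursors per `DEvent`) can be pictured with a small toy model. This is only a sketch for reasoning about flag rotation, not the code in `easyasc/simulator_v2/sync/local_events.py`: the class names `SEventModel` and `DEventModel` and the non-blocking `try_wait()` stand-in for the blocking `wait()` are hypothetical, and the error raised by `DEventModel.set()` on an already-set flag is an assumption that mirrors the `SEvent` rule.

```python
class SEventModel:
    """Hypothetical toy model of a single-flag local event."""

    def __init__(self):
        self.flag = False

    def set(self):
        # The reference text says set() errors if the flag is already 1.
        if self.flag:
            raise RuntimeError("SEvent flag is already set")
        self.flag = True

    def try_wait(self):
        # Non-blocking stand-in for wait(): consume the flag if it is set.
        if not self.flag:
            return False
        self.flag = False
        return True


class DEventModel:
    """Hypothetical toy model of a two-flag event with independent cursors."""

    def __init__(self):
        self.flags = [False, False]
        self.set_count = 0   # producer-side cursor
        self.wait_count = 0  # consumer-side cursor

    def set(self):
        idx = self.set_count % 2  # alternates flag0, flag1, flag0, ...
        if self.flags[idx]:
            # Assumption: mirrors the SEvent double-set error.
            raise RuntimeError(f"DEvent flag{idx} is already set")
        self.flags[idx] = True
        self.set_count += 1
        return idx

    def try_wait(self):
        idx = self.wait_count % 2  # rotates on its own cursor
        if not self.flags[idx]:
            return None  # the real wait() would block here
        self.flags[idx] = False
        self.wait_count += 1
        return idx


if __name__ == "__main__":
    d = DEventModel()
    # Two sets fill both flags; waits then drain them in the same rotation.
    print([d.set(), d.set(), d.try_wait(), d.try_wait()])
```

In this model, blocking depends on which concrete flag the consumer cursor points at, which is why the reference text asks for trace and timing reasoning per real `flag_id` rather than per event name.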

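The makespan objective from the trace path section, `max(ts + dur)` over complete (`ph == "X"`) events, can be computed from an exported trace with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names `load_events`, `trace_makespan`, and `summed_duration` are hypothetical, and the payload is assumed to follow the generic Chrome trace layout (either a bare event list or an object with a `traceEvents` list); the exact shape produced by `dump_chrome_trace(...)` is not confirmed here.

```python
def load_events(payload):
    """Accept either a bare event list or a {"traceEvents": [...]} object."""
    if isinstance(payload, dict):
        return payload.get("traceEvents", [])
    return list(payload)


def trace_makespan(events):
    """Finish time of the last complete ("X") event: max(ts + dur)."""
    ends = [e["ts"] + e.get("dur", 0) for e in events if e.get("ph") == "X"]
    return max(ends) if ends else 0


def summed_duration(events):
    """Sum of all "X" durations; overcounts work that overlaps in time."""
    return sum(e.get("dur", 0) for e in events if e.get("ph") == "X")


if __name__ == "__main__":
    # Two pipes that overlap for 6 time units, plus a non-"X" marker
    # that must not contribute to either metric.
    demo = {
        "traceEvents": [
            {"name": "vec_op", "ph": "X", "ts": 0, "dur": 10},
            {"name": "mte2_copy", "ph": "X", "ts": 4, "dur": 10},
            {"name": "marker", "ph": "i", "ts": 50},
        ]
    }
    events = load_events(demo)
    print("makespan:", trace_makespan(events))
    print("summed durations:", summed_duration(events))
```

For the demo events the two numbers differ because the pipes overlap, which illustrates why the summed duration is a poor proxy for end-to-end completion time.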