
CANNBot Simulator V2 Reference

article 2026/5/9 21:24:14

Simulator V2 Reference

Repository: cannbot-skills. CANNBot is a series of agents that improve development efficiency for CANN; this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when the question is specifically about how simulator execution works now. Do not use it as a replacement for kernel-authoring or general architecture docs.

Goal

Capture the current simulator execution path so future work does not rely on removed or stale `easyasc/simulator/` assumptions.

1. Current default

The repository's simulator path is now the V2 runtime.

Current behavior:

- `OpExec(..., simulator=True)` enables simulator execution
- `OpExec(..., simulator="v2")` is an accepted spelling for the same path
- `OpExec(..., simulator="legacy")` is still accepted by `OpExec`, but it does not select a separate old runtime anymore; it still routes to V2
- `KernelBase.run_sim()` always calls `_run_sim_v2()`

Practical rule:

- do not document or debug a separate `easyasc/simulator/` runtime as if it were still active

2. How a kernel becomes a V2 program

The simulator build entry lives in `easyasc/kernelbase/kernelbase.py`.

The selection order is:

1. custom builder via `kernel._simulator_v2_program_builder`
2. prebuilt program via `kernel._simulator_v2_program`
3. auto analysis + auto bridge selection

Auto bridge selection:

- if the instruction stream contains control-flow, topology queries, `call_micro`, `VarList`, or cross-lane sync helpers, V2 uses `easyasc/simulator_v2/compat/control_flow_bridge.py`
- otherwise V2 uses the narrow linear bridge in `easyasc/simulator_v2/compat/kernel_bridge.py`

Important difference:

- `control_flow_bridge.py` preserves loops/conditionals and defers resolution to the runtime
- `kernel_bridge.py` only covers a narrower linear lowered-instruction subset
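The selection order and bridge choice above can be sketched in a few lines. This is an illustrative model only, not the code in `kernelbase.py`: `FakeKernel`, `choose_bridge`, `resolve_program`, and the opcode names in `CONTROL_FLOW_TRIGGERS` are hypothetical; only the two `_simulator_v2_*` attribute names and the two bridge file names come from the text.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Placeholder opcode names (assumption); the real trigger list lives in the repo.
CONTROL_FLOW_TRIGGERS = {"for", "if", "call_micro", "varlist",
                         "topology_query", "cross_lane_sync"}


@dataclass
class FakeKernel:
    """Hypothetical stand-in for a kernel object."""
    instructions: list
    _simulator_v2_program_builder: Optional[Callable[["FakeKernel"], Any]] = None
    _simulator_v2_program: Optional[Any] = None


def choose_bridge(instructions):
    """Use the control-flow bridge as soon as one trigger opcode appears."""
    if any(op in CONTROL_FLOW_TRIGGERS for op in instructions):
        return "control_flow_bridge"
    return "kernel_bridge"


def resolve_program(kernel):
    """Return (source, program) following the documented priority order."""
    if kernel._simulator_v2_program_builder is not None:
        return "custom_builder", kernel._simulator_v2_program_builder(kernel)
    if kernel._simulator_v2_program is not None:
        return "prebuilt", kernel._simulator_v2_program
    return "auto", choose_bridge(kernel.instructions)
```

The point of the sketch is the priority: a custom builder shadows a prebuilt program, and auto bridge selection only runs when neither is set.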
3. Runtime stack

The runtime is split across these layers:

- parent coordinator: `easyasc/simulator_v2/runtime/global_runtime.py`
- core process wrapper: `easyasc/simulator_v2/runtime/core_process.py`
- per-core runtime: `easyasc/simulator_v2/runtime/core_runtime.py`
- lane-level control interpreter: `easyasc/simulator_v2/runtime/control_actor.py`
- pipe worker threads: `easyasc/simulator_v2/runtime/pipe_worker.py`
- pipe executors: `easyasc/simulator_v2/ops/`

Execution shape:

- one parent `GlobalRuntime`
- one child `CoreProcess` per simulated core
- inside each active core, one `ControlActor` per active lane
- inside each lane, one threaded `PipeWorker` per logical pipe

Launch rule:

- start simulator repros from a real `.py` file, not from `stdin` entry points such as `python - <<'PY'` or piped scripts
- V2 uses multiprocessing during startup, and Python spawn must be able to re-import `__main__` from a real filesystem path; `stdin` entry points appear as `<stdin>` and break child startup
- when the launcher lives outside the repo root, include the repo root in `PYTHONPATH` so child processes can import local modules consistently
- safe pattern: `PYTHONPATH=/abs/path/to/repo python /tmp/repro.py`

Completion / shutdown facts:

- pipe workers already stop through mailbox sentinels; the thread layer does not need a special end instruction
- parent / child completion now uses a one-shot status channel that the parent polls while joining
- `GlobalRuntime.run()` uses one global execution deadline across all active cores, not a full timeout budget per core in sequence

4. Planning and activation

Core and lane activation are resolved by:

- `easyasc/simulator_v2/config.py`
- `easyasc/simulator_v2/runtime/execution_plan.py`
- `easyasc/simulator_v2/helpers.py`

Key facts:

- default core count follows the active device family (`950` -> 32, `b3` -> 20)
- V2 can skip inactive lanes when a program only uses a subset of cube/vec lanes
- collective ops (`allcube_*`, `allvec_*`) affect lane-activation planning
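The global-deadline rule for `GlobalRuntime.run()` described under the runtime stack can be illustrated with a small sketch. It assumes plain `threading.Thread` objects as stand-ins for child core processes, and `join_with_global_deadline` is a hypothetical helper, not a function from the repository. What it shows is that every join draws from one shared budget instead of restarting a full timeout per core.

```python
import threading
import time


def join_with_global_deadline(workers, timeout_s):
    """Join every worker against one shared deadline.

    Returns the names of the workers that were still alive when
    their turn to be joined ran out of budget.
    """
    deadline = time.monotonic() + timeout_s
    unfinished = []
    for worker in workers:
        # Each join only gets what is left of the global budget.
        remaining = max(0.0, deadline - time.monotonic())
        worker.join(remaining)
        if worker.is_alive():
            unfinished.append(worker.name)
    return unfinished
```

With a per-core budget, N stuck cores would delay the failure report by roughly N times the timeout; with a shared deadline the worst case stays close to a single timeout.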
5. Memory and tensor state

Shared tensor setup lives in:

- `easyasc/simulator_v2/memory/shared_tensor.py`
- `easyasc/simulator_v2/memory/shared_tensor_store.py`
- `easyasc/simulator_v2/memory/tensor_view.py`
- `easyasc/simulator_v2/memory/workspace.py`
- `easyasc/simulator_v2/memory/local_memory.py`

Important facts:

- `OpExec` clones input tensors into `GMTensor.data`
- V2 copies that payload into the shared runtime tensor store before execution
- after execution, V2 copies runtime tensors back into the bound `GMTensor.data`
- workspaces and local buffers are represented as shared-tensor specs in program metadata
- child-core local tensors now go through a bank-aware allocator (`UB0` / `UB1` / `L1` / `L0A` / `L0B` / `L0C`); over-capacity local allocations fail before pipe execution starts
- runtime-created local slice snapshots must treat a root local tensor's `SharedTensorSpec.storage_offset` as allocator bookkeeping in bytes rather than as an extra in-storage element offset; only nested local views should re-apply a parent `storage_offset` when `control_actor.py` materializes a dynamic slice
- simulator-side GM `atomic_add` / `atomic_max` / `atomic_min` now serialize their read-modify-write sections through a shared store-wide atomic lock so cross-core atomic writebacks do not lose updates under contention

Regression note:

- `testcases/simulator/memory/test_simulator_v2_slice_tensor.py` covers the sliced-UB vec-mul case where several prefix UB allocations push the sliced root tensor onto a non-zero local bank offset before runtime snapshotting
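The store-wide atomic lock mentioned above follows a common pattern: hold one lock for the whole read-modify-write section. Below is a minimal sketch of that pattern, assuming threads in place of core processes and a dict of Python lists in place of the shared tensor store; `SharedStore` and `run_contended_adds` are hypothetical names.

```python
import threading


class SharedStore:
    """Hypothetical tensor store: name -> list of numbers, one lock for all atomics."""

    def __init__(self):
        self._data = {}
        self._atomic_lock = threading.Lock()

    def create(self, name, size, fill=0):
        self._data[name] = [fill] * size

    def read(self, name, index):
        return self._data[name][index]

    def _atomic_update(self, name, index, value, combine):
        # Read, combine and write back while holding the store-wide lock.
        with self._atomic_lock:
            current = self._data[name][index]
            self._data[name][index] = combine(current, value)

    def atomic_add(self, name, index, value):
        self._atomic_update(name, index, value, lambda a, b: a + b)

    def atomic_max(self, name, index, value):
        self._atomic_update(name, index, value, max)

    def atomic_min(self, name, index, value):
        self._atomic_update(name, index, value, min)


def run_contended_adds(store, name, n_threads=8, n_iters=1000):
    """Hammer element 0 from several threads and return the final value."""
    def body():
        for _ in range(n_iters):
            store.atomic_add(name, 0, 1)

    threads = [threading.Thread(target=body) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store.read(name, 0)
```

A single lock is the simplest design that cannot lose updates; it trades throughput for correctness, which is usually acceptable in a simulator.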
6. Sync and control

The main sync/control pieces are:

- intra-core sync: `easyasc/simulator_v2/sync/intra_core_sync.py`
- collective sync: `easyasc/simulator_v2/sync/collective_sync.py`
- lane-local flags: `easyasc/simulator_v2/sync/local_flags.py`
- lane-local events: `easyasc/simulator_v2/sync/local_events.py`
- worker mailboxes: `easyasc/simulator_v2/sync/mailbox.py`

Important fact:

- collective sync state is process-shared at runtime; `GlobalRuntime` snapshots the parent `CollectiveSync` and each child core reloads that shared state instead of creating a private per-process coordinator
- lane-local `barrier(pipe=...)` currently has special runtime behavior only for `barrier(ALL)`; non-`ALL` barriers are preserved as control instructions but act as no-ops in the V2 runtime main loop
- practical consequence for kernel debugging: `bar_v()` / `bar_mte2()` / other single-pipe barriers do not serialize cross-pipe edges such as `V -> MTE2` on the simulator path; when a repro needs a simulator-visible local drain across pipe domains, use `bar_all()`
- `setflag` / `waitflag` still use the phase-based `LocalFlagTable`, but local `SEvent` / `DEvent` no longer do: V2 now models them with a per-lane flag bank keyed by `(src_pipe, dst_pipe, flag_id)` and a bool value per flag
- `create_sevent` allocates one `flag_id` from the lane-local pool for its `(src_pipe, dst_pipe)` pair; `create_devent` allocates two consecutive ids from that same pair-local pool
- `SEvent.set()` sets its single flag to `1` and errors if it is already `1`; `SEvent.wait()` blocks until that flag becomes `1`, then clears it back to `0`
- `DEvent` keeps two independent bool flags plus separate `set_count` / `wait_count` cursors: the producer-side `set` path alternates `flag0, flag1, flag0, ...`, and the consumer-side `wait` path alternates on its own cursor over the same two flags
- `event_setall` is modeled as repeated `set()` calls on the same event object rather than as a special bulk primitive; for `DEvent` that usually means setting both flags in rotation order, while `SEvent.setall()` will replay `set()` twice and therefore errors on the second call if the single flag is still set
- `event_release` is modeled as repeated `wait()` calls: `SEvent.release()` performs one wait, while `DEvent.release()` performs one wait and then performs a second wait only when a second outstanding token is already pending on the other rotated flag
- practical consequence for trace/timing work: local event blocking must now be reasoned about per real `flag_id`, not per `event_name`
- regression coverage: `testcases/simulator/bridge/test_simulator_v2_control_flow.py`

When debugging a hang:

- inspect the original failing lane error first
- then inspect the sync state / timeout diagnostic
- do not assume the timeout itself is the root cause

When a child core raises an exception:

- `GlobalRuntime.run()` now raises the combined per-core traceback text directly
- do not rely on a generic parent-side wrapper message; the actionable failure should already be in the thrown exception string
- pipe-worker instruction failures now print an immediate `stderr` log with `lane/pipe/opname/error`, control-side `wait_*` paths poll worker failures while waiting, and `CoreRuntime.join()` prefers surfacing the more actionable worker/task failure over a secondary sync-timeout symptom when multiple lane actors fail
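The `SEvent` / `DEvent` flag-bank rules above are easier to follow as a small state model. The sketch below is single-threaded and is not the code in `local_events.py`: `FlagBank` and `WouldBlock` are hypothetical, `wait()` raises `WouldBlock` where the real runtime would block, and it assumes that `DEvent` flags use the same "already set" error check as `SEvent` (the text only states that check for `SEvent`).

```python
class WouldBlock(Exception):
    """Raised where the real runtime would block (sketch is single-threaded)."""


class FlagBank:
    """Per-lane bool flags keyed by (src_pipe, dst_pipe, flag_id)."""

    def __init__(self):
        self._flags = {}
        self._next_id = {}  # next free flag_id per (src_pipe, dst_pipe) pair

    def allocate(self, src_pipe, dst_pipe, count):
        start = self._next_id.get((src_pipe, dst_pipe), 0)
        self._next_id[(src_pipe, dst_pipe)] = start + count
        keys = [(src_pipe, dst_pipe, start + i) for i in range(count)]
        for key in keys:
            self._flags[key] = False
        return keys

    def set(self, key):
        if self._flags[key]:
            raise RuntimeError(f"flag {key} is already set")
        self._flags[key] = True

    def wait(self, key):
        if not self._flags[key]:
            raise WouldBlock(key)
        self._flags[key] = False  # consume the token


class SEvent:
    def __init__(self, bank, src_pipe, dst_pipe):
        self._bank = bank
        (self.key,) = bank.allocate(src_pipe, dst_pipe, 1)

    def set(self):
        self._bank.set(self.key)

    def wait(self):
        self._bank.wait(self.key)


class DEvent:
    def __init__(self, bank, src_pipe, dst_pipe):
        self._bank = bank
        self.keys = bank.allocate(src_pipe, dst_pipe, 2)  # two consecutive ids
        self.set_count = 0
        self.wait_count = 0

    def set(self):
        self._bank.set(self.keys[self.set_count % 2])
        self.set_count += 1

    def wait(self):
        self._bank.wait(self.keys[self.wait_count % 2])
        self.wait_count += 1
```

In this model a `DEvent` tolerates two outstanding `set()` calls before a `wait()` is required, while an `SEvent` tolerates one, which is why blocking has to be reasoned about per `flag_id`.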
7. Trace path

Trace recording lives in:

- `easyasc/simulator_v2/trace/recorder.py`
- `easyasc/simulator_v2/trace/merge.py`
- `easyasc/simulator_v2/trace/chrome.py`
- a5 cycle-model profile and estimators: `easyasc/simulator_v2/timing/`

Runtime flow:

- each core records its own events
- parent runtime merges them after execution
- `dump_chrome_trace(...)` exports Chrome/Perfetto-style JSON
- runtime event timestamps originate from `time.monotonic()`
- exported Chrome traces normalize those timestamps into a per-run relative axis instead of replacing them with event-order indices
- exported `dur` now reflects measured task/wait spans when the runtime recorded them; zero-duration control markers still use a tiny fallback width only to stay visible in viewers
- sync-heavy kernels may now emit explicit `sync` trace events for wait/ready phases in addition to pipe execution events
- on a5 (device type `950`), the runtime can now switch trace timing to a cycle-model domain driven by the JSON profile under `timing/`; in that mode `easyasc_time_domain=cycle` is exported in the trace payload and task args include the modeling breakdown
- current a5 cycle-model defaults treat one ordinary V-pipe instruction as `2` cycles
- for `call_micro` / `vf()` timing, register <-> UB shuffle instructions are counted as `0` cycle: `micro_ub2reg`, `micro_reg2ub`, `micro_ub2regcont`, `micro_reg2ubcont`
- in cycle-model mode, direct control-side waits (`event_wait`, `wait_vec`, `wait_cube`, collective waits) now advance the control actor's cycle cursor, but `event_set` no longer acts as a lane-global block for later unrelated pipe dispatch; its ready time is derived from the completed source pipe, and unrelated pipes can start as soon as their own event dependencies are satisfied
- lane-local `event_wait` / `event_release` can now be lowered into the destination pipe worker queue, so the blocking happens on that pipe thread instead of only on the control actor; `event_set` / `event_setall` intentionally stay control-side because their position in the instruction stream still defines autosync lifetime boundaries
- trace export now consults `globvars.trace_event` (default `False`): when disabled, all sync-style trace markers are omitted from dispatch, pipe, and sync tracks, including lane-local `event_*`, local flag waits, intra-core handoff ops such as `wait_vec` / `cube_ready`, and collective `all*` sync ops; tests or debugging sessions that need those markers must enable the flag explicitly before running the simulator
- when optimizing from the trace view, keep `globvars.trace_event` at its default `False` unless the specific goal is to inspect sync/event behavior; turning it on adds sync markers that are useful for debugging but can distract from the steady-state scheduling picture you usually want for optimization work
- when optimizing cycle count from a trace, use the trace makespan as the objective: the cycle at which the last timed event finishes (`max(ts + dur)` over `ph == "X"` events). Do not optimize for the sum of all timed durations or total activated cycles; those overcount parallel overlap and can rank kernels differently from the real end-to-end completion time
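The makespan objective from the last bullet can be computed directly from an exported trace. The sketch below assumes the standard Chrome trace layout (either a bare JSON array of events or an object with a `traceEvents` list, with complete events marked `"ph": "X"`); the helper names are hypothetical and nothing here depends on `dump_chrome_trace(...)` internals.

```python
import json


def load_events(payload):
    """Accept a JSON string or parsed object in either Chrome trace layout."""
    data = json.loads(payload) if isinstance(payload, str) else payload
    return data["traceEvents"] if isinstance(data, dict) else data


def timed_events(events):
    """Keep only complete ('X') events, which carry both ts and dur."""
    return [event for event in events if event.get("ph") == "X"]


def makespan(events):
    """Finish time of the last timed event: max(ts + dur)."""
    return max((e["ts"] + e.get("dur", 0) for e in timed_events(events)), default=0)


def summed_duration(events):
    """Sum of all timed durations; overcounts work that overlaps in time."""
    return sum(e.get("dur", 0) for e in timed_events(events))
```

For example, two 10-cycle tasks on different pipes starting at cycles 0 and 2 have a makespan of 12 but a summed duration of 20, while two back-to-back 8-cycle tasks have 16 for both; the sum would wrongly prefer the second kernel.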
8. Vec and micro execution

Key implementation files:

- vec runtime entry: `easyasc/simulator_v2/ops/vec/v.py`
- vec legacy-layout helper: `easyasc/simulator_v2/ops/vec/_legacy_vpipe.py`
- vec MTE2 path: `easyasc/simulator_v2/ops/vec/mte2.py`
- vec MTE3 path: `easyasc/simulator_v2/ops/vec/mte3.py`
- micro runtime: `easyasc/simulator_v2/ops/micro/runtime.py`
- pipe dispatch: `easyasc/simulator_v2/ops/dispatch.py`

Important fact:

- several vec operations still reuse the legacy layout executor through `ops/vec/_legacy_vpipe.py`, but they run inside the V2 runtime
- when `gm_to_ub_pad` or `l0c_to_gm_nz2nd` reports a source/destination view that is too small on an a2 workspace-mediated tail path, first inspect whether the workspace view was cropped in the column dimension; those bridge ops infer row-stride from the parent GM shape, so a cropped workspace column span can fail even when the logical tail math is correct
- all UB burst copy ops (`gm_to_ub_pad` in `ops/vec/mte2.py`, `ub_to_gm_pad` and `ub_to_l1_nz` in `ops/vec/mte3.py`) use `_linear_view_from_pointer` so that column-sliced UB views (`ub[:, 0:valid_n]` with `valid_n < buffer_cols`) round-trip through the underlying storage; any new burst-style op must mirror this pattern or it will falsely raise "view is too small" when the destination is non-contiguous
- regression coverage: `testcases/simulator/datamove/test_gm_to_ub_pad_column_slice.py`

Scalar-semantics reminder:

- `control_flow_bridge.py` preserves `Var` arithmetic as runtime scalar ops such as `var_add`, `var_mul`, and `var_div`
- `control_actor.py` and `ops/micro/runtime.py` must preserve float `Var` semantics for those ops; do not silently coerce float scalar expressions to int on the runtime path
- practical symptom of a broken float-scalar path: raw cube/UB data looks correct, but a later `vf()` stage that multiplies by a computed scale suddenly collapses to `0`
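The float-scalar symptom in the reminder above is easy to reproduce in isolation. The sketch below is a toy evaluator, not the runtime's: `eval_scalar` and `eval_scalar_coerced` are hypothetical, and `var_div` is modelled as Python true division, which is an assumption about its semantics.

```python
import operator

# Op names come from the text; mapping var_div to true division is an assumption.
SCALAR_OPS = {
    "var_add": operator.add,
    "var_mul": operator.mul,
    "var_div": operator.truediv,
}


def eval_scalar(op, lhs, rhs):
    """Evaluate a scalar op and keep whatever numeric type falls out."""
    return SCALAR_OPS[op](lhs, rhs)


def eval_scalar_coerced(op, lhs, rhs):
    """Buggy variant: silently truncates the result to int."""
    return int(SCALAR_OPS[op](lhs, rhs))


def apply_scale(values, scale):
    """Stand-in for a later vector stage that multiplies by the scale."""
    return [value * scale for value in values]
```

With `scale = eval_scalar("var_div", 1.0, 8.0)` the data is scaled by `0.125`; with the coerced variant the scale becomes `0` and every output element collapses to zero even though the input data is intact.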
9. Best first files for simulator debugging

- `easyasc/kernelbase/kernelbase.py`
- `easyasc/simulator_v2/compat/control_flow_bridge.py`
- `easyasc/simulator_v2/compat/kernel_bridge.py`
- `easyasc/simulator_v2/runtime/control_actor.py`
- `easyasc/simulator_v2/runtime/task_memory_validator.py`
  - pre-dispatch memory-range checks now cover shared-tensor helpers, all current cube-pipe tensor ops, vec datamoves, V-pipe tensor ops including packed `compare` / `select`, repeat-layout vec instructions, `sort32`, `mergesort*`, `gather`, `scatter`, task-level micro shared-tensor ops, and `call_micro` dry-run validation
- `easyasc/simulator_v2/runtime/pipe_worker.py`
- `easyasc/simulator_v2/runtime/global_runtime.py`
- `testcases/simulator/`

Author's statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
