
[202511] Cosmos-Predict2.5, Part 02 (Model): World Simulation with Video Foundation Models for Physical AI [Network architecture: DiT] [Visual tokenizer: WAN2.1 VAE] [16 fps]

article 2026/4/29 1:08:18

World Simulation with Video Foundation Models for Physical AI

3. Method

In this section, we first discuss our flow-matching formulation and then present the network architecture.

3.1. Flow Matching

We adopt flow matching (FM) (Lipman et al., 2022) for training diffusion models because of its conceptual simplicity and practical effectiveness. While FM and the Elucidated Diffusion Model (EDM) (Karras et al., 2022), which was used in [Cosmos-Predict1] (NVIDIA, 2025), are mathematically equivalent in terms of their forward and backward diffusion processes, they differ in how the denoising network is parameterized (Gao et al., 2025). In EDM, the preconditioning coefficients are chosen so that both the inputs and outputs of the denoising network are approximately standardized Gaussians, which simplifies training and improves stability. In contrast, FM selects coefficients that make the denoising network predict the velocity of the diffusion trajectory.
This velocity-based formulation not only provides a more direct training target but also tends to yield smoother optimization and improved sample quality in practice.

Formally, given a data sample $\mathbf{x}$ (image or video), a noise vector $\boldsymbol{\epsilon} \sim \mathcal{N}(0, I)$, and a timestep $t \in [0, 1]$ drawn from a logit-normal distribution, the interpolated latent $\mathbf{x}_t$ is defined as

$$\mathbf{x}_t = (1 - t)\,\mathbf{x} + t\,\boldsymbol{\epsilon}.$$

The corresponding ground-truth velocity is

$$\mathbf{v}_t = \boldsymbol{\epsilon} - \mathbf{x}.$$

The model is trained to predict $\mathbf{v}_t$ by minimizing the mean squared error (MSE) between the prediction and ground truth:

$$\mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{x}, \boldsymbol{\epsilon}, \mathbf{c}, t} \left\| \mathbf{u}(\mathbf{x}_t, t, \mathbf{c}; \boldsymbol{\theta}) - \mathbf{v}_t \right\|^2,$$

where $\mathbf{c}$ denotes conditioning information associated with $\mathbf{x}$ (e.g., text embeddings, reference frames, and other conditional inputs), $\boldsymbol{\theta}$ represents the model parameters, and $\mathbf{u}(\cdot; \boldsymbol{\theta})$ is the predicted velocity function.

High-resolution content often contains significant redundancy, since nearby pixels are highly correlated. As a result, if the level of injected noise is too small, the model may fail to "break apart" this correlation, making it harder for the FM model to learn meaningful structure (Esser et al., 2024; Hoogeboom et al., 2023; Chen, 2023; Atzmon et al., 2024). To address this, we deliberately bias the training process toward higher noise levels. Specifically, we adopt the shifted logit-normal distribution (Esser et al., 2024).
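As a rough illustration, the interpolation, velocity target, and squared-error objective of this section can be sketched in plain Python for a single sample stored as a flat list of floats. The function names, the toy `zero_model` callable, the logit-normal parameters `mean=0.0, std=1.0`, and the value `beta=5.0` are assumptions made for this sketch, not the authors' implementation. The `beta` shift is the monotone transformation $t_s = \beta t / (1 + (\beta - 1) t)$ of the shifted logit-normal distribution.

```python
import math
import random


def sample_shifted_timestep(rng, beta=1.0, mean=0.0, std=1.0):
    """Draw t from a logit-normal distribution, then shift it toward higher noise."""
    z = rng.gauss(mean, std)
    t = 1.0 / (1.0 + math.exp(-z))  # sigmoid of a Gaussian lies in (0, 1)
    return beta * t / (1.0 + (beta - 1.0) * t)  # beta = 1 leaves t unchanged


def interpolate(x, eps, t):
    """x_t = (1 - t) * x + t * eps, element-wise."""
    return [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]


def velocity_target(x, eps):
    """v_t = eps - x, element-wise (independent of t for this linear path)."""
    return [ei - xi for xi, ei in zip(x, eps)]


def fm_loss(predict_velocity, x, eps, t, cond=None):
    """Squared error between the predicted and ground-truth velocity for one sample."""
    x_t = interpolate(x, eps, t)
    v_t = velocity_target(x, eps)
    u = predict_velocity(x_t, t, cond)
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v_t))


rng = random.Random(0)
x = [0.5, -1.0, 2.0]                         # toy "data sample"
eps = [rng.gauss(0.0, 1.0) for _ in x]       # noise vector
t = sample_shifted_timestep(rng, beta=5.0)   # hypothetical shift value
zero_model = lambda x_t, t, cond: [0.0] * len(x_t)  # stand-in for the network
print(fm_loss(zero_model, x, eps, t))
```

A predictor that returns the true velocity drives this loss to exactly zero, and with `beta > 1` every sampled timestep moves toward 1, the pure-noise end of the interpolation.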
In practice, we first sample $t$ from a logit-normal distribution and then apply the monotone transformation

$$t_s = \frac{\beta t}{1 + (\beta - 1)\,t},$$

where $\beta$ is a shift hyper-parameter. This transformation reweights the distribution so that $t_s$ values are skewed toward higher noise levels. Intuitively, increasing $\beta$ exposes the model to noisier inputs more often, which helps it learn to reconstruct the signal even when correlations are heavily disrupted. When $\beta = 1$, no shift is applied and $t_s = t$.

Table 3: Configuration details of the [Cosmos-Predict2.5] models.

Configuration              Cosmos-Predict2.5-2B   Cosmos-Predict2.5-14B
Number of Layers           32                     36
Model Dimension            2,048                  5,120
FFN Hidden Dimension       8,192                  20,480
AdaLN-LoRA Dimension       256                    256
Number of Attention Heads  16                     40
Head Dimension             128                    128
MLP Activation             GELU                   GELU
Positional Embedding       3D RoPE                3D RoPE

3.2. Network Architecture

In [Cosmos-Predict2.5], we largely reuse the denoising network $\mathbf{u}(\cdot; \boldsymbol{\theta})$ introduced in [Cosmos-Predict1]'s DiT (NVIDIA, 2025), which is based on a latent diffusion model. The main architectural change is that we remove the absolute positional embeddings and keep only the relative positional embeddings. While absolute embeddings provide a fixed spatial or temporal reference, they limit the model's ability to generalize to resolutions or sequence lengths not seen during training. By removing them, [Cosmos-Predict2.5] gains greater flexibility for handling higher-resolution content and longer video sequences during post-training.
This design choice is motivated by recent progress in long-context large language models, where alternative positional encoding strategies (Peng et al., 2023; bloc97, 2023) have proven effective at extending context length without sacrificing performance. The overall velocity prediction network design is illustrated in Fig. 2.

We adopt a different set of auxiliary models in [Cosmos-Predict2.5] compared to [Cosmos-Predict1], with improvements in both visual and textual representations. For the visual tokenizer, we use WAN2.1 VAE (Wan et al., 2025), a causal variational autoencoder that compresses video sequences with a compression rate of $4 \times 8 \times 8$ across the time, height, and width dimensions, respectively. This compression greatly reduces the computational cost while preserving essential spatiotemporal structure. On top of this representation, we apply the same $1 \times 2 \times 2$ patchification strategy to compress latent features further. We train our model on 16 fps videos to generate 93 frames at a time, which corresponds to 24 latent frames.
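The frame counts above can be checked with a little arithmetic. The sketch below assumes the usual causal-VAE convention in which the first frame is encoded on its own and every following group of 4 frames becomes one latent frame, which is consistent with 93 frames mapping to 24 latent frames. The 704 x 1280 resolution and the helper names are hypothetical, chosen only to make the example concrete.

```python
def latent_shape(frames, height, width, vae=(4, 8, 8)):
    """Latent (T, H, W) after a causal VAE with the given compression factors."""
    t_factor, h_factor, w_factor = vae
    # Assumption: the first frame is encoded alone, the rest in groups of t_factor.
    latent_t = 1 + (frames - 1) // t_factor
    return latent_t, height // h_factor, width // w_factor


def token_count(latent, patch=(1, 2, 2)):
    """Number of transformer tokens after patchifying the latent volume."""
    return (latent[0] // patch[0]) * (latent[1] // patch[1]) * (latent[2] // patch[2])


latent = latent_shape(93, 704, 1280)  # hypothetical 704 x 1280 clip
tokens = token_count(latent)
duration_s = 93 / 16                  # frames divided by frames per second
print(latent, tokens, duration_s)
```

For the assumed resolution this gives a 24 x 88 x 160 latent and 84,480 tokens per clip, and 93 frames at 16 fps last 5.8125 seconds.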
Each of the generated videos is about 5.8 seconds long.

For the text encoder, we leverage [Cosmos-Reason1] (NVIDIA, 2025) instead of the T5 encoder used in [Cosmos-Predict1]. Unlike standard approaches that rely on the output of a single transformer layer, we concatenate activations across multiple blocks for each token and project them into a 1024-dimensional space, inspired by Wang et al. (2025). This yields a sequence of embedding vectors that more faithfully captures both local and global linguistic context. During training, these embeddings are integrated into the denoising process via cross-attention layers, enabling textual prompts to directly guide video generation. Moreover, the vision encoder in [Cosmos-Reason1] supports additional visual conditional inputs for style control, which we leave as an exciting direction for future exploration.

Each [Cosmos-Predict2.5] model is designed to operate in three modes: Text2World, Image2World, and Video2World. In the Text2World setting, generation is guided solely by a text prompt. In Image2World, the model receives both a text prompt and a reference image, allowing it to ground the generated video in specific visual content.
In Video2World, the model further extends this conditioning to video sequences, enabling temporally coherent continuation or transformation of input clips.

Figure 2: Overall architecture of [Cosmos-Predict2.5]. As shown on the right, in the latent space, the model applies repeated blocks of self-attention, cross-attention, and feed-forward MLP layers, modulated by adaptive layer normalization (scale, shift, gate) for a given time step $t$
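To make the modulation named in the caption concrete, the sketch below applies scale, shift, and gate to one token vector around an arbitrary sub-layer. It follows the common DiT convention `x + gate * f(norm(x) * (1 + scale) + shift)`, which is an assumption here rather than something the text specifies. In a real model the three vectors would be produced from the time-step embedding, and all names are illustrative.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def modulated_residual(x, sublayer, scale, shift, gate):
    """Assumed DiT-style update: x + gate * sublayer(norm(x) * (1 + scale) + shift)."""
    h = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    out = sublayer(h)  # stands in for self-attention, cross-attention, or the MLP
    return [xi + g * oi for xi, g, oi in zip(x, gate, out)]


x = [1.0, 2.0, 3.0]
identity = lambda h: h                  # placeholder sub-layer
zeros, ones = [0.0] * 3, [1.0] * 3
print(modulated_residual(x, identity, zeros, zeros, zeros))  # gate = 0 keeps x unchanged
print(modulated_residual(x, identity, zeros, zeros, ones))
```

With a zero gate the block reduces to the identity, so the gate controls how strongly each sub-layer contributes for a given time step.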
