
A Close Reading of a Classic: ZeRO

article 2026/3/16 6:18:36

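Paper link: https://arxiv.org/pdf/1910.02054

Approaches to parallel training:
1. Pipeline Parallelism (PP)
2. Model Parallelism (MP)
3. Data Parallelism (DP)

So, how can we overcome the limitations of existing solutions and train large models more efficiently? To answer this question, we first analyze the full spectrum of memory consumption of the existing systems on model training and classify it into two parts: 1) For large models, the majority of the memory is occupied by model states, which include the optimizer states (such as momentum and variances in Adam [6]), gradients, and parameters. 2) The remaining memory is consumed by activations, temporary buffers and unusable fragmented memory, which we refer to collectively as residual states. We develop ZeRO (Zero Redundancy Optimizer) to optimize memory efficiency on both while obtaining high compute and communication efficiency. As these two parts face different challenges, we develop and discuss their solutions correspondingly.

The paper provides a new form of data parallelism, ZeRO-DP.

It also proposes ZeRO-R to optimize the residual memory consumed by these three factors respectively. 1) For activations (stored from the forward pass in order to perform the backward pass), we noticed checkpointing [7] helps but is not sufficient for large models. Thus ZeRO-R optimizes activation memory by identifying and removing activation replication in existing MP approaches through activation partitioning. It also offloads activations to CPU when appropriate. 2) ZeRO-R defines an appropriate size for temporary buffers to strike a balance between memory and computation efficiency. 3) We observe fragmented memory during training due to variations in the lifetime of different tensors. Lack of contiguous memory due to fragmentation can cause memory allocation failure, even when enough free memory is available. ZeRO-R proactively m…

Finally, here is a summary generated by GPT.

The paper is "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He of Microsoft. It later became the core technology of DeepSpeed and is one of the most important papers in the field of large-model training systems. It proposes ZeRO (Zero Redundancy Optimizer), whose core goal is to remove the GPU memory bottleneck in large-model training. The summary covers, in order: the problem, the method, the three ZeRO stages, and the results.

1. The problem the paper addresses

As deep learning models (BERT, GPT, and others) grow rapidly, training runs into one core bottleneck: GPU memory. In standard Data Parallelism, every GPU must hold the complete model states. Three kinds of state are kept during training:

Parameters
Gradients
Optimizer states (for example, Adam's m and v)

The optimizer states are the largest of the three. With Adam in mixed-precision (FP16) training:

memory per parameter ≈ 2 bytes (FP16 parameters) + 2 bytes (FP16 gradients) + 12 bytes (FP32 optimizer states: master parameters, momentum, variance)

That is, each parameter needs about 16 bytes. For example:

Model size         Memory needed
1B parameters      ~16 GB
10B parameters     ~160 GB

This is why plain data parallelism cannot train large models.

2. Limitations of traditional methods

The paper analyzes three existing approaches.

(1) Data Parallelism
Characteristics: every GPU holds the complete model; only the batch is split.
Problem: memory is O(model size), so it does not scale with model size.

(2) Model Parallelism
Representatives: Megatron-LM, Mesh TensorFlow.
Method: split a single layer across multiple GPUs.
Problems: very high communication volume, complex programming, poor multi-node efficiency. In the paper's experiment, a 40B model trained across nodes reaches less than 5% of hardware peak.

(3) Pipeline Parallelism
Representative: GPipe.
Problems: pipeline bubbles, complex implementation.

3. Core idea: Zero Redundancy

The paper starts from one key observation: data parallelism carries a large amount of redundant state. Under DP, each of GPU0, GPU1, GPU2, and GPU3 stores

parameters
gradients
optimizer states

and all of these are identical copies. ZeRO's idea is to eliminate the redundant copies in data parallelism so that each GPU holds only a part of the state.

4. The three ZeRO stages (the most important part)

ZeRO partitions the three kinds of state step by step.

Stage 1: Optimizer State Partition
Only the optimizer states are partitioned. For example:

GPU0 - optimizer 0-25%
GPU1 - optimizer 25-50%
GPU2 - optimizer 50-75%
GPU3 - optimizer 75-100%

while

parameters - full copy
gradients - full copy

Benefit: model-state memory is reduced by up to 4× (the limit approached as the number of GPUs grows).

Stage 2: Gradient Partition
The gradients are partitioned as well. The situation is now:

State          Partitioned?
parameters     no
gradients      yes
optimizer      yes

Benefit: model-state memory is reduced by up to ≈ 8×.

Stage 3: Parameter Partition
Finally the parameters are partitioned too. Each GPU holds only

1/N parameters
1/N gradients
1/N optimizer states

During training, parameters are all-gathered before the forward pass and gradients are reduce-scattered after the backward pass. As a result:

memory is O(model_size / N)

This is ZeRO's key breakthrough.

5. ZeRO's communication pattern

Forward: all-gather parameters. The parameters of each layer are collected before that layer is computed.
Backward: reduce-scatter gradients. The gradients are reduced directly into their shards, so each GPU keeps only its own part.

Compared with the traditional all-reduce of gradients, ZeRO uses less memory.

6. Experimental results

Scale: 400 GPUs.
Results: a 100B model is trainable, with 15 PFLOPS of throughput and super-linear speedup.

Compared with the state of the art:

Metric                  Improvement
Model size              8×
Training performance    10×

ZeRO was also used to train the 17B Turing-NLG model.

7. Historical significance

This paper practically defined the modern large-model training architecture. Many later techniques extend it:

ZeRO-Offload: extends GPU memory with CPU memory.
ZeRO-Infinity: extends it further to NVMe.
PyTorch FSDP (Fully Sharded Data Parallel): essentially an implementation of ZeRO Stage 3.

8. One-sentence summary

ZeRO's core contribution is to partition the optimizer states, gradients, and parameters, which eliminates the redundant memory of data parallelism and gives memory that scales as O(model_size / N), making trillion-parameter training possible.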
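
The per-GPU memory of the three stages described above can be checked with a small calculation. The following is a minimal sketch, not DeepSpeed code: it assumes mixed-precision Adam with 2 bytes per parameter for FP16 parameters, 2 bytes for FP16 gradients, and 12 bytes for FP32 optimizer states, and it counts model states only (no activations, buffers, or fragmentation). The function name model_state_bytes and the 7.5B-parameter, 64-GPU configuration are chosen here purely for illustration.

```python
# Sketch: per-GPU memory for model states under ZeRO-style partitioning.
# Assumption: mixed-precision Adam, so each parameter costs
#   2 bytes (FP16 parameter) + 2 bytes (FP16 gradient)
#   + 12 bytes (FP32 master parameter, momentum, variance).
BYTES_PARAMS = 2
BYTES_GRADS = 2
BYTES_OPTIMIZER = 12


def model_state_bytes(num_params, num_gpus, stage):
    """Return bytes of model state held by one GPU.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states partitioned
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients and parameters partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = BYTES_PARAMS * num_params
    grads = BYTES_GRADS * num_params
    optimizer = BYTES_OPTIMIZER * num_params

    # Each stage shards one more kind of state across the GPUs.
    if stage >= 1:
        optimizer /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus

    return params + grads + optimizer


if __name__ == "__main__":
    # Illustrative configuration: 7.5B parameters on 64 GPUs.
    for stage in range(4):
        gb = model_state_bytes(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
    # stage 0: 120.0 GB per GPU
    # stage 1: 31.4 GB per GPU
    # stage 2: 16.6 GB per GPU
    # stage 3: 1.9 GB per GPU
```

With only 4 GPUs, Stage 1 gives 7 bytes per parameter instead of 16, a reduction of about 2.3×. The 4× and 8× figures for Stages 1 and 2 are limits reached as the number of GPUs grows, while Stage 3 always scales as 1/N.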
