
Speeding Up PyTorch Model Training: Automatic Mixed Precision (AMP)

news 2025/7/12 19:55:03

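In deep learning, training speed and efficiency matter a great deal. To speed up training and reduce GPU memory usage (especially for larger, more complex models), PyTorch has shipped Automatic Mixed Precision (AMP) since version 1.6.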

A Brief Introduction to AMP

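AMP is a training technique that lets parts of training run in a numeric format narrower than 32-bit floating point (such as 16-bit floating point), which saves memory and speeds up training. PyTorch's AMP module automatically decides which operations can safely run in 16-bit precision and which must stay in 32-bit precision to preserve numerical stability and accuracy.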

Official documentation: https://pytorch.org/docs/stable/amp.html

Why Use AMP

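torch.FloatTensor (FP32) is the better choice in some contexts, and torch.HalfTensor (FP16) in others. FP16's advantages are lower GPU memory usage, faster training and inference computation, and better use of the Tensor Cores on CUDA devices. Its drawbacks are a narrow representable range and larger rounding error. Mixed-precision training keeps the benefits of FP16 while avoiding those drawbacks.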
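The range and rounding limits of FP16 mentioned above can be seen without a GPU. The sketch below is only an illustration: it uses the standard library's half-precision `struct` format to emulate FP16 storage, and `to_fp16` is a hypothetical helper, not part of PyTorch.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (inf on overflow)."""
    try:
        # The "e" format code packs a float as a 16-bit half-precision value.
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range;
        # hardware FP16 would produce infinity instead.
        return float("inf") if x > 0 else float("-inf")


# Narrow range: 65504 is the largest finite FP16 value.
print(to_fp16(65504.0))   # still representable
print(to_fp16(70000.0))   # overflows to inf

# Underflow: values far below the smallest FP16 subnormal become zero.
print(to_fp16(1e-8))

# Rounding: near 1.0 the spacing between FP16 values is about 0.001,
# so a small increment is lost entirely.
print(to_fp16(1.0001))
```

Small gradients behave like the `1e-8` case here, which is the problem loss scaling is meant to address.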

Two Core Components

PyTorch's AMP module is built around two core components: autocast and GradScaler.

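autocast: a context manager that automatically casts tensors to a suitable precision. Inside the context, operators that are safe in 16-bit floating point (where the device supports it) receive their inputs cast to 16 bits, which speeds up computation and reduces memory use; operators that need the extra range or precision keep running in 32-bit.
GradScaler: a class that scales gradients. In mixed-precision training, gradients can be so small that they underflow to zero in FP16. GradScaler counters this by multiplying the loss by a scale factor before backpropagation, then dividing the gradients by the same factor (unscaling) before the optimizer updates the weights. scaler.step skips the update when the unscaled gradients contain inf or NaN, and scaler.update adjusts the scale factor for the next iteration.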
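The arithmetic behind loss scaling can be sketched on a single number. This is a minimal emulation under stated assumptions, not GradScaler's implementation: `to_fp16` is a hypothetical helper that rounds through the standard library's half-precision format, `true_grad` is a made-up gradient value, and the scale factor is 2**16, which is GradScaler's documented default `init_scale`.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


SCALE = 65536.0   # 2**16, GradScaler's default init_scale
true_grad = 1e-8  # hypothetical gradient, below FP16's smallest subnormal

# Without scaling: stored in FP16, the gradient underflows to zero.
unscaled_fp16 = to_fp16(true_grad)

# With scaling: multiplying the loss by SCALE multiplies every gradient by
# SCALE (chain rule), which moves the value into FP16's representable range.
scaled_fp16 = to_fp16(true_grad * SCALE)

# Unscale in higher precision (Python floats are 64-bit) before the update.
recovered = scaled_fp16 / SCALE

print(f"FP16 without scaling: {unscaled_fp16}")
print(f"recovered after scale/unscale: {recovered:.3e}")
```

The recovered value is close to the original gradient, whereas the unscaled one carries no information at all.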

Code Example

import torch
import torch.nn as nn
import torch.optim as optim
from torch.amp import GradScaler, autocast
import time

torch.manual_seed(42)


# A simple Model
class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.linear1 = nn.Linear(10, 100)
        self.linear2 = nn.Linear(100, 10)

    def forward(self, x):
        x = torch.relu(self.linear1(x))
        x = self.linear2(x)
        return x


# init model
model = MLP().cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# GradScaler
scaler = GradScaler(device='cuda')

# random data
inputs = torch.randn(100, 10).cuda()
targets = torch.randint(0, 10, (100,)).cuda()

# train
for epoch in range(1):
    start_time = time.time()
    print(f"inputs dtype:{inputs.dtype}")

    # autocast
    with autocast('cuda'):
        outputs = model(inputs)
        print(f"outputs dtype:{outputs.dtype}")
        loss = criterion(outputs, targets)
        print(f"loss dtype:{loss.dtype}")

    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

    print(f"Epoch {epoch + 1}, Loss: {loss.item():.4f}")
    end_time = time.time()
    elapsed_time = end_time - start_time
    allocated_memory = torch.cuda.memory_allocated() / 1024**2
    reserved_memory = torch.cuda.memory_reserved() / 1024**2
    print(f"Single Batch, Single Epoch with AMP, Loss: {loss.item():.4f}")
    print(f"Time taken: {elapsed_time:.4f} seconds")
    print(f"Allocated memory: {allocated_memory:.2f} MB")
    print(f"Reserved memory: {reserved_memory:.2f} MB")

Output:
Time taken for epoch 1: 0.0116 seconds
Allocated memory: 20.64 MB
Reserved memory: 44.00 MB

Without AMP (faster):
Time taken for epoch 1: 0.0024 seconds
Allocated memory: 20.64 MB
Reserved memory: 44.00 MB

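The example above is a very small model (a network with only a couple of layers), so there is little computation to accelerate, and switching to FP16 produces no visible speedup. The measurement also covers a single iteration, so one-off startup costs and AMP's own casting and scaling overhead weigh heavily in the result. In addition, layers that cannot make effective use of Tensor Cores (for example custom operations or non-standard layers) limit the gain for the whole pipeline. That is why no computational improvement shows up here.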
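A single timed iteration is noisy, so one possible refinement is to discard a few warm-up iterations and report the median of several timed ones. The helper below is a hedged sketch, not something from the original experiment: `benchmark` and its `sync` parameter are hypothetical names, and the workload is a CPU stand-in. For GPU code one would pass `torch.cuda.synchronize` as `sync` so that queued kernels finish before the clock is read.

```python
import time
from statistics import median
from typing import Callable, Optional


def benchmark(step: Callable[[], None],
              warmup: int = 5,
              iters: int = 20,
              sync: Optional[Callable[[], None]] = None) -> float:
    """Return the median wall-clock seconds per call of `step`."""

    def timed_call() -> float:
        if sync is not None:
            sync()  # make sure earlier asynchronous work is finished
        start = time.perf_counter()
        step()
        if sync is not None:
            sync()  # wait for the work launched by step() itself
        return time.perf_counter() - start

    # Warm-up calls absorb one-off costs; their timings are thrown away.
    for _ in range(warmup):
        timed_call()
    return median(timed_call() for _ in range(iters))


# Stand-in workload; in practice `step` would wrap one training iteration.
calls = []
seconds = benchmark(lambda: calls.append(sum(range(1000))), warmup=2, iters=5)
print(f"median step time: {seconds:.6f} s")
```

Running the AMP and non-AMP training steps through the same helper, on a model large enough to keep the GPU busy, gives a fairer comparison than one cold iteration.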
