
[AI Frontier Trends] Generative AI Series: Learning Diffusers (1): Understanding Pipelines, Models, and Schedulers

news 2026/2/11 3:02:50

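Diffusers is designed to be a user-friendly and flexible toolbox for building diffusion systems tailored to your use case. At the core of the toolbox are models and schedulers. While DiffusionPipeline bundles these components together for convenience, you can also unbundle the pipeline and use the models and schedulers separately to create new diffusion systems.

In this tutorial, you will learn how to use models and schedulers to assemble a diffusion system for inference, starting with a basic pipeline and then progressing to the Stable Diffusion pipeline.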

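1. Deconstructing a basic Diffusion Model Pipeline

A pipeline is a quick and easy way to run a model for inference, requiring no more than four lines of code to generate an image: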

from diffusers import DDPMPipeline

ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256").to("cuda")
image = ddpm(num_inference_steps=25).images[0]
image

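That was easy, but how did the pipeline do it? Let's break the pipeline down and look at what happens under the hood.

In the example above, the pipeline contains a UNet2DModel model and a DDPMScheduler.

The pipeline denoises an image by taking random noise of the desired output size and passing it through the model several times. At each timestep, the model predicts the noise residual, and the scheduler uses it to predict a less noisy image. The pipeline repeats this process until it reaches the end of the specified number of inference steps.

To recreate the pipeline with the model and scheduler used separately, let's write our own denoising process.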


Load the model and scheduler:

from diffusers import DDPMScheduler, UNet2DModel

scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda")

Set the number of timesteps to run the denoising process for:

scheduler.set_timesteps(50)

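Setting the scheduler timesteps creates a tensor of evenly spaced elements, 50 in this example. Each element corresponds to a timestep at which the model denoises the image. When you create the denoising loop later, you will iterate over this tensor to denoise the image: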

scheduler.timesteps

tensor([980, 960, 940, 920, 900, 880, 860, 840, 820, 800, 780, 760, 740, 720,
        700, 680, 660, 640, 620, 600, 580, 560, 540, 520, 500, 480, 460, 440,
        420, 400, 380, 360, 340, 320, 300, 280, 260, 240, 220, 200, 180, 160,
        140, 120, 100,  80,  60,  40,  20,   0])
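The evenly spaced values in this tensor can be reproduced with a few lines of plain Python. This is only a minimal sketch: it assumes the scheduler was trained with 1000 timesteps (a value inferred from the printed tensor, not read from the checkpoint config) and that the spacing comes from integer division. The variable names are illustrative.

```python
# Assumption: 1000 training timesteps, inferred from the printed tensor.
num_train_timesteps = 1000
num_inference_steps = 50

# Distance between two consecutive inference timesteps.
step_ratio = num_train_timesteps // num_inference_steps

# Build 0, 20, 40, ... and reverse it so denoising starts at the noisiest step.
timesteps = [i * step_ratio for i in range(num_inference_steps)][::-1]

print(timesteps[:5], "...", timesteps[-3:])
```

Asking for fewer inference steps simply widens the gap between consecutive timesteps, which is why fewer steps run faster but give the scheduler a coarser path from noise to image.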

Create some random noise with the same shape as the desired output:

import torch

sample_size = model.config.sample_size
noise = torch.randn((1, 3, sample_size, sample_size)).to("cuda")

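Now write a loop to iterate over the timesteps. At each timestep, the model performs a UNet2DModel.forward() pass and returns the noisy residual. The scheduler's step() method takes the noisy residual, the timestep, and the input, and predicts the image at the previous timestep. This output becomes the next input to the model in the denoising loop, which repeats until it reaches the end of the timesteps array.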

input = noise

for t in scheduler.timesteps:
    with torch.no_grad():
        noisy_residual = model(input, t).sample
    previous_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
    input = previous_noisy_sample
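To see the call pattern of this loop without downloading any weights, here is a toy sketch in plain Python. Both toy_model and ToyScheduler are hypothetical stand-ins invented for illustration. They do not implement DDPM, and only mimic how the residual flows from the model into the scheduler's step and back into the next iteration.

```python
# Hypothetical stand-ins: neither implements DDPM, they only mimic the call pattern.
TARGET = 0.5  # pretend "clean" value that the toy model steers towards


def toy_model(sample, t):
    """Return a toy 'noise residual': how far the sample is from the clean value."""
    return sample - TARGET


class ToyScheduler:
    def __init__(self, num_steps):
        # Count down, as scheduler.timesteps does.
        self.timesteps = list(range(num_steps - 1, -1, -1))

    def step(self, residual, t, sample):
        # Remove a growing fraction of the residual; at t == 0 all of it is removed.
        return sample - residual / (t + 1)


toy_scheduler = ToyScheduler(num_steps=5)
sample = 3.0  # stands in for the initial random noise

for t in toy_scheduler.timesteps:
    residual = toy_model(sample, t)
    sample = toy_scheduler.step(residual, t, sample)
    print(f"t={t}: sample={sample:.3f}")
```

The shape of the loop is what carries over to real diffusion systems: predict a residual, hand it to the scheduler with the current timestep and sample, and feed the result back in.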

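This is the entire denoising process, and you can use the same pattern to write any diffusion system.

The last step is to convert the denoised output into an image: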

from PIL import Image
import numpy as np

image = (input / 2 + 0.5).clamp(0, 1)
image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
image = Image.fromarray((image * 255).round().astype("uint8"))
image
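The arithmetic in this conversion is easier to follow on single values. The sketch below assumes the model output lies roughly in [-1, 1], which is what the `/ 2 + 0.5` shift implies. The helper to_uint8 is a hypothetical name that mirrors the tensor operations on one scalar at a time.

```python
def to_uint8(value):
    """Map a model output in [-1, 1] to an 8-bit pixel value in [0, 255]."""
    scaled = value / 2 + 0.5               # shift [-1, 1] to [0, 1]
    clamped = min(max(scaled, 0.0), 1.0)   # same role as .clamp(0, 1)
    return round(clamped * 255)            # same role as (image * 255).round()


for v in (-1.5, -1.0, 0.0, 1.0):
    print(v, "->", to_uint8(v))
```

Values outside the expected range, such as -1.5, are clipped rather than wrapped around, which is why the clamp comes before the cast to uint8.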

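In the next section, you will put your skills to the test and break down the more complex Stable Diffusion pipeline. The steps are more or less the same. You will initialize the necessary components and set the number of timesteps to create the timesteps array. The denoising loop iterates over this array, and at each timestep the model outputs a noisy residual, which the scheduler uses to predict a less noisy image at the previous timestep. This process repeats until it reaches the end of the timesteps array. Let's try it out!

2. Deconstructing the Stable Diffusion pipeline

Stable Diffusion is a text-to-image latent diffusion model. It is called a latent diffusion model because it works with a lower-dimensional representation of the image instead of the actual pixel space, which makes it more memory efficient. The encoder compresses the image into a smaller representation, and a decoder converts the compressed representation back into an image. For text-to-image models, you also need a tokenizer and an encoder to generate text embeddings. From the previous example, you already know you need a UNet model and a scheduler.

As you can see, this is already more complex than the DDPM pipeline, which contains only a UNet model. The Stable Diffusion model has three separate pretrained models: the VAE, the UNet, and the text encoder.

💡 Read How does Stable Diffusion work? for more details about the VAE, UNet, and text encoder models.

Now that you know what the Stable Diffusion pipeline needs, load all of these components with the from_pretrained() method. You can find them in the pretrained CompVis/stable-diffusion-v1-4 checkpoint, where each component is stored in a separate subfolder: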

from PIL import Image
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

Instead of the default PNDMScheduler, swap in the UniPCMultistepScheduler to see how easy it is to plug in a different scheduler:

from diffusers import UniPCMultistepScheduler

scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")

To speed up inference, move the models to a GPU since, unlike the scheduler, they have trainable weights:

torch_device = "cuda"
vae.to(torch_device)
text_encoder.to(torch_device)
unet.to(torch_device)

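2.1 Create text embeddings

The next step is to tokenize the text to generate embeddings. The text is used to condition the UNet model and steer the diffusion process toward something that resembles the input prompt.

💡 Note: the guidance_scale parameter determines how much weight should be given to the prompt when generating an image.

Feel free to choose any prompt you like if you want to generate something else!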

prompt = ["a photograph of an astronaut riding a horse"]
height = 512  # default height of Stable Diffusion
width = 512  # default width of Stable Diffusion
num_inference_steps = 25  # Number of denoising steps
guidance_scale = 7.5  # Scale for classifier-free guidance
generator = torch.manual_seed(0)  # Seed generator to create the initial latent noise
batch_size = len(prompt)

Tokenize the text and generate the embeddings from the prompt:

text_input = tokenizer(
    prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
)

with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

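You also need to generate the unconditional text embeddings, which are the embeddings for the padding token. These need to have the same shape (batch_size and seq_length) as the conditional text_embeddings: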

max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]
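Why the empty prompt ends up with the same seq_length as the real prompt is easiest to see with a toy tokenizer. The sketch below is purely illustrative. toy_tokenize is a hypothetical helper, the word ids are made up, and the maximum length of 77 and the special token ids are assumed typical CLIP values rather than values read from the checkpoint.

```python
# Assumptions: typical CLIP values, not read from the checkpoint.
MAX_LENGTH = 77
BOS_ID, EOS_ID = 49406, 49407
PAD_ID = EOS_ID  # assumed: padding reuses the end-of-text token


def toy_tokenize(words, max_length=MAX_LENGTH):
    """Hypothetical tokenizer: fake word ids, then truncate and pad to max_length."""
    word_ids = [1000 + len(w) for w in words]  # made-up ids, for illustration only
    ids = [BOS_ID] + word_ids[: max_length - 2] + [EOS_ID]
    return ids + [PAD_ID] * (max_length - len(ids))


cond_ids = toy_tokenize("a photograph of an astronaut riding a horse".split())
uncond_ids = toy_tokenize([])  # the empty prompt

print(len(cond_ids), len(uncond_ids))
```

Because both sequences are padded to the same length, their embeddings share one shape and can be stacked into a single batch.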

Let's concatenate the conditional and unconditional embeddings into a batch to avoid doing two forward passes:

text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

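2.2 Create random noise

Next, generate some initial random noise as a starting point for the diffusion process. This is the latent representation of the image, and it will be gradually denoised. At this point, the latent image is smaller than the final image size, but that's okay because the model will transform it into the final 512x512 image dimensions later.

💡 Note: the height and width are divided by 8 because the vae model has 3 down-sampling layers. You can check by running the following: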

2 ** (len(vae.config.block_out_channels) - 1) == 8

latents = torch.randn(
    # unet.config.in_channels replaces the deprecated unet.in_channels attribute
    (batch_size, unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)
latents = latents.to(torch_device)
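The latent shape can be worked out by hand before allocating any tensor. The sketch below uses assumed config values for Stable Diffusion v1 (four VAE blocks and four latent channels). Check vae.config.block_out_channels and unet.config.in_channels on your own checkpoint to confirm them.

```python
# Assumed config values for Stable Diffusion v1; verify against vae.config and unet.config.
block_out_channels = [128, 256, 512, 512]  # assumed vae.config.block_out_channels
latent_channels = 4                        # assumed unet.config.in_channels

batch_size, height, width = 1, 512, 512

# All but one of the VAE blocks halve the spatial resolution.
downsample_factor = 2 ** (len(block_out_channels) - 1)

latent_shape = (
    batch_size,
    latent_channels,
    height // downsample_factor,
    width // downsample_factor,
)
print(downsample_factor, latent_shape)
```

Under these assumptions the UNet works on 4 x 64 x 64 values per image instead of 3 x 512 x 512 pixels, which is where the memory savings of latent diffusion come from.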

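2.3 Denoise the image

Start by scaling the input with the initial noise distribution, sigma (the noise scale value), which is required for improved schedulers like UniPCMultistepScheduler: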

latents = latents * scheduler.init_noise_sigma

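The last step is to create the denoising loop that progressively transforms the pure noise in latents into an image described by your prompt. Remember, the denoising loop needs to do three things:

1. Set the scheduler's timesteps to use during denoising.
2. Iterate over the timesteps.
3. At each timestep, call the UNet model to predict the noise residual and pass it to the scheduler to compute the previous noisy sample.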

from tqdm.auto import tqdm

scheduler.set_timesteps(num_inference_steps)

for t in tqdm(scheduler.timesteps):
    # Expand the latents when doing classifier-free guidance to avoid two forward passes.
    latent_model_input = torch.cat([latents] * 2)

    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

    # Predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # Perform guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # Compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample
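The guidance line in this loop is plain arithmetic, so its effect can be checked on a few made-up numbers. The sketch below applies the same formula to Python lists instead of tensors. apply_guidance is a hypothetical helper and the two predictions are invented values.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance, applied element by element on plain lists."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_text)]


noise_uncond = [0.0, 1.0, -2.0]  # made-up unconditional prediction
noise_text = [1.0, 1.0, 0.0]     # made-up text-conditioned prediction

for scale in (0.0, 1.0, 7.5):
    print(scale, apply_guidance(noise_uncond, noise_text, scale))
```

A scale of 0 returns the unconditional prediction and a scale of 1 returns the text-conditioned one. Larger values such as 7.5 extrapolate past the conditioned prediction, which pushes the sample harder toward the prompt.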

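2.4 Decode the image

The final step is to use the vae to decode the latent representation into an image and get the decoded output with sample: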

# Scale and decode the image latents with the vae
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample

Lastly, convert the image to a PIL.Image to see your generated image!

image = (image / 2 + 0.5).clamp(0, 1)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
pil_images = [Image.fromarray(image) for image in images]
pil_images[0]
