
Deriving the DDPM Optimization Objective

article 2025/11/4 15:25:21

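Deriving the DDPM Optimization Objective
- **1. Problem definition**
- **2. Optimization objective: maximizing the log-likelihood**
- **3. Decomposing the variational lower bound**
- **4. Key step: simplifying the KL divergence terms**
  - **(a) Closed form of the posterior $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$**
  - **(b) Parameterizing the mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$**
  - **(c) Closed form of the KL divergence**
- **5. Final optimization objective**
- **Key takeaway**
Supplement (the optimization strategy)
- Term-by-term reading of the simplified VLB and the optimization strategy
  - **1. Reconstruction term**
  - **2. Denoising matching term**
  - **3. Prior matching term**
- **Overall optimization strategy**
  - **1. Core objective**
  - **2. Simplifications used in training**
  - **3. Schematic of the terms**
  - **4. Why does this optimization work?**
- **Summary**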

The optimization objective of DDPM (Denoising Diffusion Probabilistic Models) is derived from the variational lower bound (VLB), also known as the evidence lower bound (ELBO). The derivation proceeds as follows:

1. Problem definition

Goal: learn a model $p_\theta(\mathbf{x}_0)$ that approximates the true data distribution $q(\mathbf{x}_0)$ .
Forward process (diffusion process):
Fix a variance schedule $\beta_1, \dots, \beta_T$ with $\beta_t \in (0, 1)$ and define the Markov chain:
$q(\mathbf{x}_{1:T} | \mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t | \mathbf{x}_{t-1}), \quad q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I})$
Reverse process (generative process):
Learn a parameterized Markov chain that starts from the fixed prior $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ :
$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t), \quad p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$
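To make the forward process concrete, here is a minimal NumPy sketch that applies $q(\mathbf{x}_t | \mathbf{x}_{t-1})$ step by step. The schedule ($T = 1000$, $\beta_t$ linear from 1e-4 to 0.02) and the helper name `forward_chain` are assumptions for illustration; the text does not fix them.

```python
import numpy as np

def forward_chain(x0, betas, rng):
    """Apply q(x_t | x_{t-1}) once per entry of betas and return the final x_T."""
    x = np.asarray(x0, dtype=float)
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

# Assumed schedule (not given in the text): T = 1000, beta_t linear in [1e-4, 0.02].
betas = np.linspace(1e-4, 0.02, 1000)
rng = np.random.default_rng(0)
x0 = np.full(20_000, 3.0)   # 20,000 independent copies of the scalar x_0 = 3
xT = forward_chain(x0, betas, rng)
print("mean of x_T:", xT.mean(), " variance of x_T:", xT.var())
```

Under this schedule the empirical mean and variance of `xT` should land close to 0 and 1, which is why the reverse chain can start from a standard Gaussian.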

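2. Optimization objective: maximizing the log-likelihood

The goal is to maximize $\log p_\theta(\mathbf{x}_0)$ . Evaluating it directly requires integrating over all latents $\mathbf{x}_{1:T}$ , which is intractable, so we maximize its variational lower bound instead:
$\log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] \triangleq \text{VLB}$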
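The inequality is an application of Jensen's inequality to the concave logarithm. A short worked version, using only the definitions already introduced:

$\begin{align*} \log p_\theta(\mathbf{x}_0) &= \log \int p_\theta(\mathbf{x}_{0:T}) \, d\mathbf{x}_{1:T} \\ &= \log \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \left[ \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] \\ &\geq \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] \end{align*}$

The gap between the two sides is $D_\text{KL} \left( q(\mathbf{x}_{1:T} | \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{1:T} | \mathbf{x}_0) \right)$ , so the bound is tight only when the model's posterior over the latents coincides with the forward process.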

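3. Decomposing the variational lower bound

Expand the VLB using the factorizations of $p_\theta$ and $q$ :
$\begin{align*} \text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] \\ &= \mathbb{E}_{q} \left[ \log p(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right] \end{align*}$
By the Markov property and Bayes' rule, for $t \geq 2$ we have $q(\mathbf{x}_t | \mathbf{x}_{t-1}) = q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0) = \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \, q(\mathbf{x}_t | \mathbf{x}_0)}{q(\mathbf{x}_{t-1} | \mathbf{x}_0)}$ . Substituting this into the terms with $t \geq 2$ , the ratios $q(\mathbf{x}_{t-1} | \mathbf{x}_0) / q(\mathbf{x}_t | \mathbf{x}_0)$ telescope to $q(\mathbf{x}_1 | \mathbf{x}_0) / q(\mathbf{x}_T | \mathbf{x}_0)$ , and the factor $q(\mathbf{x}_1 | \mathbf{x}_0)$ cancels the denominator of the $t = 1$ term:
$\text{VLB} = \mathbb{E}_{q} \left[ \log p_\theta(\mathbf{x}_0 | \mathbf{x}_1) + \sum_{t=2}^T \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)}{q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)} + \log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T | \mathbf{x}_0)} \right]$
Writing the expected log-ratios as KL divergences gives the final form:
$\boxed{\text{VLB} = \mathbb{E}_{q} \left[ \log p_\theta(\mathbf{x}_0 | \mathbf{x}_1) \right] - \sum_{t=2}^T \mathbb{E}_{q} \left[ D_\text{KL} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) \right) \right] - D_\text{KL} \left( q(\mathbf{x}_T | \mathbf{x}_0) \parallel p(\mathbf{x}_T) \right)}$

For the full step-by-step derivation, see the companion article "Deriving the DDPM Optimization Objective (Detailed)".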

4. Key step: simplifying the KL divergence terms

(a) Closed form of the posterior $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$

By Bayes' rule, and because every factor involved is Gaussian:
$q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})$
where:
$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t, \quad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$
(with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$ )
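As a sanity check on the closed form, the sketch below compares it with a direct product-of-Gaussians computation of $q(\mathbf{x}_t | \mathbf{x}_{t-1}) \, q(\mathbf{x}_{t-1} | \mathbf{x}_0)$ in one dimension, using the marginal $q(\mathbf{x}_{t-1} | \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0, (1 - \bar{\alpha}_{t-1}) \mathbf{I})$ . The function names and the linear schedule are assumptions made for this illustration.

```python
import numpy as np

def posterior_closed_form(x0, xt, t, betas):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from the closed form (t >= 2)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    b, a = betas[t - 1], alphas[t - 1]
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    mean = (np.sqrt(ab_prev) * b / (1 - ab_t)) * x0 \
         + (np.sqrt(a) * (1 - ab_prev) / (1 - ab_t)) * xt
    var = (1 - ab_prev) / (1 - ab_t) * b
    return mean, var

def posterior_by_bayes(x0, xt, t, betas):
    """Same posterior, obtained by multiplying two Gaussians in x_{t-1}."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    b, a = betas[t - 1], alphas[t - 1]
    ab_prev = alpha_bars[t - 2]
    # Precisions add: a/b from q(x_t | x_{t-1}), 1/(1 - ab_prev) from q(x_{t-1} | x_0).
    var = 1.0 / (a / b + 1.0 / (1 - ab_prev))
    mean = var * (np.sqrt(a) * xt / b + np.sqrt(ab_prev) * x0 / (1 - ab_prev))
    return mean, var

betas = np.linspace(1e-4, 0.02, 1000)   # assumed schedule
closed = posterior_closed_form(0.7, -1.2, 500, betas)
bayes = posterior_by_bayes(0.7, -1.2, 500, betas)
print(closed, bayes)
```

The two routes should agree to floating-point precision for any $t \geq 2$ .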

(b) Parameterizing the mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$

Let $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$ .
To match the posterior, replace the unknown $\mathbf{x}_0$ in $\tilde{\boldsymbol{\mu}}_t$ with the estimate implied by a predicted noise $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ :
$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \tilde{\boldsymbol{\mu}}_t \left( \mathbf{x}_t, \frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}} \right)$
Substituting the closed form of $\tilde{\boldsymbol{\mu}}_t$ gives:
$\boldsymbol{\mu}_\theta = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)$
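The substitution can be checked numerically. In this sketch `eps_hat` stands in for the network output $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ ; it is just a random vector here, and the schedule and function names are assumptions for illustration.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # assumed schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def mu_tilde(xt, x0, t):
    """Posterior mean of q(x_{t-1} | x_t, x_0), valid for t >= 2."""
    b, a = betas[t - 1], alphas[t - 1]
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    return (np.sqrt(ab_prev) * b * x0 + np.sqrt(a) * (1 - ab_prev) * xt) / (1 - ab_t)

def mu_from_eps(xt, eps_hat, t):
    """Simplified form: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t)."""
    b, a, ab_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
    return (xt - b / np.sqrt(1 - ab_t) * eps_hat) / np.sqrt(a)

rng = np.random.default_rng(0)
t = 300
xt = rng.standard_normal(5)
eps_hat = rng.standard_normal(5)        # placeholder for eps_theta(x_t, t)
# x_0 estimate implied by the predicted noise
x0_hat = (xt - np.sqrt(1 - alpha_bars[t - 1]) * eps_hat) / np.sqrt(alpha_bars[t - 1])

via_posterior = mu_tilde(xt, x0_hat, t)
via_simplified = mu_from_eps(xt, eps_hat, t)
print(np.max(np.abs(via_posterior - via_simplified)))
```

Both expressions give the same vector, so a network only needs to output a noise estimate to define the reverse mean.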

(c) Closed form of the KL divergence

The KL divergence between two $d$ -dimensional Gaussians is:
$D_\text{KL}(\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) \parallel \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)) = \frac{1}{2} \left[ \log \frac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|} - d + \text{tr}(\boldsymbol{\Sigma}_2^{-1} \boldsymbol{\Sigma}_1) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^\top \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) \right]$
Assume $\boldsymbol{\Sigma}_\theta = \sigma_t^2 \mathbf{I}$ with a fixed, untrained $\sigma_t^2$ (common choices are $\sigma_t^2 = \beta_t$ or $\tilde{\beta}_t$ ). Every term except the squared mean difference is then independent of $\theta$ and is absorbed into a constant $C$ :
$D_\text{KL} = \frac{1}{2\sigma_t^2} \| \tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta \|^2 + C$
Substitute the expressions for $\boldsymbol{\mu}_\theta$ and $\tilde{\boldsymbol{\mu}}_t$ , writing $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}$ so that $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}) / \sqrt{\bar{\alpha}_t}$ :
$\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta = \frac{\beta_t}{\sqrt{\alpha_t} \sqrt{1 - \bar{\alpha}_t}} \left( \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \boldsymbol{\epsilon} \right)$
Hence:
$\boxed{\mathbb{E}_q \left[ D_\text{KL} \right] = \frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \, \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2 \right] + C}$
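The weight in front of the squared noise error can be verified against the general Gaussian KL formula. This is a minimal sketch assuming $\sigma_t^2 = \tilde{\beta}_t$ (so both Gaussians share a variance and $C = 0$ ), a linear schedule, and a random vector `eps_hat` in place of the network output; `gaussian_kl` is a helper written for this example.

```python
import numpy as np

def gaussian_kl(mu1, var1, mu2, var2):
    """KL( N(mu1, var1 * I) || N(mu2, var2 * I) ) for isotropic d-dimensional Gaussians."""
    d = mu1.size
    return 0.5 * (d * np.log(var2 / var1) - d + d * var1 / var2
                  + np.sum((mu2 - mu1) ** 2) / var2)

betas = np.linspace(1e-4, 0.02, 1000)   # assumed schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 400
b, a = betas[t - 1], alphas[t - 1]
ab, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
sigma2 = (1 - ab_prev) / (1 - ab) * b   # sigma_t^2 = beta_tilde_t

rng = np.random.default_rng(0)
xt, eps, eps_hat = rng.standard_normal((3, 16))
# Posterior mean written in terms of the true noise, model mean in terms of eps_hat.
mu_q = (xt - b / np.sqrt(1 - ab) * eps) / np.sqrt(a)
mu_p = (xt - b / np.sqrt(1 - ab) * eps_hat) / np.sqrt(a)

kl = gaussian_kl(mu_q, sigma2, mu_p, sigma2)
weight = b ** 2 / (2 * sigma2 * a * (1 - ab))
weighted_mse = weight * np.sum((eps - eps_hat) ** 2)
print(kl, weighted_mse)
```

The two printed numbers should match, confirming that the KL term is a positively weighted squared error on the noise.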

5. Final optimization objective

Dropping the constants and the per-step weights, the simplified DDPM objective is:
$\mathcal{L}_\text{simple}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2 \right]$
where:

$t \sim \text{Uniform}\{1, \dots, T\}$
$\mathbf{x}_0 \sim q(\mathbf{x}_0)$
$\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
$\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}$
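A Monte Carlo estimate of $\mathcal{L}_\text{simple}$ on one batch might look like the sketch below. `eps_model` is a hypothetical callable `(x_t, t) -> predicted noise` standing in for the network; the schedule and the two toy models are assumptions chosen so the result can be checked by hand.

```python
import numpy as np

def simple_loss(eps_model, x0, alpha_bars, rng):
    """Estimate L_simple on a batch, drawing one random timestep per example."""
    n, T = x0.shape[0], alpha_bars.shape[0]
    t = rng.integers(1, T + 1, size=n)                 # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)                # eps ~ N(0, I)
    ab = alpha_bars[t - 1].reshape((n,) + (1,) * (x0.ndim - 1))
    xt = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps    # closed-form sample of x_t
    residual = (eps - eps_model(xt, t)).reshape(n, -1)
    return float(np.mean(np.sum(residual ** 2, axis=1)))

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))   # assumed schedule
rng = np.random.default_rng(0)
x0 = np.zeros((4096, 8))    # toy data: every example is the zero vector

# A model that always predicts zero noise, and an oracle that is exact for x0 = 0.
zero_model = lambda xt, t: np.zeros_like(xt)
oracle = lambda xt, t: xt / np.sqrt(1.0 - alpha_bars[t - 1])[:, None]

loss_zero = simple_loss(zero_model, x0, alpha_bars, rng)
loss_oracle = simple_loss(oracle, x0, alpha_bars, rng)
print(loss_zero, loss_oracle)
```

The zero predictor's loss is close to the data dimension (the expected squared norm of $\boldsymbol{\epsilon}$ ), while the oracle's loss is numerically zero. In real training `eps_model` would be a neural network and this loss would be minimized by gradient descent.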

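Key takeaway

DDPM trains a network $\boldsymbol{\epsilon}_\theta$ to predict the noise that was added to a sample by minimizing the mean squared error of that prediction; generation then runs the learned reverse chain starting from pure noise. Up to a per-timestep scaling, this objective matches the gradient of the log-density (the score) of the noised data, which ties DDPM closely to score-based generative models.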
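For illustration, generation with the reverse process $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\boldsymbol{\mu}_\theta, \sigma_t^2 \mathbf{I})$ can be sketched as follows. This assumes $\sigma_t^2 = \beta_t$ , a linear schedule, and a hypothetical `eps_model(x, t)` callable; the toy model used here is the exact noise predictor for data drawn from $\mathcal{N}(\mathbf{0}, \mathbf{I})$ , not a trained network.

```python
import numpy as np

def sample(eps_model, shape, betas, rng):
    """Run the reverse chain x_T -> x_0 using mu_theta and sigma_t^2 = beta_t."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    T = betas.shape[0]
    x = rng.standard_normal(shape)                     # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        b, a, ab = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        mean = (x - b / np.sqrt(1.0 - ab) * eps_model(x, t)) / np.sqrt(a)
        # No noise is added on the final step (t = 1).
        noise = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(b) * noise
    return x

betas = np.linspace(1e-4, 0.02, 1000)                  # assumed schedule
alpha_bars = np.cumprod(1.0 - betas)
# If the data are N(0, I), then E[eps | x_t] = sqrt(1 - alpha_bar_t) * x_t exactly.
toy_model = lambda x, t: np.sqrt(1.0 - alpha_bars[t - 1]) * x
samples = sample(toy_model, (20_000,), betas, np.random.default_rng(0))
print(samples.mean(), samples.var())
```

With the toy model the samples should again look like a standard Gaussian, which is the data distribution the toy predictor was derived for.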

Supplement (the optimization strategy)

Term-by-term reading of the simplified VLB and the optimization strategy

The final VLB is:
$\begin{align*} \text{VLB} = & \;\mathbb{E}_{q(\mathbf{x}_1 | \mathbf{x}_0)} \Big[ \log p_\theta(\mathbf{x}_0 | \mathbf{x}_1) \Big] \\ & - \sum_{t=2}^T \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \left[ D_{\text{KL}} \Big( q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) \Big) \right] \\ & - D_{\text{KL}} \Big( q(\mathbf{x}_T | \mathbf{x}_0) \parallel p(\mathbf{x}_T) \Big) \end{align*}$

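1. Reconstruction term

$\mathbb{E}_{q(\mathbf{x}_1 | \mathbf{x}_0)} \Big[ \log p_\theta(\mathbf{x}_0 | \mathbf{x}_1) \Big]$

Meaning:
Measures how well the original data $\mathbf{x}_0$ is reconstructed from the first noisy sample $\mathbf{x}_1$ .
- $q(\mathbf{x}_1 | \mathbf{x}_0)$ : the first step of the forward process ( $\mathbf{x}_0 \to \mathbf{x}_1$ )
- $p_\theta(\mathbf{x}_0 | \mathbf{x}_1)$ : the last step of the reverse (generative) process ( $\mathbf{x}_1 \to \mathbf{x}_0$ )
Interpretation:
Evaluates the model's ability to reconstruct data at the lowest noise level ( $t = 1$ ).
For image data this term is usually modeled with a discrete distribution (e.g. per-pixel cross-entropy) or a continuous one (e.g. a Gaussian likelihood).
Role in optimization:
Ensures that the final generation step outputs a high-quality sample. In practice its influence on training is small, since it is a single term and the noise level at $t = 1$ is low.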

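2. Denoising matching term

$\sum_{t=2}^T \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \left[ D_{\text{KL}} \Big( q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) \Big) \right]$

Meaning:
This is the core term. It requires the reverse process $p_\theta$ to match the forward-process posterior $q$ .
- $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$ : the true posterior of $\mathbf{x}_{t-1}$ given $\mathbf{x}_0$ and $\mathbf{x}_t$ (a Gaussian with a closed form)
- $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$ : the parameterized reverse model (predicted by a neural network)
Interpretation:
At every step $t$ , the model's distribution over $\mathbf{x}_{t-1}$ given $\mathbf{x}_t$ is pushed toward the optimal denoising distribution, which is available in closed form only because $\mathbf{x}_0$ is known during training.
Key result:
Up to a positive weight and a $\theta$ -independent constant, this KL divergence reduces to the mean squared error of noise prediction:
$D_{\text{KL}} \propto \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta (\mathbf{x}_t, t) \|^2$
where $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$ and $\boldsymbol{\epsilon}_\theta$ is the noise-prediction network.
Role in optimization:
Dominates training: it contributes $T - 1$ of the $T + 1$ terms in the bound (over 99% of them when, for example, $T = 1000$ ).
Turns a hard distribution-matching problem into plain supervised regression: train the network $\boldsymbol{\epsilon}_\theta$ to predict the added noise $\boldsymbol{\epsilon}$ .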

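3. Prior matching term

$D_{\text{KL}} \Big( q(\mathbf{x}_T | \mathbf{x}_0) \parallel p(\mathbf{x}_T) \Big)$

Meaning:
Measures how close the final distribution of the forward process, $q(\mathbf{x}_T | \mathbf{x}_0)$ , is to the chosen prior $p(\mathbf{x}_T)$ .
- $q(\mathbf{x}_T | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_T; \sqrt{\bar{\alpha}_T} \mathbf{x}_0, (1-\bar{\alpha}_T)\mathbf{I})$
- $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ (standard Gaussian)
Interpretation:
Ensures that, at the end of the forward process, the noised data is close to a standard Gaussian (the starting point of generation).
Role in optimization:
- When $\bar{\alpha}_T \approx 0$ (which DDPM schedules are designed to satisfy), this term approaches 0, since $q(\mathbf{x}_T|\mathbf{x}_0) \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$ .
- It is usually dropped in training, because it does not depend on the trainable parameters $\theta$ and its value is tiny.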
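To get a feel for how small the prior matching term is, the sketch below evaluates it in closed form. The linear schedule with $T = 1000$ and the all-ones `x0` of dimension 3072 (the size of a 32x32x3 image) are assumptions for illustration only.

```python
import numpy as np

def prior_kl(x0, alpha_bar_T):
    """KL( N(sqrt(ab_T) * x0, (1 - ab_T) * I) || N(0, I) ), summed over dimensions."""
    mean = np.sqrt(alpha_bar_T) * x0
    var = 1.0 - alpha_bar_T
    # Per-dimension KL between N(mean, var) and N(0, 1).
    return float(0.5 * np.sum(var + mean ** 2 - 1.0 - np.log(var)))

betas = np.linspace(1e-4, 0.02, 1000)   # assumed schedule
alpha_bar_T = float(np.prod(1.0 - betas))
x0 = np.ones(3072)                      # hypothetical data point
kl_T = prior_kl(x0, alpha_bar_T)
print("alpha_bar_T:", alpha_bar_T, " prior KL in nats:", kl_T)
```

Under these assumptions $\bar{\alpha}_T$ is far below 1e-3 and the total KL over all 3072 dimensions stays well under one nat, which is consistent with dropping the term.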

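Overall optimization strategy

1. Core objective

Maximizing the lower bound (VLB) on $\log p_\theta(\mathbf{x}_0)$ is equivalent to minimizing:
$\mathcal{L}_{\text{VLB}} = -\text{VLB} = \mathcal{L}_0 + \sum_{t=2}^T \mathcal{L}_{t-1} + \mathcal{L}_T$
where:

$\mathcal{L}_0 = -\mathbb{E}_q[\log p_\theta(\mathbf{x}_0|\mathbf{x}_1)]$ (reconstruction loss)
$\mathcal{L}_{t-1} = \mathbb{E}_q[D_{\text{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))]$ for $2 \leq t \leq T$ (denoising matching loss)
$\mathcal{L}_T = D_{\text{KL}}(q(\mathbf{x}_T|\mathbf{x}_0) \parallel p(\mathbf{x}_T))$ (prior matching loss)

2. Simplifications used in training

Drop $\mathcal{L}_T$ :
Since $\bar{\alpha}_T \approx 0$ , this term is close to 0, and it does not depend on $\theta$ .
Simplify $\mathcal{L}_0$ :
Replace the discrete likelihood (e.g. for image data) with a mean-squared-error loss.
Convert the dominant terms $\mathcal{L}_{t-1}$ :
The derivation in section 4 turns each KL divergence into a noise-prediction loss:
$\mathcal{L}_{t-1} \propto \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2$
Uniform timestep sampling:
To stabilize training, sample $t \sim \text{Uniform}\{1, \dots, T\}$ and drop the per-step weights:
$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}} \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2$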
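To see what dropping the weights changes, one can tabulate the coefficient $\beta_t^2 / (2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t))$ that multiplies the squared noise error in each KL term. The sketch assumes $\sigma_t^2 = \beta_t$ and a linear schedule with $T = 1000$ ; `kl_weight` is a helper written for this example.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def kl_weight(t, sigma2):
    """Coefficient of ||eps - eps_theta||^2 in the step-t KL term."""
    i = t - 1
    return betas[i] ** 2 / (2.0 * sigma2 * alphas[i] * (1.0 - alpha_bars[i]))

ts = np.arange(2, T + 1)                # the KL terms run over t = 2, ..., T
weights = np.array([kl_weight(t, betas[t - 1]) for t in ts])
print("weight at t = 2:", weights[0], " weight at t = T:", weights[-1])
```

Under these assumptions the weight at $t = 2$ is more than an order of magnitude larger than at $t = T$ , so replacing all weights with 1 shifts emphasis away from the lowest-noise steps compared with the exact bound.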

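3. Schematic of the terms

Generative (reverse) process: x_T ≈ N(0,I) → [pθ(x_{T-1}|x_T)] → ... → [pθ(x_0|x_1)] → x_0
                              ↑ match          ↑ match          ↑ match
Forward process             : x_0 → [q(x_1|x_0)] → x_1 → ... → [q(x_T|x_{T-1})] → x_T
                              reconstruction term ↑      denoising matching terms ↑      prior matching term ↑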

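4. Why does this optimization work?

Decoupling complexity:
Matching a high-dimensional data distribution is broken into $T$ simple Gaussian matching tasks.
Progressive refinement:
The timestep $t$ sets the noise level, so one network learns denoising at every level, from the easy high-noise steps to the hard low-noise steps. During training the levels are sampled uniformly rather than visited in order.
Closed-form guidance:
The analytic posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ of the forward process provides an exact training target at every step.
Implicit score matching:
Noise prediction is equivalent to learning the gradient field of the noised data's log-density ( $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \approx -\sqrt{1 - \bar{\alpha}_t} \, \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$ ).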
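The link between noise prediction and the score can be checked on a toy case where everything is known analytically. The sketch assumes scalar data $\mathbf{x}_0 \sim \mathcal{N}(0, s^2)$ with $s = 2$ and a single noise level $\bar{\alpha}_t = 0.5$ ; both values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, alpha_bar = 200_000, 2.0, 0.5
x0 = s * rng.standard_normal(n)
eps = rng.standard_normal(n)
xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Least-squares slope of eps on x_t. Everything is jointly Gaussian here, so the
# best linear predictor is the exact conditional mean E[eps | x_t] = slope * x_t.
slope = np.dot(eps, xt) / np.dot(xt, xt)

# The marginal is q(x_t) = N(0, var_t), whose score is -x_t / var_t.
var_t = alpha_bar * s ** 2 + (1.0 - alpha_bar)
score_slope = -1.0 / var_t
predicted_slope = -np.sqrt(1.0 - alpha_bar) * score_slope
print(slope, predicted_slope)
```

The fitted slope agrees with $-\sqrt{1 - \bar{\alpha}_t}$ times the score slope up to Monte Carlo error, matching the relation stated above for this Gaussian example.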

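Summary

Term	Meaning	Role in optimization	Handling in practice
Reconstruction term	Reconstruct $\mathbf{x}_0$ from $\mathbf{x}_1$	Ensures final output quality	Kept, or replaced with MSE
Denoising matching term	Match the reverse model to the forward posterior	Core training objective ( $T - 1$ of the $T + 1$ terms)	Converted to a noise-prediction loss
Prior matching term	Align $\mathbf{x}_T$ with a standard Gaussian	Ensures a correct starting point for generation	Dropped (value ≈ 0)

DDPM's key optimization idea:
Recast generative modeling as a sequence of noise-prediction tasks by:

using the closed-form forward posterior as the training target
turning the KL divergences into mean-squared-error losses
sampling timesteps uniformly to simplify training
This lets diffusion models train stably on high-dimensional data (such as images and audio), making them a core framework for generative AI.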
