当前位置：首页 > news >正文

以EM算法为例介绍坐标上升（Coordinate Ascent）算法：中英双语

news 2025/11/10 20:48:40

中文版

什么是 Coordinate Ascent 算法？

Coordinate Ascent（坐标上升）是一种优化算法，它通过在每次迭代时优化一个变量（或一个坐标），并保持其他变量不变，逐步逼近最优解。与坐标下降（Coordinate Descent）算法相对，坐标上升通常用于求解带有多个变量的优化问题，特别是在优化函数在每个维度上都是可分的情况下。

在每次迭代中，Coordinate Ascent算法通过选择一个变量来最大化该变量对目标函数的贡献，然后再选择下一个变量进行优化。它通过反复更新每个坐标来寻找最优解。虽然这个过程可能不会全局收敛，但它在很多实际问题中表现得很好，特别是在目标函数较为简单且每个坐标可以独立优化时。

以EM算法为例介绍Coordinate Ascent

EM（Expectation-Maximization）算法是一种常见的优化算法，常用于统计学和机器学习中，尤其是当数据中有隐变量时。EM算法的基本思路是通过迭代求解期望步骤（E步）和最大化步骤（M步）来优化参数。

我们可以将EM算法视为一种坐标上升方法。每一轮的E步和M步都可以视为分别优化不同的“坐标”，其中E步优化期望函数，M步最大化该期望函数。

EM算法中的坐标上升

假设我们有一个隐变量模型，其目标是最大化对数似然函数：
$\mathcal{L}(\theta) = \log P(X, Z | \theta)$
其中：

( $X$ ) 是观察数据，
( $Z$ ) 是隐变量，
( $\theta$ ) 是模型的参数。

由于隐变量 ( $Z$ ) 是不可观测的，直接求解对数似然函数是困难的。因此，EM算法通过迭代的方法来解决这个问题。

E步（Expectation step）：
在E步中，我们计算隐变量的期望值，即给定当前参数估计 ( $\theta^{(t)}$ ) 下，隐变量 ( $Z$ ) 的条件概率分布：
$Q(\theta, \theta^{(t)}) = \mathbb{E}_{Z | X, \theta^{(t)}} \left[ \log P(X, Z | \theta) \right]$
这是一个坐标上升的过程，因为我们通过固定当前的参数 ( $\theta^{(t)}$ ) 来最大化隐变量的期望。
M步（Maximization step）：
在M步中，我们最大化E步得到的期望函数 ( $Q(\theta, \theta^{(t)})$ ) 来更新参数 ( $\theta$ )：
$\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)})$
这也可以视为一次坐标上升，因为我们固定隐变量的期望并优化参数 ( $\theta$ )。

在EM算法中，E步和M步交替进行，每一轮都像是在“沿着不同的坐标轴上升”，通过迭代更新参数和隐变量的估计值。

数学公式

假设我们有如下的对数似然函数：
$\mathcal{L}(\theta) = \log P(X | \theta) = \log \sum_{Z} P(X, Z | \theta)$

由于隐变量 ( Z ) 无法直接观测，我们在E步中计算隐变量的条件期望：
$Q(\theta, \theta^{(t)}) = \mathbb{E}_{Z | X, \theta^{(t)}} \left[ \log P(X, Z | \theta) \right]$

然后在M步中最大化这个期望：
$\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)})$

这两个步骤交替进行，直到收敛为止。

例子：高斯混合模型（Gaussian Mixture Model, GMM）

高斯混合模型是一种常见的隐变量模型，它假设数据是由多个高斯分布组成的，每个数据点属于某个高斯分布，但我们不知道每个数据点具体属于哪个分布。EM算法可以用来估计模型参数。

假设数据 ( $\{x_1, x_2, \dots, x_n\}$ ) 来自两个高斯分布，每个高斯分布的参数为 ( $(\mu_1, \sigma_1^2)$ ) 和 ( $(\mu_2, \sigma_2^2)$ )。我们要估计这些参数。

E步：计算每个数据点属于每个高斯分布的概率（即隐变量的期望）：

给定当前参数 ( $(\mu_1^{(t)}, \sigma_1^{2(t)}, \mu_2^{(t)}, \sigma_2^{2(t)})$ )，我们计算每个数据点 ( $x_i$ ) 属于第一个高斯分布的概率（称为责任）：
$\gamma_{i1} = \frac{\pi_1 \mathcal{N}(x_i | \mu_1^{(t)}, \sigma_1^{2(t)})}{\pi_1 \mathcal{N}(x_i | \mu_1^{(t)}, \sigma_1^{2(t)}) + \pi_2 \mathcal{N}(x_i | \mu_2^{(t)}, \sigma_2^{2(t)})}$
其中， ( $\pi_1$ ) 和 ( $\pi_2$ ) 是高斯分布的混合系数， ( $\mathcal{N}(x_i | \mu, \sigma^2)$ ) 是高斯分布的概率密度函数。

M步：最大化E步得到的期望函数以更新参数：

根据责任 ( $\gamma_{i1}$ ) 和 ( $\gamma_{i2}$ )，我们更新高斯分布的均值和方差：
$\mu_1^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma_{i1} x_i}{\sum_{i=1}^{n} \gamma_{i1}}$
$\sigma_1^{2(t+1)} = \frac{\sum_{i=1}^{n} \gamma_{i1} (x_i - \mu_1^{(t+1)})^2}{\sum_{i=1}^{n} \gamma_{i1}}$

类似地，我们也更新 ( $\mu_2$ ) 和 ( $\sigma_2^2$ )，并更新混合系数 ( $\pi_1$ ) 和 ( $\pi_2$ )。

Python代码实现

以下是使用EM算法估计GMM参数的简单Python代码示例：
具体解析见文末

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm# 生成数据：两个高斯分布混合的样本
np.random.seed(42)
n_samples = 500
data1 = np.random.normal(0, 1, n_samples//2)  # 均值为0，标准差为1
data2 = np.random.normal(5, 1, n_samples//2)  # 均值为5，标准差为1
X = np.concatenate([data1, data2])# 初始化参数
pi = np.array([0.5, 0.5])  # 混合系数
mu = np.array([0, 5])  # 均值
sigma = np.array([1, 1])  # 方差# EM算法迭代
n_iterations = 100
for iteration in range(n_iterations):# E步：计算责任（每个点属于每个分布的概率）resp1 = pi[0] * norm.pdf(X, mu[0], sigma[0])resp2 = pi[1] * norm.pdf(X, mu[1], sigma[1])total_resp = resp1 + resp2resp1 /= total_respresp2 /= total_resp# M步：最大化Q函数，更新参数pi[0] = np.mean(resp1)pi[1] = np.mean(resp2)mu[0] = np.sum(resp1 * X) / np.sum(resp1)mu[1] = np.sum(resp2 * X) / np.sum(resp2)sigma[0] = np.sqrt(np.sum(resp1 * (X - mu[0])**2) / np.sum(resp1))sigma[1] = np.sqrt(np.sum(resp2 * (X - mu[1])**2) / np.sum(resp2))# 打印每次迭代的结果print(f"Iteration {iteration+1}:")print(f"  pi: {pi}")print(f"  mu: {mu}")print(f"  sigma: {sigma}")# 可视化结果
x_vals = np.linspace(-5, 10, 1000)
y_vals = pi[0] * norm.pdf(x_vals, mu[0], sigma[0]) + pi[1] * norm.pdf(x_vals, mu[1], sigma[1])
plt.hist(X, bins=30, density=True, alpha=0.6, color='g')
plt.plot(x_vals, y_vals, color='r')
plt.title('EM Algorithm for GMM')
plt.show()

总结

Coordinate Ascent（坐标上升）算法是一种通过逐个优化坐标（变量）来最大化目标函数的方法。在EM算法中，E步和M步分别优化隐变量的期望和模型参数的最大化，这可以看作是一种坐标上升的过程。通过反复交替执行这两个步骤，EM算法最终能够逼近最优的模型参数。

英文版

What is the Coordinate Ascent Algorithm?

Coordinate Ascent is an optimization algorithm that improves one variable (or coordinate) at a time, while keeping the other variables fixed, gradually approaching the optimal solution. In contrast to Coordinate Descent, which focuses on minimizing the objective function, Coordinate Ascent is used to maximize the objective function, typically in situations where the optimization function is separable across dimensions.

In each iteration, the Coordinate Ascent algorithm selects a variable and maximizes the contribution of that variable to the objective function, then proceeds to the next variable for optimization. The process involves iteratively updating each coordinate to find the optimal solution. While this process may not always converge globally, it performs well in many practical problems, particularly when the objective function is simple, and each coordinate can be independently optimized.

Coordinate Ascent in the Context of the EM Algorithm

The Expectation-Maximization (EM) algorithm is a common optimization method used in statistics and machine learning, especially when dealing with hidden variables in the data. The basic idea of the EM algorithm is to iteratively solve the Expectation (E) step and Maximization (M) step to optimize the model parameters.

We can view the EM algorithm as a form of Coordinate Ascent. Each round of the E-step and M-step can be seen as optimizing different “coordinates,” with the E-step optimizing the expectation function, and the M-step maximizing that expectation.

Coordinate Ascent in the EM Algorithm

Assume we have a latent variable model with the goal of maximizing the log-likelihood function:
$\mathcal{L}(\theta) = \log P(X, Z | \theta)$
where:

( $X$ ) represents observed data,
( $Z$ ) represents latent variables,
( $\theta$ ) represents model parameters.

Since the latent variables ( $Z$ ) are unobserved, directly maximizing the log-likelihood function is difficult. Thus, the EM algorithm solves this problem iteratively.

E-step (Expectation step):
In the E-step, we compute the expected value of the latent variables, i.e., given the current parameter estimate ( $\theta^{(t)}$ ), the conditional probability distribution of the latent variables ( $Z$ ):
$Q(\theta, \theta^{(t)}) = \mathbb{E}_{Z | X, \theta^{(t)}} \left[ \log P(X, Z | \theta) \right]$
This is a coordinate ascent process because we maximize the expectation of the latent variables while holding the current parameters ( $\theta^{(t)}$ ) fixed.
M-step (Maximization step):
In the M-step, we maximize the expected function ( $Q(\theta, \theta^{(t)})$ ) computed in the E-step to update the model parameters ( $\theta$ ):
$\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)})$
This is also a coordinate ascent process because we fix the expectation of the latent variables and optimize the parameters ( $\theta$ ).

The E-step and M-step alternate, with each step resembling an ascent along different coordinates. Through iterative updates of the parameters and the latent variables’ estimates, the EM algorithm moves towards the optimal solution.

Mathematical Formulas

Assume we have the following log-likelihood function:
$\mathcal{L}(\theta) = \log P(X | \theta) = \log \sum_{Z} P(X, Z | \theta)$

Since the latent variables ( Z ) are unobserved, in the E-step, we compute the conditional expectation of the latent variables:
$Q(\theta, \theta^{(t)}) = \mathbb{E}_{Z | X, \theta^{(t)}} \left[ \log P(X, Z | \theta) \right]$

Then, in the M-step, we maximize this expectation:
$\theta^{(t+1)} = \arg\max_\theta Q(\theta, \theta^{(t)})$

These two steps alternate until convergence.

Example: Gaussian Mixture Model (GMM)

A Gaussian Mixture Model (GMM) is a common latent variable model that assumes the data is generated from a mixture of several Gaussian distributions. Each data point belongs to one of the Gaussian distributions, but we don’t know which one. The EM algorithm can be used to estimate the model parameters.

Assume that the data ( $\{x_1, x_2, \dots, x_n\}$ ) comes from two Gaussian distributions, with parameters ( $(\mu_1, \sigma_1^2)$ ) and ( $(\mu_2, \sigma_2^2)$ ), and we aim to estimate these parameters.

E-step: Calculate the probability (responsibility) of each data point belonging to each Gaussian distribution:

Given the current parameters ( $(\mu_1^{(t)}, \sigma_1^{2(t)}, \mu_2^{(t)}, \sigma_2^{2(t)})$ ), we calculate the probability that data point ( $x_i$ ) belongs to the first Gaussian distribution (called responsibility):
$\gamma_{i1} = \frac{\pi_1 \mathcal{N}(x_i | \mu_1^{(t)}, \sigma_1^{2(t)})}{\pi_1 \mathcal{N}(x_i | \mu_1^{(t)}, \sigma_1^{2(t)}) + \pi_2 \mathcal{N}(x_i | \mu_2^{(t)}, \sigma_2^{2(t)})}$
where ( $\pi_1$ ) and ( $\pi_2$ ) are the mixing coefficients, and ( $\mathcal{N}(x_i | \mu, \sigma^2)$ ) is the Gaussian distribution’s probability density function.

M-step: Maximize the expected function obtained in the E-step to update the parameters:

Using the responsibilities ( $\gamma_{i1}$ ) and ( $\gamma_{i2}$ ), we update the Gaussian distribution’s mean and variance:
$\mu_1^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma_{i1} x_i}{\sum_{i=1}^{n} \gamma_{i1}}$
$\sigma_1^{2(t+1)} = \frac{\sum_{i=1}^{n} \gamma_{i1} (x_i - \mu_1^{(t+1)})^2}{\sum_{i=1}^{n} \gamma_{i1}}$

Similarly, we update ( $\mu_2$ ), ( $\sigma_2^2$ ), and the mixing coefficients ( $\pi_1$ ) and ( $\pi_2$ ).

Python Code Implementation

Here is a simple Python implementation of the EM algorithm for estimating GMM parameters:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm# Generate data: samples from a mixture of two Gaussian distributions
np.random.seed(42)
n_samples = 500
data1 = np.random.normal(0, 1, n_samples//2)  # Mean 0, Std 1
data2 = np.random.normal(5, 1, n_samples//2)  # Mean 5, Std 1
X = np.concatenate([data1, data2])# Initialize parameters
pi = np.array([0.5, 0.5])  # Mixing coefficients
mu = np.array([0, 5])  # Means
sigma = np.array([1, 1])  # Variances# EM algorithm iterations
n_iterations = 100
for iteration in range(n_iterations):# E-step: Calculate responsibilities (probabilities of each point belonging to each distribution)resp1 = pi[0] * norm.pdf(X, mu[0], sigma[0])resp2 = pi[1] * norm.pdf(X, mu[1], sigma[1])total_resp = resp1 + resp2resp1 /= total_respresp2 /= total_resp# M-step: Maximize Q function, update parameterspi[0] = np.mean(resp1)pi[1] = np.mean(resp2)mu[0] = np.sum(resp1 * X) / np.sum(resp1)mu[1] = np.sum(resp2 * X) / np.sum(resp2)sigma[0] = np.sqrt(np.sum(resp1 * (X - mu[0])**2) / np.sum(resp1))sigma[1] = np.sqrt(np.sum(resp2 * (X - mu[1])**2) / np.sum(resp2))# Print results for each iterationprint(f"Iteration {iteration+1}:")print(f"  pi: {pi}")print(f"  mu: {mu}")print(f"  sigma: {sigma}")# Visualize the results
x_vals = np.linspace(-5, 10, 1000)
y_vals = pi[0] * norm.pdf(x_vals, mu[0], sigma[0]) + pi[1] * norm.pdf(x_vals, mu[1], sigma[1])
plt.hist(X, bins=30, density=True, alpha=0.6, color='g')
plt.plot(x_vals, y_vals, color='r')
plt.title('EM Algorithm for GMM')plt.show()

Summary

The Coordinate Ascent method is a useful iterative approach to optimization, where only one parameter (coordinate) is updated at a time while others remain fixed. This method is central to algorithms like Expectation-Maximization (EM), where the E-step and M-step alternately optimize different coordinates (latent variables and model parameters). The Gaussian Mixture Model (GMM) example illustrates how the EM algorithm can be used for model estimation in latent variable models.

python代码中哪一步体现数据的似然最大化？

在EM算法的M步中，“最大化对数似然函数” 这一目标体现在通过更新参数（混合系数 ( $\pi_k$ )、均值 ( $\mu_k$ ) 和方差 ( $\sigma_k$ )）来最大化 每个数据点的概率，从而最大化整个模型的 数据似然。下面详细解析这个过程，并明确如何体现最大化似然。

1. 对数似然函数的目标

对数似然函数表示的是数据给定当前模型参数下的概率。在高斯混合模型（GMM）中，假设我们有一个由 ( $K$ ) 个高斯分布组成的混合模型，数据集 ( $\{x_1, x_2, \dots, x_N\}$ ) 由这些高斯分布生成。

模型的对数似然函数是：

$\mathcal{L}(\theta) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \mathcal{N}(x_i; \mu_k, \sigma_k^2) \right)$

其中：

( $\pi_k$ ) 是第 ( $k$ ) 个高斯分布的混合系数（即该高斯分布在总模型中的权重），
( $\mathcal{N}(x_i; \mu_k, \sigma_k^2)$ ) 是第 ( $k$ ) 个高斯分布对数据点 ( $x_i$ ) 的概率密度，
( $\mu_k$ ) 是第 ( $k$ ) 个高斯分布的均值，
( $\sigma_k^2$ ) 是第 ( $k$ ) 个高斯分布的方差。

对数似然函数衡量的是在给定数据的情况下，当前模型参数 ( $\pi_k, \mu_k, \sigma_k$ ) 下数据的 整体似然。我们的目标是最大化这个对数似然函数，找到能够最好地解释数据的模型参数。

2. M步的作用：最大化似然函数

在EM算法中，M步的目标是 最大化对数似然函数，即通过更新参数使得模型的似然值（数据的概率）最大化。如何做到这一点呢？

2.1. E步：计算责任（期望）

在E步中，我们计算每个数据点 ( $x_i$ ) 属于每个高斯分布 ( $k$ ) 的责任，即数据点 ( $x_i$ ) 属于第 ( $k$ ) 个分布的后验概率：

$\gamma_{ik} = P(z_i = k | x_i) = \frac{\pi_k \mathcal{N}(x_i; \mu_k, \sigma_k^2)}{\sum_{k'=1}^{K} \pi_{k'} \mathcal{N}(x_i; \mu_{k'}, \sigma_{k'}^2)}$

这里，( $\gamma_{ik}$ ) 表示数据点 ( $x_i$ ) 属于高斯分布 ( $k$ ) 的概率（责任）。

2.2. M步：更新参数

E步计算得到的责任 ( $\gamma_{ik}$ ) 用来在M步中更新模型的参数。通过更新混合系数 ( $\pi_k$ )、均值 ( $\mu_k$ ) 和方差 ( $\sigma_k$ )，我们让每个数据点的似然（由它属于每个高斯分布的概率加权计算的）最大化。

更新混合系数 ( $\pi_k$ )：混合系数表示每个高斯分布的权重。更新规则为：

$\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}$

这里的更新表示的是：每个高斯分布的权重是所有数据点在该分布下的责任（概率）的平均值。

更新均值 ( $\mu_k$ )：均值是高斯分布的中心，通过对数据点的加权平均来更新。更新规则为：

$\mu_k = \frac{\sum_{i=1}^{N} \gamma_{ik} x_i}{\sum_{i=1}^{N} \gamma_{ik}}$

这里的加权平均意味着：每个数据点 ( $x_i$ ) 对均值的贡献是它属于该高斯分布的概率 ( $\gamma_{ik}$ )。

更新方差 ( $\sigma_k^2$ )：方差衡量高斯分布的扩散程度，同样通过加权平方差来更新。更新规则为：

$\sigma_k^2 = \frac{\sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)^2}{\sum_{i=1}^{N} \gamma_{ik}}$

同样地，这里加权平方差意味着：每个数据点 ( $x_i$ ) 对方差的贡献是它属于该高斯分布的概率 ( $\gamma_{ik}$ )。

3. 最大化似然：如何体现

最大化对数似然函数是通过更新参数（( $\pi_k, \mu_k, \sigma_k$ )）使得每个数据点的似然最大化来实现的。在M步中：

更新混合系数 ( $\pi_k$ )：通过更新每个高斯分布的权重，使得每个数据点属于各个高斯分布的责任概率最大化。
更新均值 ( $\mu_k$ )：通过更新每个高斯分布的均值，使得数据点在该分布下的概率密度最大化。
更新方差 ( $\sigma_k$ )：通过更新每个高斯分布的方差，使得数据点在该分布下的概率密度最大化。

最终，EM算法会通过迭代E步和M步，逐步优化参数，直到对数似然函数收敛，即模型的似然最大化。

M步通过计算每个数据点在E步中得到的责任，更新模型的参数（混合系数 ( $\pi_k$ )、均值 ( $\mu_k$ ) 和方差 ( $\sigma_k$ )），使得数据在当前模型下的似然最大化。这是通过加权平均和加权平方差的方式来进行的，每次更新都朝着使数据在当前模型下最可能的方向调整参数。

后记

2024年12月28日14点12分于上海，在GPT4o大模型辅助下完成。

中文版