The Evolution of Gradient Descent Optimization: From SGD to Adam

Optimization algorithms are at the heart of deep learning training. From basic gradient descent to adaptive learning-rate methods, they have evolved continuously.

Batch Gradient Descent (BGD)

BGD computes the gradient over the entire training set at every step:

import numpy as np

def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0

    for epoch in range(epochs):
        # Compute the gradient using all samples
        y_pred = np.dot(X, weights) + bias
        error = y_pred - y

        dw = (1 / n_samples) * np.dot(X.T, error)
        db = (1 / n_samples) * np.sum(error)

        weights -= lr * dw
        bias -= lr * db

    return weights, bias

Drawback: every iteration must sweep the entire dataset, which is computationally expensive and memory-hungry.
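
As a quick sanity check, here is a minimal usage sketch on synthetic linear-regression data (the data shape, noise level, and hyperparameters are illustrative assumptions, not part of the original):

# Minimal usage sketch: synthetic data; shapes, noise, and hyperparameters are illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                         # 200 samples, 3 features
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.01 * rng.normal(size=200)

w, b = batch_gradient_descent(X, y, lr=0.1, epochs=500)
print(w, b)  # should end up close to true_w and true_b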

Stochastic Gradient Descent (SGD)

SGD updates the parameters using a single sample at a time:

def sgd(X, y, lr=0.01, epochs=100):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0

    for epoch in range(epochs):
        for i in range(n_samples):
            y_pred = np.dot(X[i], weights) + bias
            error = y_pred - y[i]

            weights -= lr * error * X[i]
            bias -= lr * error

    return weights, bias

Advantages: updates are cheap and fast, and the noise can help escape shallow local minima.
Disadvantages: high gradient variance makes convergence noisy and unstable.

Mini-batch Gradient Descent (Mini-batch SGD)

Mini-batch SGD is a compromise between BGD and SGD:

def mini_batch_sgd(X, y, batch_size=32, lr=0.01, epochs=100):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0

    for epoch in range(epochs):
        # Shuffle the data each epoch so batches differ between passes
        indices = np.random.permutation(n_samples)
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        for i in range(0, n_samples, batch_size):
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]

            y_pred = np.dot(X_batch, weights) + bias
            error = y_pred - y_batch

            # Divide by the actual batch length so the last (possibly smaller) batch is weighted correctly
            weights -= lr * np.dot(X_batch.T, error) / len(X_batch)
            bias -= lr * np.sum(error) / len(X_batch)

    return weights, bias
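
For completeness, the same synthetic data from the BGD sketch can be fed through the mini-batch version (the batch size, learning rate, and epoch count are again illustrative assumptions):

# Minimal usage sketch: reuses the synthetic X, y from the BGD example above
w_mb, b_mb = mini_batch_sgd(X, y, batch_size=32, lr=0.05, epochs=200)
print(w_mb, b_mb)  # reaches a similar solution while each update touches only 32 samples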

Momentum

Momentum accelerates convergence by adding a velocity term that accumulates past gradients:

class SGDWithMomentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = None

    def update(self, params, grads):
        if self.velocity is None:
            self.velocity = np.zeros_like(params)

        self.velocity = self.momentum * self.velocity - self.lr * grads
        params += self.velocity
        return params
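
From here on, the optimizers expose a stateful update(params, grads) interface instead of a complete training function. Below is a minimal sketch of how such an optimizer can be driven with mini-batch gradients (the loop structure is an illustrative assumption and reuses the synthetic X, y from the earlier sketch):

# Illustrative driver for update(params, grads)-style optimizers; not from the original post
def train_with_optimizer(X, y, optimizer_w, optimizer_b, batch_size=32, epochs=100):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = np.zeros(1)  # kept as an array so in-place updates work

    for epoch in range(epochs):
        indices = np.random.permutation(n_samples)
        for i in range(0, n_samples, batch_size):
            batch = indices[i:i+batch_size]
            error = X[batch] @ weights + bias - y[batch]
            dw = X[batch].T @ error / len(batch)
            db = np.array([error.mean()])
            weights = optimizer_w.update(weights, dw)
            bias = optimizer_b.update(bias, db)

    return weights, bias

# One optimizer instance per parameter, since each instance keeps its own velocity state
w_m, b_m = train_with_optimizer(X, y, SGDWithMomentum(lr=0.01), SGDWithMomentum(lr=0.01))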

AdaGrad

AdaGrad adapts the learning rate of each parameter individually, scaling it by the accumulated squared gradients:

class AdaGrad:
    def __init__(self, lr=0.01, eps=1e-8):
        self.lr = lr
        self.eps = eps
        self.cache = None

    def update(self, params, grads):
        if self.cache is None:
            self.cache = np.zeros_like(params)

        self.cache += grads ** 2
        params -= self.lr * grads / (np.sqrt(self.cache) + self.eps)
        return params

Drawback: the effective learning rate decreases monotonically, which can stall training prematurely.
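
A tiny numerical illustration of that decay (the constant gradient of 1.0 is an artificial assumption chosen purely to expose the shrinking step size):

# With a constant gradient, AdaGrad's effective step shrinks like lr / sqrt(t)
opt = AdaGrad(lr=0.1)
x = np.array([5.0])
for step in range(1, 6):
    prev = x.copy()
    x = opt.update(x, np.array([1.0]))
    print(step, (prev - x)[0])  # roughly 0.100, 0.071, 0.058, 0.050, 0.045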

RMSProp

RMSProp replaces AdaGrad's cumulative sum with an exponential moving average, which keeps the learning rate from decaying too quickly:

class RMSProp:
    def __init__(self, lr=0.001, decay=0.9, eps=1e-8):
        self.lr = lr
        self.decay = decay
        self.eps = eps
        self.cache = None

    def update(self, params, grads):
        if self.cache is None:
            self.cache = np.zeros_like(params)

        self.cache = self.decay * self.cache + (1 - self.decay) * grads ** 2
        params -= self.lr * grads / (np.sqrt(self.cache) + self.eps)
        return params

Adam

Adam combines the strengths of Momentum (first-moment estimate) and RMSProp (second-moment estimate):

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None  # first-moment estimate
        self.v = None  # second-moment estimate
        self.t = 0     # time step

    def update(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)

        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * grads ** 2

        # Bias correction
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        params -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return params
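
As a rough side-by-side sketch, the three adaptive optimizers defined above can be compared on a toy one-dimensional quadratic (the objective, learning rates, and step count are illustrative assumptions):

# Minimize f(x) = x^2 starting from x = 5; the gradient is 2x
for name, opt in [("AdaGrad", AdaGrad(lr=0.5)),
                  ("RMSProp", RMSProp(lr=0.1)),
                  ("Adam", Adam(lr=0.1))]:
    x = np.array([5.0])
    for _ in range(200):
        x = opt.update(x, 2 * x)  # gradient of x^2
    print(name, x)                # each should approach 0, at its own rate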

Learning Rate Scheduling

A well-chosen learning rate schedule is critical to good training results:

import torch
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, ReduceLROnPlateau

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Step decay: multiply the learning rate by gamma every step_size epochs
scheduler_step = StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine annealing: decay the learning rate along a cosine curve over T_max epochs
scheduler_cosine = CosineAnnealingLR(optimizer, T_max=100)

# Reduce on plateau: shrink the learning rate when the monitored metric stops improving
scheduler_plateau = ReduceLROnPlateau(optimizer, mode='min', patience=10)
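
The snippet above only constructs the schedulers; the sketch below shows where scheduler.step() is typically called in a training loop (the synthetic tensors and loss are illustrative assumptions, and normally only one scheduler would be attached to an optimizer):

# Illustrative training loop showing scheduler placement (synthetic data)
X_t = torch.randn(256, 10)
y_t = torch.randint(0, 2, (256,))
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X_t), y_t)
    loss.backward()
    optimizer.step()

    scheduler_step.step()             # StepLR / CosineAnnealingLR: step once per epoch
    # scheduler_plateau.step(loss)    # ReduceLROnPlateau expects the monitored metric instead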

Optimizer Selection Guidelines

| Optimizer | Typical use case | Pros | Cons |
| --- | --- | --- | --- |
| SGD + Momentum | CV tasks where peak accuracy matters | Strong generalization | Needs careful tuning |
| Adam | NLP tasks, rapid prototyping | Fast, robust convergence | May fail to converge |
| AdamW | Transformer models | Correct (decoupled) weight decay | Decay coefficient needs tuning |
| RAdam | Unstable early training | Built-in adaptive warmup | Slightly more computation |
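
Since the table points to AdamW for Transformer models, here is a one-line PyTorch sketch (the weight_decay value is an illustrative assumption):

# AdamW decouples weight decay from the adaptive gradient update
optimizer_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)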

Summary

Optimization algorithms have evolved from plain SGD to the adaptive Adam family, and each has its place. Adam has become the default choice thanks to its adaptive learning rates and fast convergence, while SGD with Momentum remains popular when the best generalization performance is the goal. Pairing either with a sensible learning rate schedule can further improve training.
