The Evolution of Gradient Descent Optimizers: From SGD to Adam
Optimization algorithms are at the heart of deep learning training, and they have evolved continuously from plain gradient descent to adaptive learning rate methods.
Batch Gradient Descent (BGD)
BGD computes the gradient from the entire training set on every update:
```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0
    for epoch in range(epochs):
        # Predictions and residuals over the full dataset
        y_pred = np.dot(X, weights) + bias
        error = y_pred - y
        # Gradients of the mean squared error w.r.t. weights and bias
        dw = (1 / n_samples) * np.dot(X.T, error)
        db = (1 / n_samples) * np.sum(error)
        weights -= lr * dw
        bias -= lr * db
    return weights, bias
```
Drawback: every iteration sweeps the entire dataset, which is computationally expensive and memory hungry.
Stochastic Gradient Descent (SGD)
SGD updates the parameters from a single sample at a time:
```python
def sgd(X, y, lr=0.01, epochs=100):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0
    for epoch in range(epochs):
        for i in range(n_samples):
            y_pred = np.dot(X[i], weights) + bias
            error = y_pred - y[i]
            weights -= lr * error * X[i]
            bias -= lr * error
    return weights, bias
```
Advantages: fast updates and some ability to escape local minima.
Drawbacks: high gradient variance and unstable convergence.
Mini-batch Gradient Descent (Mini-batch SGD)
Mini-batch SGD is the compromise between BGD and SGD, updating on small random batches:
```python
def mini_batch_sgd(X, y, batch_size=32, lr=0.01, epochs=100):
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0
    for epoch in range(epochs):
        # Reshuffle each epoch so the batches change between epochs
        indices = np.random.permutation(n_samples)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for i in range(0, n_samples, batch_size):
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]
            y_pred = np.dot(X_batch, weights) + bias
            error = y_pred - y_batch
            # Normalize by the actual batch length: the last batch may be smaller than batch_size
            weights -= lr * np.dot(X_batch.T, error) / len(X_batch)
            bias -= lr * np.sum(error) / len(X_batch)
    return weights, bias
```
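As a quick sanity check, the three variants can be run side by side on synthetic linear data. The data, seed, and hyperparameters below are illustrative assumptions, not part of the original post:

```python
# Fit y = X @ true_w + true_b with each variant and report the training MSE
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w, true_b = np.array([1.0, -2.0, 0.5, 3.0, 0.0]), 4.0
y = X @ true_w + true_b + 0.01 * rng.normal(size=1000)

for fn in (batch_gradient_descent, sgd, mini_batch_sgd):
    w, b = fn(X, y, lr=0.01, epochs=500)
    mse = np.mean((X @ w + b - y) ** 2)
    print(f"{fn.__name__}: mse={mse:.6f}")
```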
Momentum
Momentum accelerates convergence by adding a velocity term that accumulates past gradients:
```python
class SGDWithMomentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = None

    def update(self, params, grads):
        if self.velocity is None:
            self.velocity = np.zeros_like(params)
        # Velocity is an exponentially decaying accumulation of past gradient steps
        self.velocity = self.momentum * self.velocity - self.lr * grads
        params += self.velocity
        return params
```
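This class, like the ones that follow, exposes an update(params, grads) method meant to be called once per gradient step. A toy usage sketch, where the quadratic objective and hyperparameters are assumptions for illustration:

```python
# Minimize f(w) = ||w||^2, whose gradient is 2 * w
opt = SGDWithMomentum(lr=0.1, momentum=0.9)
w = np.array([5.0, -3.0])
for step in range(100):
    grads = 2 * w
    w = opt.update(w, grads)
print(w)  # oscillates but converges toward the minimum at the origin
```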
AdaGrad
AdaGrad adapts the learning rate of each parameter individually, scaling it by the history of that parameter's squared gradients:
```python
class AdaGrad:
    def __init__(self, lr=0.01, eps=1e-8):
        self.lr = lr
        self.eps = eps
        self.cache = None

    def update(self, params, grads):
        if self.cache is None:
            self.cache = np.zeros_like(params)
        # The cache is a running sum of squared gradients and can only grow
        self.cache += grads ** 2
        params -= self.lr * grads / (np.sqrt(self.cache) + self.eps)
        return params
```
Drawback: because the accumulated sum only grows, the effective learning rate decreases monotonically and training can stall too early (see the numerical comparison after the RMSProp code below).
RMSProp
RMSProp replaces AdaGrad's unbounded sum with an exponential moving average, which keeps the learning rate from decaying too quickly:
```python
class RMSProp:
    def __init__(self, lr=0.001, decay=0.9, eps=1e-8):
        self.lr = lr
        self.decay = decay
        self.eps = eps
        self.cache = None

    def update(self, params, grads):
        if self.cache is None:
            self.cache = np.zeros_like(params)
        # Exponential moving average of squared gradients instead of an unbounded sum
        self.cache = self.decay * self.cache + (1 - self.decay) * grads ** 2
        params -= self.lr * grads / (np.sqrt(self.cache) + self.eps)
        return params
```
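The difference is easy to see numerically. In the sketch below (a constant-gradient setup assumed purely for illustration), AdaGrad's per-step update keeps shrinking because its cache only grows, while RMSProp's update settles near the base learning rate:

```python
# Feed a constant gradient to both optimizers and print the size of each update
ada, rms = AdaGrad(lr=0.1), RMSProp(lr=0.1, decay=0.9)
w_ada, w_rms = np.zeros(1), np.zeros(1)
grad = np.ones(1)
for t in range(1, 1001):
    prev_ada, prev_rms = w_ada.copy(), w_rms.copy()
    w_ada = ada.update(w_ada, grad)
    w_rms = rms.update(w_rms, grad)
    if t in (1, 10, 100, 1000):
        # AdaGrad's step is roughly lr / sqrt(t); RMSProp's step approaches lr
        print(t, abs(w_ada - prev_ada)[0], abs(w_rms - prev_rms)[0])
```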
Adam
Adam combines the strengths of Momentum (a first-moment estimate of the gradient) and RMSProp (a second-moment estimate), with bias correction for both:
```python
class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.m = None  # first moment: moving average of gradients
        self.v = None  # second moment: moving average of squared gradients
        self.t = 0     # step counter for bias correction

    def update(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * grads ** 2
        # Bias correction compensates for the zero initialization of m and v
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        params -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return params
```
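One detail worth a closer look is the bias correction. Because m and v start at zero, the raw moment estimates are biased toward zero during the first steps, and dividing by (1 - beta ** t) rescales them. A quick numerical check, with an arbitrary gradient value assumed for illustration:

```python
# At t = 1, m = (1 - beta1) * g = 0.1 * g, ten times too small; dividing by
# (1 - beta1 ** 1) = 0.1 restores the scale, so the first step has size about lr
opt = Adam(lr=0.001)
w = opt.update(np.zeros(1), np.array([2.0]))
print(w)  # roughly -0.001: the step magnitude is ~lr regardless of the gradient's scale
```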
Learning Rate Scheduling
A sensible learning rate schedule is crucial for good training results:
```python
import torch
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, ReduceLROnPlateau

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Multiply the learning rate by 0.1 every 30 epochs
scheduler_step = StepLR(optimizer, step_size=30, gamma=0.1)

# Anneal the learning rate along a cosine curve over 100 epochs
scheduler_cosine = CosineAnnealingLR(optimizer, T_max=100)

# Reduce the learning rate when the monitored metric stops improving for 10 epochs
scheduler_plateau = ReduceLROnPlateau(optimizer, mode='min', patience=10)
```
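A schedule only takes effect when it is stepped inside the training loop. The skeleton below (with random data and an MSE loss assumed purely for illustration) shows where the calls go:

```python
criterion = torch.nn.MSELoss()
inputs, targets = torch.randn(64, 10), torch.randn(64, 2)

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler_cosine.step()                # StepLR / CosineAnnealingLR: step once per epoch
    # scheduler_plateau.step(loss.item())  # ReduceLROnPlateau instead takes the monitored metric
```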
优化算法选择建议
| 优化器 |
适用场景 |
优点 |
缺点 |
| SGD + Momentum |
CV任务,追求最佳精度 |
泛化性好 |
需精心调参 |
| Adam |
NLP任务,快速原型 |
收敛快,鲁棒 |
可能不收敛 |
| AdamW |
Transformer模型 |
正确的权重衰减 |
需调整衰减系数 |
| RAdam |
训练初期不稳定 |
自适应预热 |
计算略复杂 |
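The table mentions AdamW's decoupled weight decay, which none of the code above shows. A minimal sketch, assuming it extends the Adam class from earlier (an illustration rather than the reference implementation; in PyTorch the same idea ships as torch.optim.AdamW):

```python
class AdamW(Adam):
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
        super().__init__(lr, beta1, beta2, eps)
        self.weight_decay = weight_decay

    def update(self, params, grads):
        # Decoupled weight decay: shrink the weights directly instead of adding
        # lambda * w to the gradient, so the decay is not rescaled by the adaptive step
        params -= self.lr * self.weight_decay * params
        return super().update(params, grads)
```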
Summary
Optimization algorithms have evolved from plain SGD to the adaptive Adam family, and each has scenarios it suits best. Adam's adaptive learning rates and fast convergence make it the default choice, while SGD with Momentum remains common when the best generalization is the goal. Pairing either with a suitable learning rate schedule can further improve training results.