RLHF: Reinforcement Learning from Human Feedback, Explained


RLHF (Reinforcement Learning from Human Feedback) was one of the most talked-about AI techniques of 2022: it is what turned large language models from "next-token continuation engines" into "intelligent assistants". This article walks through each technical stage of RLHF in depth.

1. Why RLHF Is Needed

A conventional language model is trained to predict the next token:

$$\mathcal{L}_{\mathrm{LM}} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})$$

This objective only cares whether the text "looks like something a human would write", not whether it is useful, safe, or accurate. The core idea of RLHF is to introduce human preferences as an additional optimization signal.
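
Concretely, the later RL stage maximizes a learned reward while a KL term keeps the policy close to the reference (SFT) model; this is the standard RLHF objective, with $\beta$ playing the role of the kl_coef used in the code below:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] - \beta \, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)$$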

2. The Three-Stage RLHF Architecture

Stage 1: Supervised Fine-Tuning (SFT)

# SFT training - fine-tune on human-written instruction data
import torch
from torch.optim import AdamW

class SFTTrainer:
    def __init__(self, model, tokenizer, learning_rate=5e-6):
        self.model = model
        self.tokenizer = tokenizer
        self.optimizer = AdamW(model.parameters(), lr=learning_rate)

    def train_step(self, batch):
        input_ids = batch["input_ids"]
        labels = batch["labels"]  # -100 masks out the prompt tokens

        outputs = self.model(input_ids=input_ids, labels=labels)
        loss = outputs.loss

        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
        self.optimizer.step()
        self.optimizer.zero_grad()

        return loss.item()

# Data format example (assumes a `tokenizer` object is in scope).
# In practice input_ids holds prompt + response tokens and labels mask
# the prompt portion with -100; the five -100s here are illustrative.
sft_example = {
    "input_ids": tokenizer.encode(
        "### Human: Explain what gradient descent is\n### Assistant: "
    ),
    "labels": [-100, -100, -100, -100, -100] + tokenizer.encode(
        "Gradient descent is an optimization algorithm that iteratively "
        "updates parameters along the negative gradient direction..."
    )
}

Stage 2: Reward Model Training
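
The reward model is trained on human preference comparisons. For a query $x$ with a preferred response $y_w$ and a rejected response $y_l$, the pairwise Bradley-Terry loss used in the code below is:

$$\mathcal{L}_{\mathrm{RM}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\big[\, \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \,\big]$$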

import torch
from torch.optim import AdamW

class RewardModelTrainer:
    def __init__(self, model, lr=1e-5):
        self.model = model  # initialized from the SFT model
        self.optimizer = AdamW(model.parameters(), lr=lr)

    def train_step(self, batch):
        """
        batch contains:
        - query: the user input
        - chosen: the response humans preferred
        - rejected: the response humans did not prefer
        """
        chosen_rewards = self.model(batch["query"], batch["chosen"])
        rejected_rewards = self.model(batch["query"], batch["rejected"])

        # Bradley-Terry preference loss
        loss = -torch.log(
            torch.sigmoid(chosen_rewards - rejected_rewards)
        ).mean()

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()

    def train_on_rankings(self, batch):
        """
        Handle a full ranking over multiple responses (not just a single
        pairwise comparison) by summing the Bradley-Terry loss over all
        K*(K-1)/2 ordered pairs.
        """
        K = len(batch["responses"])  # number of responses
        rewards = []
        for resp in batch["responses"]:
            r = self.model(batch["query"], resp)
            rewards.append(r)

        rewards = torch.stack(rewards)
        rankings = batch["rankings"]  # human ranking (lower rank = better)

        loss = 0
        for i in range(K):
            for j in range(K):
                if rankings[i] < rankings[j]:  # i is ranked above j
                    loss -= torch.log(
                        torch.sigmoid(rewards[i] - rewards[j])
                    )
        return loss / (K * (K - 1) / 2)

Stage 3: PPO Reinforcement Learning
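
The policy loss is PPO's clipped surrogate objective, where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio and $\hat{A}_t$ the GAE advantage estimate; this is what ppo_update implements below:

$$\mathcal{L}^{\mathrm{CLIP}}(\theta) = -\,\mathbb{E}_t\Big[\, \min\big( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big) \,\Big]$$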

import torch
import torch.nn.functional as F
from torch.optim import AdamW

class RLHF_PPO:
    def __init__(self, policy, ref_model, reward_model, value_model):
        self.policy = policy
        self.ref_model = ref_model        # frozen SFT model used as KL anchor
        self.reward_model = reward_model
        self.value_model = value_model
        # Optimizer over policy and value parameters (lr here is a typical choice)
        self.optimizer = AdamW(
            list(policy.parameters()) + list(value_model.parameters()), lr=1e-6
        )

        # Hyperparameters
        self.kl_coef = 0.2      # KL penalty coefficient
        self.gamma = 1.0        # discount factor
        self.lam = 0.95         # GAE lambda
        self.clip_range = 0.2   # PPO clipping range
        self.vf_coef = 0.1      # value-loss coefficient
        self.ent_coef = 0.01    # entropy bonus coefficient

    def generate_and_score(self, queries):
        """Generate responses and compute rewards, values, and log-probs."""
        with torch.no_grad():
            # Sample from the current policy
            responses = self.policy.generate(
                queries,
                max_length=512,
                do_sample=True,
                temperature=0.7
            )

            # Reward model scores
            rm_scores = self.reward_model(queries, responses)

            # KL divergence against the frozen reference model
            policy_logp = self.policy.log_prob(queries, responses)
            ref_logp = self.ref_model.log_prob(queries, responses)
            kl_div = (policy_logp - ref_logp).mean()

            # Combined reward: RM score minus KL penalty
            rewards = rm_scores - self.kl_coef * kl_div

            # Critic's value estimates
            values = self.value_model(queries, responses)

        return responses, rewards, values, policy_logp

    def compute_advantages(self, rewards, values):
        """Compute GAE (Generalized Advantage Estimation)."""
        advantages = []
        gae = 0
        next_value = 0

        for t in reversed(range(len(rewards))):
            delta = rewards[t] + self.gamma * next_value - values[t]
            gae = delta + self.gamma * self.lam * gae
            advantages.insert(0, gae)
            next_value = values[t]

        advantages = torch.tensor(advantages)
        # Normalize advantages for training stability
        return (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    def ppo_update(self, queries, responses, old_logprobs,
                   rewards, values, advantages):
        """PPO policy and value update."""
        for epoch in range(4):  # number of PPO epochs per rollout batch
            new_logprobs = self.policy.log_prob(queries, responses)
            ratio = torch.exp(new_logprobs - old_logprobs)

            # Clipped surrogate policy loss
            surr1 = ratio * advantages
            surr2 = torch.clamp(
                ratio,
                1 - self.clip_range,
                1 + self.clip_range
            ) * advantages
            policy_loss = -torch.min(surr1, surr2).mean()

            # Value-function loss (regressed toward the scalar rewards)
            new_values = self.value_model(queries, responses)
            value_loss = F.mse_loss(new_values, rewards)

            # Entropy bonus to preserve exploration
            entropy = self.policy.entropy(queries, responses).mean()

            # Total loss
            total_loss = (
                policy_loss
                + self.vf_coef * value_loss
                - self.ent_coef * entropy
            )

            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(
                self.policy.parameters(), 0.5
            )
            self.optimizer.step()
            self.optimizer.zero_grad()
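
To make the control flow concrete, here is a minimal sketch of one training iteration wiring the three methods above together; trainer and query_loader are hypothetical placeholders, not part of any particular library:

# Assumes `trainer` is an RLHF_PPO instance and `query_loader` yields batches of prompts
for queries in query_loader:
    # 1. Roll out the current policy and score the rollouts
    responses, rewards, values, old_logprobs = trainer.generate_and_score(queries)

    # 2. Turn rewards and value estimates into normalized advantages
    advantages = trainer.compute_advantages(rewards, values)

    # 3. Run several PPO epochs on the collected rollout
    trainer.ppo_update(queries, responses, old_logprobs,
                       rewards, values, advantages)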

3. Key Challenges of RLHF

3.1 Reward Model Overfitting

# Diagnose reward-model overfitting
def evaluate_rm_overfitting(rm, eval_data):
    """Check the RM's calibration/accuracy on a held-out validation set."""
    correct = 0
    total = 0
    for batch in eval_data:
        chosen_score = rm(batch["query"], batch["chosen"])
        rejected_score = rm(batch["query"], batch["rejected"])
        if chosen_score > rejected_score:
            correct += 1
        total += 1

    accuracy = correct / total
    print(f"RM accuracy: {accuracy:.2%}")

    if accuracy > 0.95:
        print("Warning: the RM may be overfitting!")
    return accuracy

3.2 Reward Hacking

The policy model may find loopholes in the RM rather than genuinely improving output quality:

Symptom: RM scores keep rising while human-rated quality drops.
Cause: the policy has found a loophole in the RM's scoring.
Mitigation: periodically refresh the RM's training data, increase the KL penalty, and spot-check outputs by hand (a simple monitoring sketch follows below).
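
One cheap safeguard is to track the reward-model score together with the KL divergence from the reference model during training: a rising score combined with rapidly growing KL is a warning sign. A minimal sketch, assuming the log_prob interface from the PPO class above (the helper name and threshold are illustrative, not from any library):

import torch

def reward_hacking_check(policy, ref_model, reward_model, queries, responses,
                         kl_threshold=10.0):
    """Flag rollouts whose rising RM score may just reflect drift from the reference model."""
    with torch.no_grad():
        score = reward_model(queries, responses).mean()
        kl = (policy.log_prob(queries, responses)
              - ref_model.log_prob(queries, responses)).mean()
    if kl > kl_threshold:  # illustrative threshold, tune per setup
        print(f"Warning: mean KL {kl:.2f} > {kl_threshold}; "
              f"RM score {score:.2f} may not reflect real quality.")
    return score.item(), kl.item()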

3.3 Annotation Consistency

Different annotators may prefer different outputs for the same prompt. Common remedies:

  • Use detailed annotation guidelines
  • Have multiple annotators label each item and take the majority vote (see the sketch after this list)
  • Aggregate pairwise comparisons with an Elo-style rating system
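
As an illustration of the majority-vote idea mentioned above, the helper below collapses several annotators' choices for one comparison pair into a single label and reports how strongly they agree; the function and data layout are hypothetical:

from collections import Counter

def aggregate_pairwise_labels(votes):
    """votes: one 'chosen'/'rejected' string per annotator for a single comparison pair."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    agreement = n / len(votes)  # fraction of annotators agreeing with the majority
    return label, agreement

# Example: two of three annotators prefer response A
label, agreement = aggregate_pairwise_labels(["chosen", "chosen", "rejected"])
print(label, f"{agreement:.0%}")  # -> chosen 67%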

4. Alternatives to RLHF

| Method | Advantages | Drawbacks |
| --- | --- | --- |
| RLHF | Strong, proven results | High annotation cost |
| RLAIF | Low cost | AI feedback can propagate the AI's own biases |
| DPO | No separate reward model to train | Less flexible |
| Constitutional AI | Scalable | Requires carefully designed principles |
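
DPO in particular folds the reward model into the policy objective: given the same preference pairs, it directly optimizes the policy against the reference (SFT) model with the loss introduced in the DPO paper, where $\beta$ controls how far the policy may move from the reference:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$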

Summary

RLHF is the key technique for aligning large models with human preferences. Through its three-stage pipeline (SFT → RM → PPO), it turns a language model from a plain text generator into a useful AI assistant. Understanding RLHF's principles and implementation details is essential for engineers building and deploying large models.
