Loss Computation and Code Implementation for PPO, GRPO, GSPO, and DAPO

2025/10/20 · Source: https://www.cnblogs.com/qlhh/p/19153109

First, a quick look at the basic KL formulas.

KL

KL1:

For large models, the KL term is usually the reverse KL:

\[KL(\pi_\theta||\pi_{ref}) = \mathbb{E}_{x\sim\pi_\theta(\cdot|o_{<t})}\log\frac{\pi_\theta(x|o_{<t})}{\pi_{ref}(x|o_{<t})} \]

\(x\sim\pi_\theta(\cdot|o_{<t})\) means the \(t\)-th token \(x\) is sampled from the current model conditioned on the first \(t-1\) tokens.

KL3 (the unbiased, lower-variance estimator of KL1 used by GRPO; see http://joschu.net/blog/kl-approx.html):

\[KL(\pi_\theta||\pi_{ref}) = \mathbb{E}_{x\sim\pi_\theta(\cdot|o_{<t})}\left[\frac{\pi_{ref}(x|o_{<t})}{\pi_{\theta}(x|o_{<t})} - \log\frac{\pi_{ref}(x|o_{<t})}{\pi_{\theta}(x|o_{<t})}-1\right] \]

  • Forward KL: encourages the model distribution Q to cover all of the support of the target distribution P; suited to cases where broad coverage is needed.
  • Reverse KL: encourages the model distribution Q to concentrate on the high-probability regions of the target distribution P; suited to generation tasks, improving the quality and stability of generated samples.

Therefore, reverse KL is usually preferred for large language models and generation tasks.
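To make the estimators concrete, here is a small sketch (not from the original post or from verl; the toy distributions and variable names are made up) comparing the k1 and k3 Monte-Carlo estimators of the reverse KL against the exact value:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# toy "current policy" pi_theta and "reference policy" pi_ref over a small vocabulary
p_logits = torch.randn(8)   # pi_theta (illustration only)
q_logits = torch.randn(8)   # pi_ref
logp = F.log_softmax(p_logits, dim=-1)
logq = F.log_softmax(q_logits, dim=-1)

# exact reverse KL: KL(pi_theta || pi_ref) = E_{x~pi_theta}[log p(x) - log q(x)]
exact_kl = (logp.exp() * (logp - logq)).sum()

# Monte-Carlo estimates with samples drawn from pi_theta
x = torch.multinomial(logp.exp(), num_samples=100_000, replacement=True)
k1 = logp[x] - logq[x]                                     # k1: log(p/q)
k3 = (logq[x] - logp[x]).exp() - (logq[x] - logp[x]) - 1   # k3: q/p - log(q/p) - 1

print(f"exact {exact_kl.item():.4f} | "
      f"k1 {k1.mean().item():.4f} (std {k1.std().item():.3f}) | "
      f"k3 {k3.mean().item():.4f} (std {k3.std().item():.3f})")
```

Both estimators average to the exact KL, but k3's per-sample values have a much smaller standard deviation, which is why GRPO uses it.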

Loss computation in different RL algorithms

For the \(t\)-th token of the \(i\)-th sample drawn for query \(q\), the per-token loss is \(loss_{i,t}=pg\_loss_{i,t}+entropy\_loss_{i, t}+kl\_loss_{i,t}\).

All token losses \(loss_{i,t}\) in a batch are then aggregated (agg) into a single batch loss, which is used for backpropagation and the model update.

Per-token loss components and aggregation mode for each method:

| Method | \(pg\_loss_{i,t}\) (per token) | \(kl\_loss_{i,t}\) (per token) | loss agg mode |
| --- | --- | --- | --- |
| PPO | \(\max(IS_{i,t}\cdot(-A_{i,t}),\ clip(IS_{i,t})\cdot(-A_{i,t}))\) | folded into the reward: \(r_t \leftarrow r_t-\mathbb{D1}_{KL}(\pi_{old}\Vert\pi_{ref})\) | seq-mean-token-mean |
| Dual-clip PPO | for \(A_{i,t}<0\): \(\min(\max(IS_{i,t}\cdot(-A_{i,t}),\ clip(IS_{i,t})\cdot(-A_{i,t})),\ clip\_c\cdot(-A_{i,t}))\) | folded into the reward: \(r_t \leftarrow r_t-\mathbb{D1}_{KL}(\pi_{old}\Vert\pi_{ref})\) | seq-mean-token-mean |
| GRPO | \(\max(IS_{i,t}\cdot(-A_{i,t}),\ clip(IS_{i,t})\cdot(-A_{i,t}))\) | \(\beta\cdot\mathbb{D3}_{KL}(\pi_{\theta}\Vert\pi_{ref})\) | seq-mean-token-mean |
| GSPO | as PPO, but with \(IS_{i,t} = sg[(\frac{\pi_{\theta}(o_i\mid q)}{\pi_{old}(o_i\mid q)})^{\frac{1}{\lvert o_i\rvert}}]\cdot\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{sg[\pi_{\theta}(o_{i,t}\mid q,o_{i,<t})]}\) | \(\beta\cdot\mathbb{D3}_{KL}(\pi_{\theta}\Vert\pi_{ref})\) | seq-mean-token-mean |
| DAPO | \(\max(IS_{i,t}\cdot(-A_{i,t}),\ clip(IS_{i,t})\cdot(-A_{i,t}))\) | not used (DAPO drops the KL term) | token-mean |

seq-mean-token-mean: \(\frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}loss_{i,t}\); token-mean: \(\frac{1}{\sum_{i=1}^G|o_i|}\sum_{i=1}^G\sum_{t=1}^{|o_i|}loss_{i,t}\).

PPO

Optimization objective:

\[J = \mathbb{E}_{o\sim\pi_{old}}\frac{1}{|o|}\sum_{i=1}^{|o|} \min\left[\frac{\pi_{\theta}(o_i|o_{<i}, q)}{\pi_{old}(o_i|o_{<i}, q)}A_i,\ clip\left(\frac{\pi_{\theta}(o_i|o_{<i}, q)}{\pi_{old}(o_i|o_{<i}, q)}, 1-\epsilon, 1+\epsilon\right)A_i\right] \]

Advantage: GAE
Recursively, the cumulative advantage at step t = the one-step advantage at t + the (discounted) cumulative advantage at t+1 = the sum of one-step advantages from t onward = the sum of all rewards from step t onward minus the value estimate at step t (the recursion below is GAE with \(\lambda=1\), taking \(V_{T+1}=0\)).

\[\begin{aligned} A_t &= (r_t+\gamma V_{t+1}-V_t)+\gamma A_{t+1}\\ A_t &= \sum_{i=t}^T \gamma ^{i-t}(r_i+\gamma V_{i+1}-V_i)\\ A_t &= r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+\dots+\gamma^{T-t}r_T-V_t \end{aligned} \]
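A minimal sketch of this recursion (toy rewards and values, not verl's GAE implementation), computed with a backward loop as the first line suggests:

```python
import torch

def gae_lambda1(rewards: torch.Tensor, values: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """A_t = delta_t + gamma * A_{t+1}, with delta_t = r_t + gamma * V_{t+1} - V_t (lambda = 1)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    next_value, next_adv = 0.0, 0.0  # V_{T+1} = 0, A_{T+1} = 0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        next_adv = delta + gamma * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

rewards = torch.tensor([0.0, 0.0, 1.0])  # toy per-token rewards
values = torch.tensor([0.2, 0.5, 0.8])   # toy critic values
print(gae_lambda1(rewards, values))      # equals the sum of future rewards minus V_t when gamma = 1
```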

Reward:

\[r_t=\begin{cases}-KL(\pi_{old}||\pi_{ref}), &t\neq T \\ -KL(\pi_{old}||\pi_{ref})+RM(q,o_i), &t=T \end{cases} \]

verl/trainer/ppo/ray_trainer.py — how does verl add the KL penalty to the reward?

```python
###################################################
# Fold the KL penalty into the reward. The original reward is [0, 0, 0, ..., RM(q, o_i)].
# Returns token_level_scores - beta * KL(\pi_old || \pi_ref)
###################################################
def apply_kl_penalty(data: DataProto, kl_ctrl: core_algos.AdaptiveKLController, kl_penalty="kl"):
    """Apply KL penalty to the token-level rewards.

    This function computes the KL divergence between the reference policy and current policy,
    then applies a penalty to the token-level rewards based on this divergence.

    Args:
        data (DataProto): The data containing batched model outputs and inputs.
        kl_ctrl (core_algos.AdaptiveKLController): Controller for adaptive KL penalty.
        kl_penalty (str, optional): Type of KL penalty to apply. Defaults to "kl".

    Returns:
        tuple: A tuple containing:
            - The updated data with token-level rewards adjusted by KL penalty
            - A dictionary of metrics related to the KL penalty
    """
    response_mask = data.batch["response_mask"]
    token_level_scores = data.batch["token_level_scores"]
    batch_size = data.batch.batch_size[0]

    # compute kl between ref_policy and current policy
    # When apply_kl_penalty, algorithm.use_kl_in_reward=True, so the reference model has been enabled.
    kld = core_algos.kl_penalty(
        data.batch["old_log_probs"], data.batch["ref_log_prob"], kl_penalty=kl_penalty
    )  # (batch_size, response_length)
    kld = kld * response_mask
    beta = kl_ctrl.value

    token_level_rewards = token_level_scores - beta * kld
```

KL

\[KL(\pi_{old}||\pi_{ref}) = log(\frac{\pi_{old}(o_t|q, o_{<t})}{\pi_{ref}(o_t|q, o_{<t})}) \]

In PPO, this KL divergence is measured from \(\pi_{old}\) to \(\pi_{ref}\).

PPO's code path is shown below under Dual-clip PPO (an improved version of PPO).

Dual-clip PPO

https://arxiv.org/pdf/1912.09729: clip the importance-sampling ratio IS for tokens with A < 0.


The paper observes that when A < 0, the product of the importance-sampling ratio and A can go to negative infinity, which destabilizes training (gradient explosion). On top of PPO's clipping, it therefore adds an additional clip (clip_ratio_c) for tokens with A < 0.

\[\mathrm{per\ token\ objective} = \begin{cases} \min(IS\cdot A,\ clip(IS, 1-\epsilon, 1+\epsilon)\cdot A), &A\geq0\\ \max(\min(IS\cdot A,\ clip(IS, 1-\epsilon, 1+\epsilon)\cdot A),\ clip\_ratio\_c\cdot A), &A<0\\ \end{cases} \]
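Before looking at the full verl implementation, a condensed sketch of just this piecewise rule, written as a loss (hence the min/max are flipped); the tensors and hyperparameter values are illustrative, not verl code:

```python
import torch

eps, clip_ratio_c = 0.2, 3.0                 # illustrative hyperparameters
ratio = torch.tensor([0.5, 1.5, 5.0, 5.0])   # toy IS ratios
adv = torch.tensor([1.0, 1.0, -1.0, 1.0])    # toy advantages

loss1 = -adv * ratio
loss2 = -adv * torch.clamp(ratio, 1 - eps, 1 + eps)
ppo_loss = torch.maximum(loss1, loss2)             # standard PPO clip, as a loss
dual_clip = torch.min(ppo_loss, -adv * clip_ratio_c)
per_token_loss = torch.where(adv < 0, dual_clip, ppo_loss)  # the extra clip only affects A < 0
print(per_token_loss)  # the third token (ratio 5, A = -1) is capped at 3.0 instead of 5.0
```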

Code:

The overall ppo_loss is composed of pg_loss + kl_loss + entropy_loss; different RL methods differ in how pg_loss and kl_loss are computed.

  • pg_loss: implemented in verl/trainer/ppo/core_algos.py (the corresponding pg_loss code is covered in the Dual-clip PPO and GSPO sections).
  • kl_loss: also in verl/trainer/ppo/core_algos.py (the low_var_kl code is covered in the GRPO section).

verl/verl/workers/roles/utils/losses.py: computing ppo_loss

```python
######################################################
# This function computes the overall actor loss
######################################################
def ppo_loss(config: ActorConfig, model_output, data: TensorDict, dp_group=None):
    log_prob = model_output["log_probs"]
    entropy = model_output.get("entropy", None)
    log_prob = no_padding_2_padding(log_prob, data)  # (bsz, response_length)
    if entropy is not None:
        entropy = no_padding_2_padding(entropy, data)  # (bsz, response_length)
    metrics = {}
    response_mask = data["response_mask"].to(bool)

    # compute policy loss
    old_log_prob = data["old_log_probs"]
    advantages = data["advantages"]

    loss_agg_mode = config.loss_agg_mode
    loss_mode = config.policy_loss.get("loss_mode", "vanilla")
    policy_loss_fn = get_policy_loss_fn(loss_mode)
    # dispatches to the pg_loss implementation shown in the next code block
    pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = policy_loss_fn(
        old_log_prob=old_log_prob,
        log_prob=log_prob,
        advantages=advantages,
        response_mask=response_mask,
        loss_agg_mode=loss_agg_mode,
        config=config,
    )
    metrics.update(
        {
            "pg_loss": pg_loss.detach().item(),
            "pg_clipfrac": pg_clipfrac.detach().item(),
            "ppo_kl": ppo_kl.detach().item(),
            "pg_clipfrac_lower": pg_clipfrac_lower.detach().item(),
        }
    )
    policy_loss = pg_loss

    # optionally add entropy loss
    if entropy is not None:
        entropy_loss = agg_loss(loss_mat=entropy, loss_mask=response_mask, loss_agg_mode=loss_agg_mode)
        entropy_coeff = config.entropy_coeff
        # higher token entropy is better, while the loss is minimized, so the entropy term is subtracted
        policy_loss -= entropy_coeff * entropy_loss

    # optionally add KL loss (used by GRPO/GSPO, not by PPO/DAPO)
    if config.use_kl_loss:
        ref_log_prob = data["ref_log_prob"]
        # compute kl loss
        kld = kl_penalty(logprob=log_prob, ref_logprob=ref_log_prob, kl_penalty=config.kl_loss_type)
        kl_loss = agg_loss(loss_mat=kld, loss_mask=response_mask, loss_agg_mode=config.loss_agg_mode)
        policy_loss += kl_loss * config.kl_loss_coef
        metrics["kl_loss"] = kl_loss.detach().item()
        metrics["kl_coef"] = config.kl_loss_coef

    return policy_loss, metrics
```

verl/trainer/ppo/core_algos.py — different RL methods compute pg_loss differently. Below is PPO's pg_loss; GSPO's pg_loss implementation is shown later.

```python
######################################################
# This function computes pg_loss only; the KL penalty term is not computed here
######################################################
@register_policy_loss("vanilla")  # type: ignore[arg-type]
def compute_policy_loss_vanilla(
    old_log_prob: torch.Tensor,
    log_prob: torch.Tensor,
    advantages: torch.Tensor,
    response_mask: torch.Tensor,
    loss_agg_mode: str = "token-mean",
    config: Optional[DictConfig | AlgoConfig] = None,
    rollout_is_weights: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    """Compute the clipped policy objective and related metrics for PPO.

    Adapted from
    https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1122

    Args:
        old_log_prob (torch.Tensor):
            Log-probabilities of actions under the old policy, shape (batch_size, response_length).
        log_prob (torch.Tensor):
            Log-probabilities of actions under the current policy, shape (batch_size, response_length).
        advantages (torch.Tensor):
            Advantage estimates for each action, shape (batch_size, response_length).
        response_mask (torch.Tensor):
            Mask indicating which tokens to include in the loss, shape (batch_size, response_length).
        loss_agg_mode (str, optional):
            Aggregation mode for `agg_loss`. Defaults to "token-mean".
        config: `(verl.trainer.config.ActorConfig)`:
            config for the actor.
        rollout_log_probs: `(torch.Tensor)`:
            log probabilities of actions under the rollout policy, shape (batch_size, response_length).
    """
    assert config is not None
    assert not isinstance(config, AlgoConfig)
    clip_ratio = config.clip_ratio  # Clipping parameter ε for standard PPO. See https://arxiv.org/abs/1707.06347.
    clip_ratio_low = config.clip_ratio_low if config.clip_ratio_low is not None else clip_ratio
    clip_ratio_high = config.clip_ratio_high if config.clip_ratio_high is not None else clip_ratio
    clip_ratio_c = config.get(  # Lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729.
        "clip_ratio_c", 3.0
    )

    cliprange = clip_ratio
    cliprange_low = clip_ratio_low
    cliprange_high = clip_ratio_high

    assert clip_ratio_c > 1.0, (
        "The lower bound of the clip_ratio_c for dual-clip PPO should be greater than 1.0,"
        + f" but get the value: {clip_ratio_c}."
    )

    # log of the per-token importance-sampling ratio:
    # log(\pi_{\theta}(o_{i,t}|q,o_{i,<t})) - log(\pi_{old}(o_{i,t}|q,o_{i,<t}))
    negative_approx_kl = log_prob - old_log_prob
    # Clamp negative_approx_kl for stability (avoids overly large or small ratios)
    negative_approx_kl = torch.clamp(negative_approx_kl, min=-20.0, max=20.0)
    # the actual importance-sampling ratio
    ratio = torch.exp(negative_approx_kl)
    # masked token-level mean of the negative log-ratio
    ppo_kl = verl_F.masked_mean(-negative_approx_kl, response_mask)

    ######################################################
    # pg_loss per token:
    #   A >= 0: max(ratio * -A, clip(ratio, 1-eps_low, 1+eps_high) * -A)
    #   A <  0: min(max(ratio * -A, clip(ratio, 1-eps_low, 1+eps_high) * -A), clip_ratio_c * -A)
    ######################################################
    pg_losses1 = -advantages * ratio
    if cliprange_low is None:
        cliprange_low = cliprange
    if cliprange_high is None:
        cliprange_high = cliprange
    # clipped loss
    pg_losses2 = -advantages * torch.clamp(
        ratio, 1 - cliprange_low, 1 + cliprange_high
    )  # - clip(ratio, 1-cliprange, 1+cliprange) * A
    # ppo per token loss
    clip_pg_losses1 = torch.maximum(
        pg_losses1, pg_losses2
    )  # max(-ratio * A, -clip(ratio, 1-cliprange, 1+cliprange) * A)
    # fraction (a scalar) of clipped tokens among all unmasked tokens in this batch
    pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses1).float(), response_mask)

    # dual-clip PPO: bound the loss of tokens with A < 0 by clip_ratio_c
    pg_losses3 = -advantages * clip_ratio_c
    # min(max(ratio*-A, clip(ratio, 1-eps_low, 1+eps_high)*-A), clip_ratio_c*-A)
    clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
    # fraction (a scalar) of unmasked tokens with A < 0 that are further clipped by clip_ratio_c
    # on top of vanilla PPO
    pg_clipfrac_lower = verl_F.masked_mean(
        torch.gt(clip_pg_losses1, pg_losses3) * (advantages < 0).float(), response_mask
    )
    # pg_losses is piecewise (per-token loss): clip_pg_losses2 when A < 0, clip_pg_losses1 when A >= 0
    pg_losses = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)  # pg_losses: (bsz, response_length)
    # aggregate the per-token losses of the whole batch into a scalar; the method depends on loss_agg_mode
    pg_loss = agg_loss(loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode=loss_agg_mode)

    return pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower
```

Next, let's look at the token-loss aggregation modes; different RL methods also use different loss agg modes.

verl/trainer/ppo/core_algos.py

```python
def agg_loss(loss_mat: torch.Tensor, loss_mask: torch.Tensor, loss_agg_mode: str):
    """Aggregate the loss matrix into a scalar.

    Args:
        loss_mat: `(torch.Tensor)`:
            shape: (bs, response_length)
        loss_mask: `(torch.Tensor)`:
            shape: (bs, response_length)
        loss_agg_mode: (str) choices:
            method to aggregate the loss matrix into a scalar.
    Returns:
        loss: `a scalar torch.Tensor`
            aggregated loss
    """
    if loss_agg_mode == "token-mean":
        loss = verl_F.masked_mean(loss_mat, loss_mask)
    elif loss_agg_mode == "seq-mean-token-sum":
        seq_losses = torch.sum(loss_mat * loss_mask, dim=-1)  # token-sum
        loss = torch.mean(seq_losses)  # seq-mean
    elif loss_agg_mode == "seq-mean-token-mean":
        seq_losses = torch.sum(loss_mat * loss_mask, dim=-1) / torch.sum(loss_mask, dim=-1)  # token-mean
        loss = torch.mean(seq_losses)  # seq-mean
    elif loss_agg_mode == "seq-mean-token-sum-norm":
        seq_losses = torch.sum(loss_mat * loss_mask, dim=-1)
        loss = torch.sum(seq_losses) / loss_mask.shape[-1]  # The divisor
        # (loss_mask.shape[-1]) should ideally be constant
        # throughout training to well-replicate the DrGRPO paper.
        # TODO: Perhaps add user-defined normalizer argument to
        # agg_loss to ensure divisor stays constant throughout.
    else:
        raise ValueError(f"Invalid loss_agg_mode: {loss_agg_mode}")

    return loss
```
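A toy comparison of the two modes used by the methods in this post (illustrative tensors only, not verl data): with responses of different lengths, token-mean (DAPO) weights every token equally, while seq-mean-token-mean (PPO/GRPO/GSPO) first averages within each response and then across responses.

```python
import torch

# two responses: lengths 2 and 4, with toy per-token losses
loss_mat = torch.tensor([[1.0, 1.0, 0.0, 0.0],
                         [4.0, 4.0, 4.0, 4.0]])
loss_mask = torch.tensor([[1.0, 1.0, 0.0, 0.0],
                          [1.0, 1.0, 1.0, 1.0]])

# token-mean (DAPO): every unmasked token counts equally
token_mean = (loss_mat * loss_mask).sum() / loss_mask.sum()        # (2*1 + 4*4) / 6 = 3.0

# seq-mean-token-mean (PPO/GRPO/GSPO): average within each sequence, then across sequences
seq_means = (loss_mat * loss_mask).sum(-1) / loss_mask.sum(-1)     # [1.0, 4.0]
seq_mean_token_mean = seq_means.mean()                             # 2.5

print(token_mean.item(), seq_mean_token_mean.item())
```

The longer response dominates under token-mean, which is exactly the behavior DAPO wants for long chains of thought.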

GRPO

Optimization objective:

\[J= \mathbb{E}_{\{o_i\}_{i=1}^G\sim\pi_{old}(\cdot|q)} \frac{1}{|G|} \sum_{i=1}^{|G|}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\{\min[\frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{old}(o_{i,t}|q, o_{i, <t})}A_{i, t}, clip(\frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{old}(o_{i,t}|q, o_{i, <t})}, 1-\epsilon, 1+\epsilon)A_{i,t}]-\beta \mathbb{D}_{KL}(\pi_{\theta}||\pi_{ref})\} \]

Advantage:

\[A_{i,t} = \frac{r_i-mean(r)}{std(r)} \]
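A minimal sketch of this group-normalized advantage (not verl code): every token of response \(i\) shares the same \(A_{i,t}\), computed from the G rewards sampled for the same prompt. The small eps added to the denominator is an implementation convenience, not part of the formula above.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scalar rewards of the G sampled responses for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])  # toy outcome rewards for G = 4 responses
adv = grpo_advantages(rewards)                 # one scalar per response, broadcast to all its tokens
print(adv)
```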

KL3

\[\mathbb{D}_{KL}(\pi_{\theta}||\pi_{ref}) =\frac{\pi_{ref}(o_{i, t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q, o_{i, <t})} -log(\frac{\pi_{ref}(o_{i, t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q, o_{i, <t})})-1 \]

KL3 has lower variance than KL1, and it is an unbiased estimator of the same KL (its expectation equals that of KL1).

Proof

\[\begin{aligned} \mathbb{D3}_{KL}(P||Q) &= \sum_{x}P(x) \left[\frac{Q(x)}{P(x)} - \log\frac{Q(x)}{P(x)}-1\right]\\ &= \sum_{x}\left[Q(x)+P(x)\log\frac{P(x)}{Q(x)}-P(x)\right]\\ &=\sum_{x}Q(x) -\sum_{x}P(x)+\mathbb{D1}_{KL}(P||Q) \\ &=\mathbb{D1}_{KL}(P||Q)+\sum_{x}Q(x)-1\\ &=\mathbb{D1}_{KL}(P||Q) \qquad \text{when } \textstyle\sum_{x}Q(x)=1 \text{ over the support of } P \text{ (same vocabulary)} \end{aligned} \]

verl/trainer/ppo/core_algos.py — verl's implementation of the KL loss:

```python
def kl_penalty_forward(logprob: torch.FloatTensor, ref_logprob: torch.FloatTensor, kl_penalty) -> torch.FloatTensor:
    """Compute KL divergence given logprob and ref_logprob.
    Copied from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1104
    See more description in http://joschu.net/blog/kl-approx.html

    Args:
        logprob:
        ref_logprob:

    Returns:
        kl_estimate
    """
    if kl_penalty in ("kl", "k1"):
        return logprob - ref_logprob

    if kl_penalty == "abs":
        return (logprob - ref_logprob).abs()

    if kl_penalty in ("mse", "k2"):
        return 0.5 * (logprob - ref_logprob).square()

    ##############################################################
    # low_var_kl below matches the GRPO KL formula given above
    ##############################################################
    # J. Schulman. Approximating kl divergence, 2020.
    # URL http://joschu.net/blog/kl-approx.html.
    if kl_penalty in ("low_var_kl", "k3"):
        kl = ref_logprob - logprob
        # For numerical stability
        kl = torch.clamp(kl, min=-20, max=20)
        ratio = torch.exp(kl)
        kld = (ratio - kl - 1).contiguous()
        return torch.clamp(kld, min=-10, max=10)

    if kl_penalty == "full":
        # so, here logprob and ref_logprob should contain the logits for every token in vocabulary
        raise NotImplementedError

    raise NotImplementedError
```

GSPO

Sequence-level optimization objective:

\[J= \mathbb{E}_{\{o_i\}_{i=1}^G\sim\pi_{old}(\cdot|q)} \frac{1}{|G|} \sum_{i=1}^{|G|}\min[(\frac{\pi_{\theta}(o_{i}|q)}{\pi_{old}(o_{i}|q)})^{\frac{1}{|o_i|}}A_{i}, clip((\frac{\pi_{\theta}(o_{i}|q)}{\pi_{old}(o_{i}|q)})^{\frac{1}{|o_i|}}, 1-\epsilon, 1+\epsilon)A_{i}] \]

\[\frac{\pi_{\theta}(o_i|q)}{\pi_{old}(o_i|q)} = \frac{\Pi_{t=1}^{|o_i|} \pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\Pi_{t=1}^{|o_i|} \pi_{old}(o_{i,t}|q, o_{i,<t})} \]

Token-level optimization objective:

\[J = \mathbb{E}_{\{o_i\}_{i=1}^G\sim \pi_{old}(\cdot|q)}\frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min(s_{i,t}A_{i,t}, clip(s_{i,t}, 1-\epsilon,1+\epsilon)A_{i,t})\\ s_{i,t} = sg[(\frac{\pi_{\theta}(o_i|q)}{\pi_{old}(o_i|q)})^{\frac{1}{|o_i|}}]\cdot \frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{sg[\pi_{\theta}(o_{i,t}|q,o_{i,<t})]} \]

Note that \(sg[s_{i,t}]=sg[s_{i}]\) with \(s_{i}=(\frac{\pi_{\theta}(o_i|q)}{\pi_{old}(o_i|q)})^{\frac{1}{|o_i|}}\): the two ratios have the same forward value, but they differ in the gradients they produce (the token-level ratio only back-propagates through token \(t\)).

It can be shown that when \(A_{i,t}=A_i\), the seq-level and token-level objectives are identical in both the forward and the backward pass.
The token-level form adds flexibility for per-token advantages within the same sample (each token's \(A\) may differ).
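This claim can be checked numerically with a small sketch (clipping and masking omitted; toy log-probs, not verl code): with a constant advantage per sequence, the stop-gradient token-level ratio reproduces both the value and the gradient of the sequence-level ratio.

```python
import torch

torch.manual_seed(0)
log_prob = torch.randn(1, 4, requires_grad=True)            # toy current log-probs, one response of 4 tokens
old_log_prob = log_prob.detach() + 0.1 * torch.randn(1, 4)  # toy old-policy log-probs
A = torch.tensor(0.7)                                       # same advantage for every token of the response

# sequence-level ratio: s_i = exp(mean_t(log pi_theta - log pi_old))
log_s = (log_prob - old_log_prob).mean(dim=-1)
seq_loss = -(log_s.exp() * A).mean()

# token-level ratio: s_{i,t} = sg[s_i] * pi_theta / sg[pi_theta], built in log space
log_s_t = log_prob - log_prob.detach() + log_s.detach().unsqueeze(-1)
tok_loss = -(log_s_t.exp() * A).mean()                      # seq-mean-token-mean over a single sequence

print(torch.allclose(seq_loss, tok_loss))                   # identical forward value
g_seq, = torch.autograd.grad(seq_loss, log_prob, retain_graph=True)
g_tok, = torch.autograd.grad(tok_loss, log_prob)
print(torch.allclose(g_seq, g_tok))                         # identical gradients when A_{i,t} = A_i
```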

verl/trainer/ppo/core_algos.py

```python
##########################################################
# Compute GSPO's pg_loss; note how the importance-sampling ratio is constructed
##########################################################
@register_policy_loss("gspo")
def compute_policy_loss_gspo(
    old_log_prob: torch.Tensor,
    log_prob: torch.Tensor,
    advantages: torch.Tensor,
    response_mask: torch.Tensor,
    loss_agg_mode: str = "seq-mean-token-mean",
    config: Optional[DictConfig | ActorConfig] = None,
    rollout_is_weights: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    """Compute the clipped policy objective and related metrics for GSPO.

    See https://arxiv.org/pdf/2507.18071 for more details.

    Args:
        old_log_prob (torch.Tensor):
            Log-probabilities of actions under the old policy, shape (batch_size, response_length).
        log_prob (torch.Tensor):
            Log-probabilities of actions under the current policy, shape (batch_size, response_length).
        advantages (torch.Tensor):
            Advantage estimates for each action, shape (batch_size, response_length).
        response_mask (torch.Tensor):
            Mask indicating which tokens to include in the loss, shape (batch_size, response_length).
        loss_agg_mode (str, optional):
            Aggregation mode for `agg_loss`. For GSPO, it is recommended to use "seq-mean-token-mean".
    """
    assert config is not None
    assert isinstance(config, ActorConfig)
    clip_ratio_low = config.clip_ratio_low if config.clip_ratio_low is not None else config.clip_ratio
    clip_ratio_high = config.clip_ratio_high if config.clip_ratio_high is not None else config.clip_ratio

    negative_approx_kl = log_prob - old_log_prob

    # compute sequence-level importance ratio:
    # si(θ) = (π_θ(yi|x)/π_θold(yi|x))^(1/|yi|) =
    # exp [(1/|y_i|) * Σ_t log(π_θ(y_i,t|x,y_i,<t)/π_θold(y_i,t|x,y_i,<t))]
    seq_lengths = torch.sum(response_mask, dim=-1).clamp(min=1)
    negative_approx_kl_seq = torch.sum(negative_approx_kl * response_mask, dim=-1) / seq_lengths

    # Combined ratio at token level:
    # s_i,t(θ) = sg[s_i(θ)] · π_θ(y_i,t|x, y_i,<t) / sg[π_θ(y_i,t|x, y_i,<t)]
    # In log space: log(s_i,t(θ)) = sg[log(s_i(θ))] + log_prob - sg[log_prob]
    log_seq_importance_ratio = log_prob - log_prob.detach() + negative_approx_kl_seq.detach().unsqueeze(-1)
    log_seq_importance_ratio = torch.clamp(log_seq_importance_ratio, max=10.0)  # clamp for numerical stability

    # finally exp() to remove log
    seq_importance_ratio = torch.exp(log_seq_importance_ratio)

    pg_losses1 = -advantages * seq_importance_ratio
    pg_losses2 = -advantages * torch.clamp(seq_importance_ratio, 1 - clip_ratio_low, 1 + clip_ratio_high)
    pg_losses = torch.maximum(pg_losses1, pg_losses2)

    # Apply rollout importance sampling weights if provided
    if rollout_is_weights is not None:
        pg_losses = pg_losses * rollout_is_weights

    # for GSPO, we need to aggregate the loss at the sequence level (seq-mean-token-mean)
    pg_loss = agg_loss(loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode="seq-mean-token-mean")

    # For compatibility, return zero for pg_clipfrac_lower (not used in standard GSPO)
    pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses1).float(), response_mask)
    pg_clipfrac_lower = torch.tensor(0.0, device=pg_loss.device)
    ppo_kl = verl_F.masked_mean(-negative_approx_kl, response_mask)

    return pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower
```

DAPO

Optimization objective:

\[\mathcal{J} = \mathbb{E}_{(q,a)\sim \mathcal{D}, \{o_i\}_{i=1}^G\sim \pi_{old}(\cdot|q)} [\frac{1}{\sum_{i=1}^G|o_i|}\sum_{i=1}^G\sum_{t=1}^{|o_i|}\min(r_{i,t}(\theta)A_{i, t}, clip(r_{i,t}(\theta),1-\epsilon_{low}, 1+\epsilon_{high})A_{i,t})]\\ s.t.\ 0<|\{o_i|is\_equivalent(o_i,a)\}|<G \]

where

\[r_{i,t}(\theta)=\frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{old}(o_{i,t}|q,o_{i,<t})}, A_{i,t} = \frac{R_i-mean(\{R_i\}_{i=1}^G)}{std(\{R_i\}_{i=1}^G)} \]

Its loss agg mode is token-mean, so the vanilla pg_loss shown earlier is reused with the token-mean aggregation from the table above; DAPO adds no KL loss term.
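The post shows no DAPO-specific code, so here is a hedged sketch (hypothetical helper name, toy rewards) of the dynamic-sampling constraint \(0<|\{o_i\mid is\_equivalent(o_i,a)\}|<G\), which drops prompt groups whose responses are all correct or all wrong:

```python
import torch

def keep_group(rewards: torch.Tensor) -> bool:
    """Dynamic sampling: drop a prompt group if all G responses are correct or all are wrong,
    since its group-normalized advantages would all be zero (no learning signal)."""
    num_correct = (rewards > 0).sum().item()
    return 0 < num_correct < rewards.numel()

groups = {
    "q1": torch.tensor([1.0, 0.0, 1.0, 0.0]),  # mixed       -> kept
    "q2": torch.tensor([1.0, 1.0, 1.0, 1.0]),  # all correct -> dropped, resample another prompt
    "q3": torch.tensor([0.0, 0.0, 0.0, 0.0]),  # all wrong   -> dropped, resample another prompt
}
kept = {q: r for q, r in groups.items() if keep_group(r)}
print(list(kept))  # ['q1']
```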
