Author: Jian Hu
First published on: 2025/10/30
As is well known, during training with methods such as RLHF (Reinforcement Learning from Human Feedback) or RLVR (Reinforcement Learning with Verifiable Rewards), we typically use the Reverse $KL$ (Kullback-Leibler) divergence between the current policy model ($\pi_{\theta}$) and a reference model ($\pi_{ref}$) to constrain policy learning and improve training stability. (This blog does not discuss methods such as DAPO that remove this KL constraint.)
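For concreteness, this constraint commonly appears as a per-sample KL penalty subtracted from the reward, with $\beta$ denoting the penalty coefficient (the exact form varies between implementations):

$J(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta}(\cdot\mid x)}\big[r(x,y)-\beta\log\tfrac{\pi_{\theta}(y\mid x)}{\pi_{ref}(y\mid x)}\big]$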
In practice, this KL divergence and its gradient cannot be computed exactly, so three Monte Carlo approximations (k1, k2, k3) are commonly used to estimate the KL expectation and its gradient.
The Kullback-Leibler (KL) divergence from a distribution $q(x)$ to a reference $p(x)$ is defined as:
$D_{KL}(q||p)=\mathbb{E}_{x\sim q}[\log\frac{q(x)}{p(x)}]$
As this expectation is often intractable, it is estimated from Monte Carlo samples. Given the importance ratio $\delta(x)=p(x)/q(x)$, the common estimators of the term inside the expectation are:

$k1 = -\log\delta(x)$

$k2 = \frac{1}{2}(\log\delta(x))^2$

$k3 = \delta(x) - 1 - \log\delta(x)$
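For illustration, here is a minimal PyTorch sketch of the three estimators computed per token from log-probabilities; the function name kl_estimators and its argument names are placeholders for this post, not part of any specific library.

```python
import torch

def kl_estimators(logprob_q: torch.Tensor, logprob_p: torch.Tensor):
    """Per-token estimators of the reverse KL D_KL(q || p).

    logprob_q: log-probabilities of the sampled tokens under the current
               policy q (the distribution the samples were drawn from).
    logprob_p: log-probabilities of the same tokens under the reference
               policy p.
    """
    # log delta(x) = log p(x) - log q(x)
    log_ratio = logprob_p - logprob_q
    k1 = -log_ratio                              # k1 = -log delta(x)
    k2 = 0.5 * log_ratio ** 2                    # k2 = (1/2) (log delta(x))^2
    k3 = torch.exp(log_ratio) - 1.0 - log_ratio  # k3 = delta(x) - 1 - log delta(x)
    return k1, k2, k3
```

Averaging any of these estimators over tokens sampled from $q$ yields an estimate of $D_{KL}(q||p)$; k3 has the additional property of being non-negative for every sample, since $\log\delta\le\delta-1$.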
What is Reverse KL? In simple terms, "Reverse KL" in RLHF means that the samples used to compute the KL divergence are drawn from the current policy $\pi_{\theta}$ (the $q$ distribution in the formula above), while the target distribution is the reference policy $\pi_{ref}$ (the $p$ distribution). Since we do sample from the current policy $\pi_{\theta}$ during RL training, the KL term in RLHF generally refers to the Reverse KL (RKL).
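Substituting $q=\pi_{\theta}$ and $p=\pi_{ref}$ into the definition above gives the familiar form:

$D_{KL}(\pi_{\theta}||\pi_{ref})=\mathbb{E}_{x\sim\pi_{\theta}}[\log\frac{\pi_{\theta}(x)}{\pi_{ref}(x)}]$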
In critic-free algorithms such as GRPO or REINFORCE++-baseline, which is better to use as the KL loss term: k1, k2, or k3?