Author: Jian Hu

First published on: 2025/10/30

As is well known, during training with RLHF (Reinforcement Learning from Human Feedback) or RLVR (Reinforcement Learning with Verifiable Rewards), we typically use the reverse KL (Kullback-Leibler) divergence between the current policy model ($\pi_{\theta}$) and a reference model ($\pi_{ref}$) to constrain policy learning and improve training stability. (This blog does not discuss methods such as DAPO that remove this KL constraint.)

Three KL Approximation Methods: k1, k2, k3

In practice, the exact KL divergence (and hence its gradient) cannot be computed directly over the full sequence distribution. Therefore, three Monte Carlo approximation methods (k1, k2, k3) are commonly used to estimate the KL expectation and its gradient:


Formal Definitions

The Kullback-Leibler (KL) divergence of a distribution $q(x)$ from a reference distribution $p(x)$ is defined as:

$D_{KL}(q||p)=\mathbb{E}_{x\sim q}[\log\frac{q(x)}{p(x)}]$

As this expectation is often intractable, it is estimated from Monte Carlo samples. Given the importance ratio $\delta(x)=p(x)/q(x)$, common estimators for the term within the expectation include:
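
$k_1=\log\frac{q(x)}{p(x)}=-\log\delta(x)$

$k_2=\frac{1}{2}\left(\log\delta(x)\right)^2$

$k_3=\delta(x)-1-\log\delta(x)$

(These are the standard k1, k2, and k3 estimators popularized in John Schulman's note on approximating KL divergence; each is averaged over samples drawn from $q$.)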

What is Reverse KL? In simple terms, "Reverse KL" in RLHF means that the samples used to estimate the KL divergence are drawn from the current training policy $\pi_{\theta}$ (the $q$ distribution in the formula above), while the target distribution is the reference policy $\pi_{ref}$ (the $p$ distribution). Since we do sample from the current policy $\pi_{\theta}$ during RL training, the KL penalty used in practice is generally the Reverse KL (RKL).
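
To make this concrete, here is a minimal PyTorch sketch of how the three estimators are typically computed per token from the log-probabilities that the policy and the reference model assign to the sampled tokens (the function name, tensor shapes, and toy inputs below are illustrative assumptions, not code from any particular library):

```python
import torch

def kl_penalty_estimators(logprob_policy: torch.Tensor, logprob_ref: torch.Tensor):
    """Per-token estimates of the reverse KL, D_KL(pi_theta || pi_ref).

    Both tensors hold log-probabilities of the *sampled* tokens, which were
    drawn from the current policy pi_theta (the q distribution), so each
    estimator is a Monte Carlo estimate of the reverse KL against pi_ref (p).
    """
    # log(delta) = log(p/q) = log pi_ref(token) - log pi_theta(token)
    log_ratio = logprob_ref - logprob_policy

    k1 = -log_ratio                           # log(q/p): unbiased, high variance, can be negative
    k2 = 0.5 * log_ratio.pow(2)               # (1/2) * (log delta)^2: biased, low variance, >= 0
    k3 = log_ratio.exp() - 1.0 - log_ratio    # delta - 1 - log(delta): unbiased, low variance, >= 0
    return k1, k2, k3

# Toy usage with placeholder per-token log-probs of shape (batch, seq_len).
logp_policy = -torch.rand(2, 4) - 0.5   # hypothetical values, not real model outputs
logp_ref = -torch.rand(2, 4) - 0.5
k1, k2, k3 = kl_penalty_estimators(logp_policy, logp_ref)
print(k1.mean().item(), k2.mean().item(), k3.mean().item())
```

In a training loop, one of these per-token estimates is typically averaged over the sampled response and added to the policy loss with a KL coefficient.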


The Core Question

In critic-free algorithms such as GRPO or REINFORCE++-baseline, which of k1, k2, and k3 is the better choice for the KL loss term?

Gradient Analysis