Author: Jian Hu
First published on: 2024/12/26
Technical Report: https://www.researchgate.net/publication/387487679_REINFORCE_A_SIMPLE_AND_EFFICIENT_APPROACH_FOR_ALIGNING_LARGE_LANGUAGE_MODELS
RLHF (Reinforcement Learning from Human Feedback) is evolving rapidly, with algorithms such as PPO, DPO, RLOO, ReMax, and GRPO emerging one after another. By integrating various optimization techniques from Proximal Policy Optimization (PPO) into the classic REINFORCE algorithm, we “proposed” REINFORCE++, which aims to improve performance and stability in RLHF while reducing computational requirements by removing the critic network.
The key feature of REINFORCE++ is that it is more stable than GRPO and faster than PPO.
REINFORCE is a classic and simple policy gradient method in reinforcement learning, designed to maximize expected cumulative reward through direct policy optimization. The algorithm is based on Monte Carlo estimation and follows these key steps (sketched in code after the list):
Policy Sampling: The agent interacts with the environment according to its current policy, generating a sequence of states, actions, and rewards (trajectories).
Return Calculation: For each trajectory, returns are computed using discounted cumulative rewards: $G_{t} = \sum_{k=t+1}^{T} \gamma^{k-t} r_{k}$
Here, $\gamma$ is the discount factor, and $r_{k}$ is the immediate reward at time step $k$.
Gradient Estimation: The policy gradient is estimated using Monte Carlo methods, with respect to the policy parameters $\theta$: $\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi}\left[G_{t} \nabla_{\theta} \log \pi_{\theta}(A_{t} | S_{t})\right]$
Policy Update: The policy parameters are updated using gradient ascent: $\theta_{t+1} = \theta_{t} + \alpha \nabla_{\theta} J(\theta)$
where $\alpha$ is the learning rate.
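As a minimal sketch of how these four steps map to code, assuming a PyTorch policy whose rollout has already produced per-step log-probabilities and rewards (the function and argument names are illustrative, not from any specific library):

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE update from a single sampled trajectory.

    log_probs: list of log pi_theta(A_t | S_t) tensors collected during rollout
    rewards:   list of immediate rewards observed after each action
    """
    # Return calculation: discounted cumulative reward G_t for every step,
    # via the common recursion G_t = r_t + gamma * G_{t+1}
    # (index conventions vary slightly across texts)
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Gradient estimation: loss is -sum_t G_t * log pi_theta(A_t | S_t),
    # so gradient descent on this loss is gradient ascent on J(theta)
    loss = -(torch.stack(log_probs) * returns).sum()

    # Policy update with the learning rate alpha baked into the optimizer
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```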
To stabilize model training, several optimization tricks are integrated into REINFORCE:
The KL divergence between the response distributions of the RL model and the SFT model is computed for each token and added as a penalty term to the reward during training. Specifically, the per-token reward is:
$$ r(s_t, a_t) = \textbf{I}(s_t =[\text{EOS}])r(x,y)-\beta \text{KL}(t) \ \ \ (1) $$
$$ \text{KL}(t) = \log\left(\frac{\pi^{\text{RL}}_{\theta_{\text{old}}}(a_t|s_t)}{\pi^{\text{SFT}}(a_t|s_t)}\right)\ \ \ (2) $$
where $x$ is the prompt, $y$ is the response, and $\textbf{I}(s_t = [\text{EOS}])$ is the indicator function that equals 1 only when $t$ is the last token of the response.
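As a rough sketch of how Eq. (1) and Eq. (2) can be assembled in code, assuming the (detached) per-token log-probabilities of the sampled response under the actor and the frozen SFT model are already available; the function and argument names are illustrative, not from any specific library:

```python
import torch

def token_level_rewards(actor_logps, sft_logps, sequence_reward, beta=0.01):
    """Per-token reward r(s_t, a_t) = I(s_t = [EOS]) * r(x, y) - beta * KL(t).

    actor_logps, sft_logps: (response_len,) log-probs of the sampled tokens
    sequence_reward: scalar reward r(x, y) from the reward model
    """
    # Eq. (2): per-token KL estimate log(pi_RL / pi_SFT)
    kl = actor_logps - sft_logps

    # Eq. (1): KL penalty on every token, sequence reward only on the EOS token
    rewards = -beta * kl
    rewards[-1] += sequence_reward
    return rewards
```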
The advantage of this token-level KL is that it integrates seamlessly with Process Reward Models (PRMs) and enables credit assignment: one only needs to add $r^{\text{process}}(s_t, a_t)$ at the positions of the reward tokens.
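To illustrate the credit-assignment point, a hypothetical extension of the sketch above that also adds $r^{\text{process}}(s_t, a_t)$ at the PRM-scored positions; `prm_positions` and `process_rewards` are assumed inputs for illustration only:

```python
def token_level_rewards_with_prm(actor_logps, sft_logps, sequence_reward,
                                 prm_positions, process_rewards, beta=0.01):
    """Token-level KL penalty plus process rewards at PRM-scored positions."""
    rewards = -beta * (actor_logps - sft_logps)   # Eq. (2): per-token KL penalty
    rewards[-1] += sequence_reward                # outcome reward r(x, y) on EOS
    for pos, r_proc in zip(prm_positions, process_rewards):
        rewards[pos] += r_proc                    # r_process at each reward token
    return rewards
```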