Author: Jian Hu
First published on: 2024/12/26
Technical Report: https://www.researchgate.net/publication/387487679_REINFORCE_A_SIMPLE_AND_EFFICIENT_APPROACH_FOR_ALIGNING_LARGE_LANGUAGE_MODELS
RLHF (Reinforcement Learning from Human Feedback) is evolving rapidly, with algorithms such as PPO, DPO, RLOO, ReMax, and GRPO emerging one after another. By integrating various optimization techniques from Proximal Policy Optimization (PPO) into the classic REINFORCE algorithm, we “proposed” REINFORCE++, which aims to improve performance and stability in RLHF while reducing computational requirements by removing the critic network.
The key feature of REINFORCE++ is that it is more stable than GRPO and faster than PPO.
REINFORCE is a classic and simple policy gradient method in reinforcement learning, designed to maximize expected cumulative reward through direct policy optimization. The algorithm is based on Monte Carlo estimation and follows these key steps (sketched in code after the list):
Policy Sampling: The agent interacts with the environment according to its current policy, generating a sequence of states, actions, and rewards (trajectories).
Return Calculation: For each trajectory, returns are computed using discounted cumulative rewards: $G_{t} = \sum_{k=t+1}^{T} \gamma^{k-t} r_{k}$
Here, $\gamma$ is the discount factor, and $r_{k}$ is the immediate reward at time step $k$.
Gradient Estimation: The policy gradient is estimated using Monte Carlo methods, with respect to the policy parameters $\theta$: $\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi}\left[G_{t} \nabla_{\theta} \log \pi_{\theta}(A_{t} | S_{t})\right]$
Policy Update: The policy parameters are updated using gradient ascent: $\theta_{t+1} = \theta_{t} + \alpha \nabla_{\theta} J(\theta)$
where $\alpha$ is the learning rate.
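As a minimal sketch of how these four steps map to code, assuming a PyTorch policy whose rollout has already produced per-step log-probabilities and rewards (the function and argument names are illustrative, not from any specific library):

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE update from a single sampled trajectory.

    log_probs: list of log pi_theta(A_t | S_t) tensors collected during rollout
    rewards:   list of immediate rewards observed after each action
    """
    # Return calculation: discounted cumulative reward G_t for every step,
    # via the common recursion G_t = r_t + gamma * G_{t+1}
    # (index conventions vary slightly across texts)
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Gradient estimation: loss is -sum_t G_t * log pi_theta(A_t | S_t),
    # so gradient descent on this loss is gradient ascent on J(theta)
    loss = -(torch.stack(log_probs) * returns).sum()

    # Policy update with the learning rate alpha baked into the optimizer
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```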
To stabilize model training, several optimization tricks are integrated into REINFORCE:
The KL divergence between the response distributions of the RL model and the SFT model is computed for each token and added as a penalty term to the reward during training. Specifically, the per-token reward is:
$$ r(s_t, a_t) = \textbf{I}(s_t =[\text{EOS}])r(x,y)-\beta \text{KL}(t) \ \ \ (1) $$
$$ \text{KL}(t) = \log\left(\frac{\pi^{\text{RL}}_{\theta_{\text{old}}}(a_t|s_t)}{\pi^{\text{SFT}}(a_t|s_t)}\right)\ \ \ (2) $$
where $x$ is the prompt, $y$ is the response, and $\textbf{I}(s_t = [\text{EOS}])$ is the indicator function that equals 1 only when $t$ is the last token of the response.
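As a rough sketch of how Eq. (1) and Eq. (2) can be assembled in code, assuming the (detached) per-token log-probabilities of the sampled response under the actor and the frozen SFT model are already available; the function and argument names are illustrative, not from any specific library:

```python
import torch

def token_level_rewards(actor_logps, sft_logps, sequence_reward, beta=0.01):
    """Per-token reward r(s_t, a_t) = I(s_t = [EOS]) * r(x, y) - beta * KL(t).

    actor_logps, sft_logps: (response_len,) log-probs of the sampled tokens
    sequence_reward: scalar reward r(x, y) from the reward model
    """
    # Eq. (2): per-token KL estimate log(pi_RL / pi_SFT)
    kl = actor_logps - sft_logps

    # Eq. (1): KL penalty on every token, sequence reward only on the EOS token
    rewards = -beta * kl
    rewards[-1] += sequence_reward
    return rewards
```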
The advantage of this token-level KL is that it integrates seamlessly with Process Reward Models (PRMs) and enables credit assignment: one only needs to add $r^{\text{process}}(s_t, a_t)$ at the positions of the reward tokens.
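To illustrate the credit-assignment point, a hypothetical extension of the sketch above that also adds $r^{\text{process}}(s_t, a_t)$ at the PRM-scored positions; `prm_positions` and `process_rewards` are assumed inputs for illustration only:

```python
def token_level_rewards_with_prm(actor_logps, sft_logps, sequence_reward,
                                 prm_positions, process_rewards, beta=0.01):
    """Token-level KL penalty plus process rewards at PRM-scored positions."""
    rewards = -beta * (actor_logps - sft_logps)   # Eq. (2): per-token KL penalty
    rewards[-1] += sequence_reward                # outcome reward r(x, y) on EOS
    for pos, r_proc in zip(prm_positions, process_rewards):
        rewards[pos] += r_proc                    # r_process at each reward token
    return rewards
```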