Author: Jian Hu

First published on: 2024/12/26

Technical Report: https://www.researchgate.net/publication/387487679_REINFORCE_A_SIMPLE_AND_EFFICIENT_APPROACH_FOR_ALIGNING_LARGE_LANGUAGE_MODELS

Introduction

RLHF (Reinforcement Learning from Human Feedback) is evolving rapidly, with algorithms such as PPO, DPO, RLOO, ReMax, and GRPO emerging one after another. By integrating various optimization techniques from Proximal Policy Optimization (PPO) into the traditional REINFORCE algorithm, we propose REINFORCE++, which aims to improve performance and stability in RLHF while reducing computational requirements by eliminating the critic network.

The key feature of REINFORCE++ is that it is more stable than GRPO and faster than PPO.

What is REINFORCE?

REINFORCE is a classic and simple policy gradient method in reinforcement learning designed to maximize the expected cumulative reward through direct policy optimization. The algorithm operates based on Monte Carlo methods, following these key steps:

1. Policy sampling: the agent interacts with the environment under the current policy to generate complete trajectories of states, actions, and rewards.
2. Return calculation: the discounted cumulative return $G_t$ is computed for every step of each trajectory.
3. Gradient estimation: the policy gradient is estimated from the log-probabilities of the taken actions, weighted by their returns.
4. Policy update: the policy parameters are updated by gradient ascent on the estimated objective.
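To make the above concrete, here is a minimal sketch of one REINFORCE update over a single sampled trajectory, assuming a PyTorch policy network that maps state tensors to action logits. All names here (`policy`, `reinforce_update`, `trajectory`) are illustrative rather than taken from any particular codebase.

```python
# Minimal REINFORCE sketch (illustrative; assumes a PyTorch policy network).
import torch


def reinforce_update(policy, optimizer, trajectory, gamma=0.99):
    """One REINFORCE step over a single trajectory sampled with `policy`.

    trajectory: list of (state, action, reward) tuples, where each state is a
    tensor, collected by a Monte Carlo rollout (the "policy sampling" step).
    """
    states, actions, rewards = zip(*trajectory)

    # Return calculation: discounted cumulative return G_t for each step.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # Gradient estimation: log pi(a_t | s_t) of the taken actions, weighted by G_t.
    logits = policy(torch.stack(states))                        # [T, num_actions]
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, torch.tensor(actions).unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()                           # minimize -J(theta)

    # Policy update: gradient ascent on the expected return.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```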

Key Implementation Tricks in REINFORCE++

To stabilize model training, REINFORCE++ integrates the following optimization tricks into the vanilla REINFORCE algorithm:

Token Level KL-Penalty

The KL-Divergence between the response distributions of the RL model and the SFT model is calculated for each token. This divergence is then incorporated as a penalty term in the reward function during training. Specifically, the per-token reward is represented as follows:

$$ r(s_t, a_t) = \textbf{I}(s_t =[\text{EOS}])r(x,y)-\beta \text{KL}(t) \ \ \ (1) $$

$$ \text{KL}(t) = \log\left(\frac{\pi^{\text{RL}}_{\theta_{\text{old}}}(a_t|s_t)}{\pi^{\text{SFT}}(a_t|s_t)}\right) \ \ \ (2) $$

where $x$ is the prompt, $y$ is the response, and $\textbf{I}(s_t = [\text{EOS}])$ is the indicator function that equals 1 only when the current token is the final [EOS] token of the response.

The advantage of this token-level KL penalty is that it integrates seamlessly with Process Reward Models (PRM) and naturally achieves credit assignment: one only needs to add the process reward $r^{\text{process}}(s_t, a_t)$ at the position of the corresponding reward token.
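As a reference, here is a minimal sketch of how the per-token reward in Eq. (1)-(2) can be assembled from per-token log-probabilities, assuming those log-probabilities have already been gathered (under `torch.no_grad()`) for the sampled response tokens; the function and argument names are illustrative, not from any particular implementation.

```python
# Minimal sketch of the token-level KL penalty in Eq. (1)-(2) (illustrative).
import torch


def token_level_rewards(rl_log_probs, sft_log_probs, sequence_reward, beta=0.01):
    """Per-token rewards r(s_t, a_t) for one sampled response.

    rl_log_probs, sft_log_probs: [T] tensors of log pi(a_t | s_t) under the
    RL (old) policy and the frozen SFT reference policy, computed without grad.
    sequence_reward: scalar r(x, y) from the reward model for the full response.
    """
    # Eq. (2): per-token KL estimate, log(pi_RL / pi_SFT) = log pi_RL - log pi_SFT.
    kl = rl_log_probs - sft_log_probs

    # Eq. (1): -beta * KL(t) at every token; r(x, y) is added only at the
    # final [EOS] token of the response.
    rewards = -beta * kl
    rewards[-1] += sequence_reward

    # With a Process Reward Model, r^process(s_t, a_t) would be added at the
    # positions of the corresponding reward tokens in the same way.
    return rewards


# Example: a 5-token response with r(x, y) = 1.2.
rl_lp = torch.log(torch.tensor([0.30, 0.25, 0.40, 0.20, 0.90]))
sft_lp = torch.log(torch.tensor([0.28, 0.30, 0.35, 0.22, 0.85]))
print(token_level_rewards(rl_lp, sft_lp, sequence_reward=1.2))
```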