Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) [1] is a powerful approach for fine-tuning Large Language Models (LLMs). This method uses the PPO algorithm, which is reliable and efficient, together with feedback from human evaluators to improve the quality of model-generated responses. However, training LLMs with PPO presents several challenges, including keeping the training process stable and achieving better performance than Direct Preference Optimization (DPO) [2]. We have therefore summarized practical training tricks for RLHF with PPO to help researchers fine-tune LLMs more easily, ensuring both training stability and strong performance.
<aside> 💡 For detailed information on PPO, refer to our other blog, System, Mathematics and Code in TRL PPO.
</aside>
We present three types of PPO training techniques: 1) LLM-specific tricks, 2) PPO-specific tricks, and 3) innovative strategies from recent research. The LLM-specific and PPO-specific tricks have been implemented in various RL frameworks [3, 4] and have proven effective. However, the task-specific applicability of the innovative strategies proposed in recent papers has not yet been verified.
Token-level KL Penalty: The per-token reward combines a KL penalty between the RL policy and the SFT policy with the sequence-level reward, which is assigned only at the final (EOS) token:

$$ r(s_t, a_t) = \textbf{I}(s_t =[\text{EOS}])\, r(x,y)-\beta\, \text{KL}(t) \ \ \ (1) $$

$$ \text{KL}(t) = \log\left({\pi^{\text{RL}}_{\theta_{\text{old}}}(a_t|s_t)}/{\pi^{\text{SFT}}(a_t|s_t)}\right)\ \ \ (2) $$

where $x$ is the prompt, $y$ is the response, and $\textbf{I}(s_t = [\text{EOS}])$ is the indicator function of whether $t$ is the last token, i.e., whether $s_t$ is the EOS token.
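As a concrete illustration, here is a minimal PyTorch-style sketch of this per-token reward shaping; the tensor names (`log_probs_rl`, `log_probs_sft`, `seq_reward`) and the default `beta` are illustrative assumptions, not OpenRLHF's actual API.

```python
import torch

def token_rewards(log_probs_rl: torch.Tensor,   # (T,) log pi_RL(a_t|s_t) for the generated tokens
                  log_probs_sft: torch.Tensor,  # (T,) log pi_SFT(a_t|s_t) for the same tokens
                  seq_reward: float,            # scalar reward r(x, y) from the reward model
                  beta: float = 0.02) -> torch.Tensor:
    """Compute r(s_t, a_t) = I(s_t = EOS) * r(x, y) - beta * KL(t), as in Eqs. (1)-(2)."""
    kl = log_probs_rl - log_probs_sft   # per-token KL estimate, Eq. (2)
    rewards = -beta * kl                # KL penalty applied at every token
    rewards[-1] += seq_reward           # sequence-level reward added only at the last (EOS) token
    return rewards
```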
Generalized Advantage Estimation (GAE): GAE [10], a $\text{TD}(\lambda)$ return estimation method, is used to estimate token-wise advantages and returns in PPO. In practice, we typically set $\lambda = 1$, which turns GAE into a Monte Carlo estimate.
Setting $\lambda$ of GAE to 1 reduces the bias introduced by the value network, which is especially helpful when using a rule-based reward model.
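Below is a minimal sketch of the GAE recursion for a single response (the helper and its arguments are hypothetical, not taken from a specific framework); with `lam = 1.0` the advantage collapses to the Monte Carlo return minus the value baseline.

```python
import torch

def gae_advantages(rewards: torch.Tensor,   # (T,) per-token rewards from Eq. (1)
                   values: torch.Tensor,    # (T,) critic value estimates V(s_t)
                   gamma: float = 1.0,
                   lam: float = 1.0):
    """Generalized Advantage Estimation; lam=1 yields Monte Carlo returns minus the value baseline."""
    T = rewards.size(0)
    advantages = torch.zeros_like(rewards)
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0     # bootstrap value, 0 after the last token
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        last_adv = delta + gamma * lam * last_adv            # GAE recursion
        advantages[t] = last_adv
    returns = advantages + values                            # TD(lambda) return targets for the critic
    return advantages, returns
```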
Adding SFT Loss: Incorporating an additional supervised next-token prediction loss, alongside the KL divergence, into PPO can preserve the pre-existing abilities of the SFT model.
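A hedged sketch of how such a combined objective might look; the mixing coefficient `sft_coef` and the assumption that SFT labels use `-100` for masked positions are illustrative choices, not a prescribed recipe.

```python
import torch.nn.functional as F

def combined_loss(ppo_policy_loss, logits, sft_labels, sft_coef=0.1):
    """Add a supervised next-token prediction loss on SFT data to the PPO policy loss."""
    # Standard causal LM cross-entropy on the SFT batch (labels shifted by one token).
    sft_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        sft_labels[:, 1:].reshape(-1),
        ignore_index=-100,   # prompt / padding positions masked out
    )
    return ppo_policy_loss + sft_coef * sft_loss
```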
Token-level loss: When sample packing is enabled, OpenRLHF removes padding from training samples, ensuring that the loss is averaged at the token level rather than per sample, which leads to improved final performance.
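The sketch below contrasts token-level averaging with per-sample averaging; the function names and the 0/1 response mask are illustrative assumptions rather than OpenRLHF's actual implementation.

```python
def token_level_loss(per_token_loss, mask):
    """Average over all valid tokens in the (packed) batch; mask is 1 for response tokens, 0 otherwise."""
    return (per_token_loss * mask).sum() / mask.sum()

def sample_level_loss(per_token_loss, mask):
    """Average within each sample first, then across samples (the behavior token-level loss avoids)."""
    per_sample = (per_token_loss * mask).sum(dim=-1) / mask.sum(dim=-1)
    return per_sample.mean()
```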
Model Initialization: When training LLMs with PPO, it is essential to initialize two models: the actor model and the critic model [6, 7]. Specifically, initializing the actor model with a Supervised Fine-Tuning (SFT) model and the critic model with a reward model ensures efficient PPO training.
Code Link: https://github.com/OpenRLHF/OpenRLHF/blob/188139f809d9d14a8b1d8210f9e6746e2422e4e0/examples/train_ppo.py#L39.
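A minimal sketch of this initialization using generic Hugging Face loaders; the checkpoint paths are placeholders, and the critic is shown with a single-output classification head for brevity, whereas real implementations typically attach a token-level value head to the reward model backbone.

```python
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

# Actor: initialized from the SFT checkpoint so PPO starts from an already-aligned policy.
actor = AutoModelForCausalLM.from_pretrained("path/to/sft-model")        # placeholder path

# Frozen reference policy, used only for the KL penalty in Eq. (2).
ref_policy = AutoModelForCausalLM.from_pretrained("path/to/sft-model").requires_grad_(False)

# Critic: initialized from the reward model so value estimates start close to the reward scale.
critic = AutoModelForSequenceClassification.from_pretrained("path/to/reward-model", num_labels=1)
```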
Adam Learning Rate: The Adam learning rate for the actor model is approximately one-tenth of that used for the SFT model. For instance, in OpenRLHF, the Adam learning rate for the SFT model is $5 \times 10^{-6}$, while for the actor model it is $5 \times 10^{-7}$. Additionally, the Adam learning rate for the critic model is approximately twice that of the SFT model, with an example rate of $9 \times 10^{-6}$.
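For concreteness, a self-contained sketch of the corresponding optimizer setup; the `torch.nn.Linear` modules are placeholders standing in for the actor and critic models above.

```python
import torch

# Placeholder modules standing in for the actor and critic initialized earlier.
actor = torch.nn.Linear(8, 8)
critic = torch.nn.Linear(8, 1)

# Actor LR is ~1/10 of the SFT LR (5e-6 -> 5e-7); critic LR is ~2x the SFT LR (~9e-6).
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=5e-7)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=9e-6)
```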