Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) [1] is a powerful approach for fine-tuning Large Language Models (LLMs). This method combines the PPO algorithm, which is reliable and efficient, with feedback from human evaluators to improve the quality of model-generated responses. However, training LLMs with PPO presents several challenges, including maintaining a stable training process and achieving better performance than Direct Preference Optimization (DPO) [2]. Consequently, we have summarized practical training tricks for RLHF with PPO to help researchers fine-tune LLMs more easily, ensuring both training stability and high performance.
<aside> 💡 For detailed information on PPO, refer to our other blog, System, Mathematics and Code in TRL PPO.
</aside>
We present three types of PPO training techniques: 1) LLM-specific tricks, 2) PPO-specific tricks, and 3) innovative strategies from recent research. The LLM-specific and PPO-specific tricks have been implemented in various RL frameworks [3, 4] and have shown effectiveness. However, the task-specific applicability of the innovative strategies proposed in recent papers remains unverified.
Token-level KL Penalty: A per-token KL penalty between the RL policy and the SFT policy is subtracted from the reward, while the reward-model score $r(x, y)$ is assigned only at the final (EOS) token:

$$ r(s_t, a_t) = \textbf{I}(s_t =[\text{EOS}])\,r(x,y)-\beta\, \text{KL}(t) \ \ \ (1) $$
$$ \text{KL}(t) = \log\left(\pi^{\text{RL}}_{\theta_{\text{old}}}(a_t|s_t)\,/\,\pi^{\text{SFT}}(a_t|s_t)\right) \ \ \ (2) $$
where $x$ is the prompt, $y$ is the response, and $\textbf{I}(s_t = [\text{EOS}])$ is the indicator function that equals 1 only when $s_t$ is the final (EOS) token.
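To make Equations (1) and (2) concrete, here is a minimal PyTorch sketch of how per-token rewards can be assembled: the KL penalty is charged at every token, and the reward-model score is added only at the EOS position. The tensor names, the `eos_mask` convention, and the default $\beta$ are illustrative assumptions rather than any specific framework's API.

```python
import torch

def token_level_rewards(rm_score, logprobs_rl, logprobs_sft, eos_mask, beta=0.01):
    """Per-token reward from Eq. (1): KL penalty everywhere, RM score only at EOS.

    rm_score:     (batch,)      scalar reward r(x, y) from the reward model
    logprobs_rl:  (batch, seq)  log pi_RL(a_t | s_t) of the actor being trained
    logprobs_sft: (batch, seq)  log pi_SFT(a_t | s_t) of the frozen SFT model
    eos_mask:     (batch, seq)  1.0 at the final (EOS) token of each response, else 0.0
    """
    kl = logprobs_rl - logprobs_sft                         # KL(t) estimate per token, Eq. (2)
    rewards = -beta * kl                                    # KL penalty applied at every step
    rewards = rewards + eos_mask * rm_score.unsqueeze(-1)   # add r(x, y) at the EOS token
    return rewards
```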
Generalized Advantage Estimation (GAE): GAE [10], a $\text{TD}(\lambda)$ return estimation method, is used to estimate the token-wise advantage in PPO. In practice, we typically set $\lambda = 1$, which reduces GAE to Monte Carlo return estimation.
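The sketch below shows the GAE recursion for a single response under these per-token rewards; with `gamma = 1` and `lam = 1` it collapses to a Monte Carlo return minus the value baseline. Function and argument names are hypothetical.

```python
import torch

def compute_gae(rewards, values, gamma=1.0, lam=1.0):
    """GAE over one response: rewards and values are (seq,) tensors.

    With lam = 1, the recursion reduces to the Monte Carlo return minus V(s_t).
    """
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(rewards.shape[0])):
        next_value = values[t + 1] if t + 1 < rewards.shape[0] else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last_gae = delta + gamma * lam * last_gae            # GAE recursion
        advantages[t] = last_gae
    returns = advantages + values                            # regression targets for the critic
    return advantages, returns
```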
Adding SFT Loss: Incorporating an additional supervised next-token prediction loss, alongside the KL divergence, into PPO can preserve the pre-existing abilities of the SFT model.
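A hedged sketch of this combination is shown below: an auxiliary cross-entropy (next-token prediction) loss computed on SFT-style data is added to the PPO policy loss. The coefficient name and value (`sft_coef=0.1`) are assumptions for illustration, not a fixed recommendation.

```python
import torch.nn.functional as F

def actor_loss_with_sft(ppo_loss, sft_logits, sft_labels, sft_coef=0.1):
    """Combine the PPO policy loss with an auxiliary next-token prediction (SFT) loss.

    ppo_loss:   scalar clipped-surrogate loss computed on rollout data
    sft_logits: (batch, seq, vocab) actor logits on SFT/pretraining examples
    sft_labels: (batch, seq) target token ids, with -100 marking positions to ignore
    """
    sft_loss = F.cross_entropy(
        sft_logits[:, :-1].reshape(-1, sft_logits.size(-1)),  # predict token t+1 from position t
        sft_labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return ppo_loss + sft_coef * sft_loss
```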
Model Initialization: When training LLMs with PPO, it is essential to initialize two models: the actor model and the critic model [6, 7]. Specifically, initializing the actor model with a Supervised Fine-Tuning (SFT) model and the critic model with a reward model ensures efficient PPO training.
Code Link: https://github.com/OpenRLHF/OpenRLHF/blob/188139f809d9d14a8b1d8210f9e6746e2422e4e0/examples/train_ppo.py#L39.
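For illustration, a minimal initialization sketch using the Hugging Face `transformers` API is shown below. The checkpoint names are hypothetical placeholders; in practice a frozen copy of the SFT model is also kept as the reference policy for the KL penalty in Eq. (2).

```python
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

# Hypothetical checkpoint names; replace with your own SFT and reward-model paths.
actor = AutoModelForCausalLM.from_pretrained("my-org/llm-sft")        # actor <- SFT model
ref_model = AutoModelForCausalLM.from_pretrained("my-org/llm-sft")    # frozen SFT copy for KL(t)
critic = AutoModelForSequenceClassification.from_pretrained(
    "my-org/llm-reward-model", num_labels=1                           # critic <- reward model (scalar head)
)
```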
Adam Learning Rate: The Adam learning rate for the actor model is approximately one-tenth of that used for the SFT model. For instance, in OpenRLHF, the Adam learning rate for the SFT model is $5 \times 10^{-6}$, while for the actor model it is $5 \times 10^{-7}$. Additionally, the Adam learning rate for the critic model is approximately twice that of the SFT model, with an example rate of $9 \times 10^{-6}$.
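Assuming the `actor` and `critic` modules from the sketch above, the optimizer setup might look like the following; the exact optimizer variant and learning-rate schedule differ across frameworks.

```python
import torch

# Rates mirror the OpenRLHF example above: SFT 5e-6, actor ~1/10 of that, critic ~2x of that.
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=5e-7)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=9e-6)
```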
Mini-batch Updates: During the learning phase, the PPO implementation shuffles the indices of the training data, which is of size $N \times M$ (where $N$ is the size of the replay buffer and $M$ is the length of each response), and breaks it into mini-batches to compute the gradient and update the policy.
Code Link: https://github.com/OpenRLHF/OpenRLHF/blob/9d8b3fdac345f6a18b37d73c53bfb95a652d1db2/openrlhf/trainer/ppo_trainer.py#L216.
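A simplified sketch of this shuffle-and-split loop is given below; `buffer`, `policy_update_fn`, and the micro-batch size are hypothetical names rather than OpenRLHF's actual interface.

```python
import torch

def ppo_minibatch_update(buffer, policy_update_fn, micro_batch_size=8):
    """Shuffle the flattened replay buffer and update the policy mini-batch by mini-batch.

    buffer:           list of rollout samples (N responses x M tokens after flattening)
    policy_update_fn: callable that computes the PPO loss on a mini-batch and steps the optimizer
    """
    indices = torch.randperm(len(buffer)).tolist()  # shuffle sample indices
    for start in range(0, len(indices), micro_batch_size):
        mini_batch = [buffer[i] for i in indices[start:start + micro_batch_size]]
        policy_update_fn(mini_batch)
```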