0 - Introduction

Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) [1] is a powerful approach for fine-tuning Large Language Models (LLMs). This method combines the PPO algorithm, which is reliable and efficient, with feedback from human evaluators to improve the quality of model-generated responses. However, training LLMs with PPO presents several challenges, including keeping the training process stable and achieving better performance than Direct Preference Optimization (DPO) [2]. We have therefore summarized practical training tricks for RLHF with PPO to help researchers fine-tune LLMs more easily, ensuring both training stability and high performance.

<aside> 💡 For detailed information on PPO, refer to our other blog, System, Mathematics and Code in TRL PPO.

</aside>

1 - Advanced Tricks for Training LLM with PPO

We present three types of PPO training techniques: 1) LLM-specific tricks, 2) PPO-specific tricks, and 3) innovative strategies from recent research. The LLM-specific and PPO-specific tricks have been implemented in various RL frameworks [3, 4] and have proven effective. However, the task-specific applicability of the innovative strategies proposed in recent papers has not yet been verified.

1.1 - LLM-specific Tricks

Token-level KL penalty. The reward model only produces a scalar score $r(x, y)$ for the whole response, so the per-token reward is shaped as follows: the score is assigned to the final [EOS] token, and every generated token is penalized by the KL divergence between the RL policy and the SFT policy,

$$ r(s_t, a_t) = \textbf{I}(s_t =[\text{EOS}])r(x,y)-\beta \text{KL}(t) \ \ \ (1) $$

$$ \text{KL}(t) = \log\left({\pi^{\text{RL}}_{\theta_{\text{old}}}(a_t|s_t)}/{\pi^{\text{SFT}}(a_t|s_t)}\right)\ \ \ (2) $$

where $x$ is the prompt, $y$ is the response, and $\textbf{I}(s_t = [\text{EOS}])$ is the indicator function that equals 1 when $s_t$ is the final [EOS] token and 0 otherwise.

Code Link: https://github.com/OpenRLHF/OpenRLHF/blob/f8bfc76f1fc6fcf43241104dbee144a3be51ee93/openrlhf/models/utils.py#L56.
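The linked OpenRLHF function implements this reward shaping. As a rough illustration, a minimal PyTorch sketch of Eq. (1)-(2) might look like the following; the function name `shape_rewards`, the tensor shapes, and the default `beta` are illustrative assumptions rather than the exact OpenRLHF API:

```python
# Minimal sketch of Eq. (1)-(2): token-level KL penalty plus the sequence-level
# reward added at the final (EOS) token. Names and shapes are illustrative;
# see the linked OpenRLHF code for the exact implementation.
import torch


def shape_rewards(
    policy_log_probs: torch.Tensor,  # log pi^RL(a_t | s_t), shape (batch, seq_len)
    sft_log_probs: torch.Tensor,     # log pi^SFT(a_t | s_t), shape (batch, seq_len)
    reward_score: torch.Tensor,      # scalar reward r(x, y) per sequence, shape (batch,)
    action_mask: torch.Tensor,       # 1 for response tokens, 0 for prompt/padding
    beta: float = 0.01,              # KL penalty coefficient (assumed default)
) -> torch.Tensor:
    # Eq. (2): per-token KL estimate between the RL policy and the SFT policy.
    kl = (policy_log_probs - sft_log_probs) * action_mask

    # Eq. (1): every response token receives -beta * KL(t) ...
    rewards = -beta * kl

    # ... and the sequence-level reward r(x, y) is added only at the last
    # response token, i.e. where I(s_t = [EOS]) = 1.
    eos_indices = action_mask.size(1) - 1 - action_mask.flip(dims=[1]).argmax(dim=1)
    rewards[torch.arange(rewards.size(0)), eos_indices] += reward_score
    return rewards
```

In practice the KL term is an estimate based on the sampled tokens, and implementations may additionally clip the reward score for stability.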


1.2 - PPO-specific Tricks