Authors: Jian Hu, Mingjie Liu, Shizhe Diao, Ximing Lu, Xin Dong, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
First Published: August 11, 2025
One of the most compelling questions in AI is whether large language models (LLMs) can continue to improve through sustained reinforcement learning (RL), or if their capabilities will eventually plateau.
ProRLv2 is the latest evolution of our Prolonged Reinforcement Learning (ProRL) regime, specifically designed to test the effects of extended RL training on LLMs. Leveraging advanced algorithms, rigorous regularization, and comprehensive domain coverage, ProRLv2 pushes the boundaries well beyond typical RL training schedules. Our experiments systematically explore whether models can achieve measurable progress when subjected to thousands of additional RL steps.
Today, we're excited to announce the release of ProRLv2, building on the foundation of our earlier ProRL work. In this update, we'll explore its key innovations, advanced methods, and new empirical results that set a new state of the art, shedding light on how large language models can continue to learn and improve.
Most approaches—chain-of-thought prompting, tree search—help models better exploit knowledge they already possess. RL, especially with rigorous, programmatically verifiable rewards, holds the promise of pushing models into genuinely new territory. However, traditional short-horizon RL techniques often suffer from instability and quickly diminishing returns, earning a reputation as “temperature distillation” rather than a true enabler of boundary expansion.
ProRL fundamentally challenges this paradigm:
| Conventional RL training | What ProRL Does |
|---|---|
| A few hundred steps, one domain | 3,000+ steps, five domains |
| Entropy collapse, KL spikes | PPO-Clip, REINFORCE++-baseline, Clip-Higher, Dynamic Sampling, Reference resets |
| Risky reward model drift | Fully verifiable rewards |
| Verbose, lengthy outputs | Scheduled cosine length penalty (sketched below) |
Goal: Move beyond re-sampling familiar solutions to genuinely expanding what the model can discover.
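To make the last row of the table concrete, here is a minimal sketch of one way a scheduled cosine length penalty can be combined with a verifiable 0/1 correctness reward. Everything in it is an illustrative assumption rather than ProRLv2's exact recipe: the function names (`cosine_length_penalty`, `shaped_reward`), the penalty magnitude, the 8,192-token budget, and the simple on/off flag standing in for the schedule.

```python
import math

def cosine_length_penalty(num_tokens: int, max_tokens: int, max_penalty: float = 0.5) -> float:
    """Illustrative cosine-shaped length penalty (an assumption, not ProRLv2's exact formula).

    The penalty is 0 for an empty response and ramps smoothly up to
    `max_penalty` as the response length approaches `max_tokens`.
    """
    frac = min(num_tokens, max_tokens) / max_tokens
    return 0.5 * max_penalty * (1.0 - math.cos(math.pi * frac))

def shaped_reward(correct: bool, num_tokens: int,
                  max_tokens: int = 8192, penalty_active: bool = True) -> float:
    """Combine a verifiable 0/1 correctness reward with the length penalty.

    `penalty_active` stands in for the schedule: in this sketch the penalty
    is simply toggled on or off per training phase.
    """
    reward = 1.0 if correct else 0.0
    if penalty_active:
        reward -= cosine_length_penalty(num_tokens, max_tokens)
    return reward

# Example: a correct but verbose response earns less than a concise one.
print(shaped_reward(correct=True, num_tokens=1024))   # ~0.98
print(shaped_reward(correct=True, num_tokens=8192))   # 0.5
```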
At ProRL’s core is the clipped PPO loss, which stabilizes policy updates by restricting how much the new policy can diverge from the old one:
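(The form below is the standard clipped surrogate objective from PPO; the symbols $r_t(\theta)$, $\hat{A}_t$, and $\epsilon$ are the conventional ones rather than notation taken from the ProRL release.)

$$
\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$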