Authors: Jian Hu, Mingjie Liu, Shizhe Diao, Ximing Lu, Xin Dong, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong

First Published: August 11, 2025


Introduction

One of the most compelling questions in AI is whether large language models (LLMs) can continue to improve through sustained reinforcement learning (RL), or if their capabilities will eventually plateau.

ProRLv2 is the latest evolution of our Prolonged Reinforcement Learning (ProRL) regime, specifically designed to test the effects of extended RL training on LLMs. Leveraging advanced algorithms, rigorous regularization, and comprehensive domain coverage, ProRLv2 pushes the boundaries well beyond typical RL training schedules. Our experiments systematically explore whether models can achieve measurable progress when subjected to thousands of additional RL steps.

Today, we're excited to announce the release of ProRLv2, which builds on the foundation of our earlier ProRL work. In this update, we'll explore its key innovations, advanced methods, and new empirical results that set a new state of the art, shedding light on how large language models can continue to learn and improve.

What Sets ProRL Apart?

Most approaches, such as chain-of-thought prompting and tree search, help models better exploit knowledge they already possess. RL, especially with rigorous, programmatically verifiable rewards, holds the promise of pushing models into genuinely new territory. However, traditional short-horizon RL techniques often suffer from instability and quickly diminishing returns, earning a reputation as “temperature distillation” rather than a true enabler of boundary expansion.

ProRL fundamentally challenges this paradigm:

Comparison Table

| Conventional RL training | What ProRL Does |
| --- | --- |
| Few-hundred steps, one domain | 3,000+ steps, five domains |
| Entropy collapse, KL spikes | PPO-Clip, REINFORCE++-baseline, Clip-Higher, Dynamic Sampling, Reference resets |
| Risky reward model drift | Fully verifiable rewards |
| Verbose, lengthy outputs | Scheduled cosine length penalty (sketched below) |

Goal: Move beyond re-sampling familiar solutions to genuinely expanding what the model can discover.
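
To make the verifiable-reward and length-penalty rows concrete, here is a minimal sketch in Python. It assumes a binary pass/fail reward from a programmatic checker and a cosine-shaped penalty that grows with response length; the function names and the `max_length` and `penalty_scale` values are illustrative assumptions, and the scheduling of when the penalty is switched on during training is not shown.

```python
import math


def cosine_length_penalty(length: int, max_length: int, penalty_scale: float) -> float:
    """Cosine-shaped penalty that rises smoothly from 0 (short responses)
    to `penalty_scale` as the response approaches `max_length`."""
    ratio = min(length, max_length) / max_length
    return penalty_scale * 0.5 * (1.0 - math.cos(math.pi * ratio))


def shaped_reward(is_correct: bool, length: int,
                  max_length: int = 8192, penalty_scale: float = 0.5) -> float:
    """Binary verifiable reward minus the length penalty, so a correct but
    verbose answer scores lower than a correct, concise one."""
    base = 1.0 if is_correct else 0.0
    return base - cosine_length_penalty(length, max_length, penalty_scale)


# Example: a correct 2,000-token answer vs. a correct 8,000-token answer.
print(shaped_reward(True, 2000))   # ~0.93
print(shaped_reward(True, 8000))   # ~0.50
```

Because the penalty is smooth and bounded, it discourages verbosity without ever pushing a correct answer below an incorrect one as long as `penalty_scale` stays under 1.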

Core Techniques: ProRL Algorithms & Regularizers

1. Proximal Policy Optimization (PPO-Clip) with REINFORCE++-baseline

At ProRL’s core is the clipped PPO loss, which stabilizes policy updates by restricting how much the new policy can diverge from the old one: