Authors: Jian Hu, Mingjie Liu, Shizhe Diao, Ximing Lu, Xin Dong, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
First Published: August 11, 2025
One of the most compelling questions in AI is whether large language models (LLMs) can continue to improve through sustained reinforcement learning (RL), or if their capabilities will eventually plateau.
ProRLv2 is the latest evolution of our Prolonged Reinforcement Learning (ProRL) regime, specifically designed to test the effects of extended RL training on LLMs. Leveraging advanced algorithms, rigorous regularization, and comprehensive domain coverage, ProRLv2 pushes the boundaries well beyond typical RL training schedules. Our experiments systematically explore whether models can achieve measurable progress when subjected to thousands of additional RL steps.
Today, we're excited to announce the release of ProRLv2, building on the foundation of our earlier ProRL work. In this update, we'll explore its key innovations, advanced methods, and new empirical results that set a new state of the art, shedding light on how large language models can continue to learn and improve.
Most approaches—chain-of-thought prompting, tree search—help models better exploit knowledge they already possess. RL, especially with rigorous, programmatically verifiable rewards, holds the promise of pushing models into genuinely new territory. However, traditional short-horizon RL techniques often suffer from instability and quickly diminishing returns, earning a reputation as “temperature distillation” rather than a true enabler of boundary expansion.
ProRL fundamentally challenges this paradigm:
| Conventional RL training | What ProRL Does |
|---|---|
| A few hundred steps, one domain | 3,000+ steps, five domains |
| Entropy collapse, KL spikes | PPO-Clip, REINFORCE++-baseline, Clip-Higher, Dynamic Sampling, Reference resets |
| Risky reward model drift | Fully verifiable rewards |
| Verbose, lengthy outputs | Scheduled cosine length penalty (sketched below) |
Goal: Move beyond re-sampling familiar solutions to genuinely expanding what the model can discover.
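To make the last row of the table concrete, here is a minimal sketch of one way a scheduled cosine length penalty can be combined with a verifiable 0/1 correctness reward. Everything in it is an illustrative assumption rather than ProRLv2's exact recipe: the function names (`cosine_length_penalty`, `shaped_reward`), the penalty magnitude, the 8,192-token budget, and the simple on/off flag standing in for the schedule.

```python
import math

def cosine_length_penalty(num_tokens: int, max_tokens: int, max_penalty: float = 0.5) -> float:
    """Illustrative cosine-shaped length penalty (an assumption, not ProRLv2's exact formula).

    The penalty is 0 for an empty response and ramps smoothly up to
    `max_penalty` as the response length approaches `max_tokens`.
    """
    frac = min(num_tokens, max_tokens) / max_tokens
    return 0.5 * max_penalty * (1.0 - math.cos(math.pi * frac))

def shaped_reward(correct: bool, num_tokens: int,
                  max_tokens: int = 8192, penalty_active: bool = True) -> float:
    """Combine a verifiable 0/1 correctness reward with the length penalty.

    `penalty_active` stands in for the schedule: in this sketch the penalty
    is simply toggled on or off per training phase.
    """
    reward = 1.0 if correct else 0.0
    if penalty_active:
        reward -= cosine_length_penalty(num_tokens, max_tokens)
    return reward

# Example: a correct but verbose response earns less than a concise one.
print(shaped_reward(correct=True, num_tokens=1024))   # ~0.98
print(shaped_reward(correct=True, num_tokens=8192))   # 0.5
```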
At ProRL’s core is the clipped PPO loss, which stabilizes policy updates by restricting how much the new policy can diverge from the old one:
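(The form below is the standard clipped surrogate objective from PPO; the symbols $r_t(\theta)$, $\hat{A}_t$, and $\epsilon$ are the conventional ones rather than notation taken from the ProRL release.)

$$
\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$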