👋 Welcome to Jian Hu’s Blog

Email: [email protected]

Homepage: https://hujian.website/

LinkedIn: https://www.linkedin.com/in/jian-hu-060979238

Google Scholar: https://scholar.google.com/citations?user=-xt5vGkAAAAJ

GitHub: https://github.com/hijkzzz

Zhihu: https://www.zhihu.com/people/chu-qi-6-41/posts

I’m an RLer + NLPer / 2 + MLSyser / 2

Reinforcement Learning

2026

Reward Hacking in Claude Code RL Training

TL;DR: Over two years of Claude model cards, reward hacking escalated from simple test-hardcoding to Mythos autonomously bypassing network restrictions and peeking at held-out test sets.

Stabilizing MoE RL Without Router Replay: The Online IcePop/Seq-level Mask TIS Solution

TL;DR: Solutions for stabilizing MoE RL without Router Replay: (1) Online IcePop = pure online policy updates + IcePop token-level masking of extreme IS weights; (2) Online Seq-level Mask TIS = seq-level (geometric-mean) log-ratio filtering + untruncated IS.
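The IcePop piece of option (1) boils down to token-level masking of extreme importance weights between the training and inference engines. A minimal sketch below, where the symmetric log-ratio threshold `delta` and the function names are my assumptions, not the exact recipe from the post:

```python
import numpy as np

def icepop_token_mask(train_logp, infer_logp, delta=0.5):
    """IcePop-style sketch: zero out tokens whose training-vs-inference
    log-prob gap is extreme, so their noisy importance weights never
    enter the per-token policy-gradient loss."""
    log_ratio = np.asarray(train_logp) - np.asarray(infer_logp)  # per-token log IS weight
    return (np.abs(log_ratio) <= delta).astype(float)            # 1 = keep, 0 = mask

# Toy per-token log-probs from the two engines (made-up numbers)
train_logp = [-1.0, -2.0, -0.5, -3.00]
infer_logp = [-1.1, -2.0, -1.8, -3.05]
mask = icepop_token_mask(train_logp, infer_logp)
# gaps are 0.1, 0.0, 1.3, 0.05 -> only the third token is masked
```

The mask multiplies into the per-token loss, dropping only the mismatched tokens instead of discarding or replaying whole sequences.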

2025

A Brief Analysis of KL Approximation Methods (k1, k2, k3) in RLHF/RLVR

TL;DR: We analyze k1/k2/k3 KL estimators and argue k2 best matches the reverse-KL gradient (k1 fails as a loss; k3 behaves like forward-KL and can be high-variance).
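The three estimators can be checked numerically on a toy categorical policy. Sampling x ~ π and writing r = π_ref(x)/π(x), the standard forms are k1 = −log r, k2 = ½ (log r)², k3 = (r − 1) − log r; the distributions below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distributions: pi (current policy, sampled from) and pi_ref (reference)
pi = np.array([0.5, 0.3, 0.2])
pi_ref = np.array([0.36, 0.24, 0.4])

true_kl = np.sum(pi * np.log(pi / pi_ref))  # exact reverse KL(pi || pi_ref)

# Monte Carlo: draw x ~ pi, evaluate each estimator per sample
x = rng.choice(3, size=200_000, p=pi)
logr = np.log(pi_ref[x]) - np.log(pi[x])  # log pi_ref(x)/pi(x)
r = np.exp(logr)

k1 = -logr              # unbiased, but individual samples can be negative
k2 = 0.5 * logr**2      # biased, low variance, always >= 0
k3 = (r - 1) - logr     # unbiased and always >= 0

for name, k in [("k1", k1), ("k2", k2), ("k3", k3)]:
    print(f"{name}: mean={k.mean():.4f}  std={k.std():.4f}")
print(f"true KL = {true_kl:.4f}")
```

Note that bias/variance of the estimated *value* is a separate question from which estimator, used as a loss, yields the gradient the post analyzes.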

ProRL V2 - Prolonged Training Validates RL Scaling Laws

TL;DR: ProRLv2 shows prolonged RL (3k+ steps, multi-domain, verifiable rewards + strong regularizers) can keep improving reasoning models, supporting RL scaling laws.

Best Practices in RL for Reasoning-Capable LLMs: Insights from Mistral’s Magistral Report