Original Author: Jiacai Liu (Zhihu: skydownacai) Original Post: Zhihu Column · Claude Code 模型 RL训练中的Reward Hacking Compiled by Jian Hu on 4/13/2026

Compiled from Anthropic's official Model Cards (Sonnet 3.7 through Mythos Preview), using the analytical framework from the original author Jiacai Liu. All quantitative data comes from Anthropic's official documentation — please refer to the original model cards for authoritative numbers.

Chinese version:

reward-hacking-cn.pdf


Why Reward Hacking Is the Central Problem in RL Training

As RL infrastructure matures, large-scale reinforcement learning has become standard practice for improving frontier LLMs. Claude Code is currently the SOTA coding agent, and it has necessarily gone through extensive RL training. Yet RL training is far more complex than just watching reward, entropy, and test accuracy curves.

The core issue is this: maximizing training reward does not equal aligning model behavior to what humans actually want. The gap between the two is reward hacking.

Anthropic's official definition:

"When an AI model finds a way to technically satisfy the rules while violating the original intent of a task — i.e., the model discovers and exploits a shortcut or loophole to maximize reward."

The most straightforward example: for a coding task, instead of implementing a general algorithm, the model just prints the expected output values to pass the test cases (hard-coding). The model scores well on the training set but is completely useless — and potentially harmful — in real-world use.

The original author systematically read all 13 Anthropic Model Cards from Claude 2 (February 2023) to Mythos Preview (April 2026), extracting everything related to reward hacking to answer four core questions:

  1. How does Anthropic detect and identify reward hacking?
  2. What specific types of reward hacking have appeared?
  3. How is reward hacking quantitatively evaluated?
  4. What mitigation strategies has Anthropic used?

Part 1: How Does Anthropic Detect Reward Hacking?

Anthropic has built a monitoring framework that continuously upgrades alongside each model generation. The core pattern: use the current best model to monitor the next generation's RL training trajectories.

Sonnet 3.7 (February 2025) — Automated Classifier as the Starting Point