Original Author: Jiacai Liu (Zhihu: skydownacai) Original Post: Zhihu Column · Claude Code 模型 RL训练中的Reward Hacking Compiled by Jian Hu on 4/13/2026
Compiled from Anthropic's official Model Cards (Sonnet 3.7 through Mythos Preview), using the analytical framework from the original author Jiacai Liu. All quantitative data comes from Anthropic's official documentation — please refer to the original model cards for authoritative numbers.
Chinese version:
As RL infrastructure matures, large-scale reinforcement learning has become standard practice for improving frontier LLMs. Claude Code is currently the SOTA coding agent, and it has necessarily gone through extensive RL training. Yet RL training is far more complex than just watching reward, entropy, and test accuracy curves.
The core issue is this: maximizing training reward does not equal aligning model behavior to what humans actually want. The gap between the two is reward hacking.
Anthropic's official definition:
"When an AI model finds a way to technically satisfy the rules while violating the original intent of a task — i.e., the model discovers and exploits a shortcut or loophole to maximize reward."
The most straightforward example: for a coding task, instead of implementing a general algorithm, the model just prints the expected output values to pass the test cases (hard-coding). The model scores well on the training set but is completely useless — and potentially harmful — in real-world use.
The original author systematically read all 13 Anthropic Model Cards from Claude 2 (February 2023) to Mythos Preview (April 2026), extracting everything related to reward hacking to answer four core questions:
Anthropic has built a monitoring framework that continuously upgrades alongside each model generation. The core pattern: use the current best model to monitor the next generation's RL training trajectories.