Author: Jian Hu
First published on: 2025/12/16
In the post-training phase of Large Language Models (LLMs), Mixture-of-Experts (MoE) models strike an excellent balance between inference efficiency and model capacity thanks to their sparse activation. However, when it comes to Reinforcement Learning (RL) training, such as PPO, the MoE architecture introduces a tricky stability challenge.
In this post, we explore a combined strategy known as Online IcePop, which integrates the IcePop algorithm from Ant Group's Bailing Team with the recent Online Policy Gradient findings from the Qwen Team. This approach not only effectively stabilizes MoE training but also allows us to completely discard the complex and expensive "Router Replay" mechanism.
In Off-policy (or approximately On-policy) algorithms like PPO (Proximal Policy Optimization), we typically rely on Importance Sampling (IS) to correct the deviation between the old policy (the behavior policy that generated the data) and the current policy (the target policy being updated). The IS weight $\rho_t$ is defined as:
$$ \rho_t = \frac{\pi_{\text{new}}(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)} $$
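In code, $\rho_t$ is computed from the per-token log-probabilities of the sampled actions under both policies. A minimal PyTorch-style sketch (function name and shapes are illustrative, not taken from any particular framework):

```python
import torch

def importance_weights(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Per-token IS weights rho_t = pi_new(a_t|s_t) / pi_old(a_t|s_t).

    Both inputs hold log-probabilities of the sampled tokens, shape [batch, seq_len].
    Working in log space and applying a single exp() keeps the computation numerically stable.
    """
    return torch.exp(logp_new - logp_old)
```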
For dense models, policy changes between updates are usually smooth. However, for MoE models the situation is drastically different: even a small parameter update can flip the Router's top-k expert selection, so the same token is suddenly processed by a different set of experts and its probability under the new policy jumps discontinuously.
These fluctuations cause the Importance Sampling Weights to hit extreme values, resulting in excessive gradient variance and model divergence. To mitigate this, the industry has often resorted to Router Replay (recomputing the Router path for old data during updates)—a complex and computationally expensive workaround.
To address the issue of exploding IS weights, Ant Group's Bailing Team proposed an elegant solution in their paper IcePop: An Effective Method for MoE Stability.
**Truncated Importance Sampling is typically used to correct precision errors between vLLM (the inference engine) and FSDP (the training engine)**. However, truncation alone is often insufficient for MoE models. IcePop's core innovation is adding a Masking operation on top of truncation:


**Core Mechanism:** When the Importance Sampling Weight exceeds a preset threshold range, IcePop does not just clip it; it masks it out entirely, treating the token as an invalid sample whose contribution to the gradient is zero.
This approach may seem aggressive, but it precisely filters out samples that are "statistically unreliable" due to expert mutations, significantly reducing the variance of the estimator.
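A minimal sketch of this masked weight, assuming illustrative thresholds (the paper's exact bounds and loss formulation may differ):

```python
import torch

def icepop_weights(logp_new: torch.Tensor,
                   logp_old: torch.Tensor,
                   low: float = 0.5,
                   high: float = 2.0) -> torch.Tensor:
    """Masked importance-sampling weights in the spirit of IcePop.

    Tokens whose weight rho_t falls outside [low, high] are zeroed out
    (masked) rather than clipped to the boundary, so they contribute
    nothing to the policy-gradient loss. The threshold values here are
    placeholders, not the paper's settings.
    """
    rho = torch.exp(logp_new - logp_old)
    mask = (rho >= low) & (rho <= high)
    return rho * mask.to(rho.dtype)
```

Compared with plain truncation, where an out-of-range token still enters the loss with a capped but nonzero weight, a masked token is dropped entirely, which is what removes the extreme-variance contributions caused by expert-routing flips.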