Author: Jian Hu
First published on: 2025/12/16
In the post-training phase of Large Language Models (LLMs), Mixture-of-Experts (MoE) models strike an excellent balance between inference efficiency and model capacity thanks to their sparse activation. However, when it comes to Reinforcement Learning (RL) training, such as PPO, the MoE architecture introduces a tricky stability challenge.
In this post, we explore a combined strategy known as Online IcePop, which integrates the IcePop algorithm from Ant Group's Bailing Team with the recent Online Policy Gradient findings from the Qwen Team. This approach not only effectively stabilizes MoE training but also allows us to completely discard the complex and expensive "Router Replay" mechanism.
In Off-policy (or approximately On-policy) algorithms like PPO (Proximal Policy Optimization), we typically rely on Importance Sampling (IS) to correct the deviation between the old policy (the behavior policy that generated the data) and the current policy (the target policy being updated). The IS weight $\rho_t$ is defined as:
$$ \rho_t = \frac{\pi_{\text{new}}(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)} $$
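In code, $\rho_t$ is computed from the per-token log-probabilities of the sampled actions under both policies. A minimal PyTorch-style sketch (function name and shapes are illustrative, not taken from any particular framework):

```python
import torch

def importance_weights(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Per-token IS weights rho_t = pi_new(a_t|s_t) / pi_old(a_t|s_t).

    Both inputs hold log-probabilities of the sampled tokens, shape [batch, seq_len].
    Working in log space and applying a single exp() keeps the computation numerically stable.
    """
    return torch.exp(logp_new - logp_old)
```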
For dense models, policy changes between updates are usually smooth. However, for MoE models the situation is drastically different: even a small parameter update can flip the Router's top-k expert selection, so the same token is suddenly processed by a different set of experts and its probability under the new policy jumps discontinuously.
These fluctuations cause the Importance Sampling Weights to hit extreme values, resulting in excessive gradient variance and model divergence. To mitigate this, the industry has often resorted to Router Replay (recomputing the Router path for old data during updates)—a complex and computationally expensive workaround.
To address the issue of exploding IS weights, Ant Group's Bailing Team proposed an elegant solution in their paper IcePop: An Effective Method for MoE Stability.
**Truncated Importance Sampling is typically used to correct precision errors between vLLM (the inference engine) and FSDP (the training engine)**. However, truncation alone is often insufficient for MoE models. IcePop's core innovation is adding a Masking operation on top of truncation:


**Core Mechanism:** When the Importance Sampling Weight exceeds a preset threshold range, IcePop does not just clip it; it masks it out entirely, treating the token as an invalid sample whose contribution to the gradient is zero.
This approach may seem aggressive, but it precisely filters out samples that are "statistically unreliable" due to expert mutations, significantly reducing the variance of the estimator.
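A minimal sketch of this masked weight, assuming illustrative thresholds (the paper's exact bounds and loss formulation may differ):

```python
import torch

def icepop_weights(logp_new: torch.Tensor,
                   logp_old: torch.Tensor,
                   low: float = 0.5,
                   high: float = 2.0) -> torch.Tensor:
    """Masked importance-sampling weights in the spirit of IcePop.

    Tokens whose weight rho_t falls outside [low, high] are zeroed out
    (masked) rather than clipped to the boundary, so they contribute
    nothing to the policy-gradient loss. The threshold values here are
    placeholders, not the paper's settings.
    """
    rho = torch.exp(logp_new - logp_old)
    mask = (rho >= low) & (rho <= high)
    return rho * mask.to(rho.dtype)
```

Compared with plain truncation, where an out-of-range token still enters the loss with a capped but nonzero weight, a masked token is dropped entirely, which is what removes the extreme-variance contributions caused by expert-routing flips.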