👋 Welcome to Jian Hu’s Blog

Email: [email protected]

Homepage: https://hujian.website/

GitHub: https://github.com/hijkzzz

Google Scholar: https://scholar.google.com/citations?user=-xt5vGkAAAAJ

LinkedIn: https://www.linkedin.com/in/jian-hu-060979238

Zhihu: https://www.zhihu.com/people/chu-qi-6-41/posts

I’m an RLer + NLPer / 2 + MLSyser / 2

Reinforcement Learning from Human Feedback

Stabilizing MoE RL Without Router Replay: The Online IcePop Solution

Authors: Jian Hu

TL;DR: Online IcePop stabilizes MoE RL training by combining IcePop masking with online policy gradients, eliminating the need for Router Replay mechanisms.
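For intuition, here is a minimal sketch of the idea, assuming IcePop-style masking drops tokens whose training-engine vs. inference-engine probability ratio falls outside a band; the function names, band bounds, and loss shape are illustrative assumptions, not the post's exact implementation.

```python
import torch

def icepop_mask(logp_train: torch.Tensor,
                logp_infer: torch.Tensor,
                low: float = 0.5,
                high: float = 2.0) -> torch.Tensor:
    # Keep only tokens whose train/infer probability ratio stays in [low, high].
    # NOTE: the bounds here are placeholders, not the values from the post.
    ratio = (logp_train - logp_infer).exp()
    return (ratio >= low) & (ratio <= high)

def masked_online_pg_loss(logp_train: torch.Tensor,
                          logp_infer: torch.Tensor,
                          advantages: torch.Tensor) -> torch.Tensor:
    # On-policy ("online") policy-gradient loss with IcePop-style masking:
    # tokens with a large train/infer mismatch contribute no gradient,
    # so router-induced probability drift cannot destabilize the update.
    mask = icepop_mask(logp_train, logp_infer).float()
    loss = -(advantages * logp_train * mask).sum() / mask.sum().clamp(min=1.0)
    return loss
```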

A Brief Analysis of KL Approximation Methods (k1, k2, k3) in RLHF/RLVR

Authors: Jian Hu

TL;DR: In RLHF training, a Kullback-Leibler (KL) divergence constraint is used to keep the policy model ($\pi$) from straying too far from the reference model ($\pi_{ref}$). This post analyzes three common estimators (k1, k2, k3) used to approximate the reverse KL gradient.
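As a quick reference, here is a sketch of the three standard per-token estimators of the reverse KL $D_{KL}(\pi \,\|\, \pi_{ref})$, evaluated on samples drawn from $\pi$ with $r = \pi_{ref}(x)/\pi(x)$; the function name is illustrative.

```python
import torch

def kl_estimators(logp: torch.Tensor, logp_ref: torch.Tensor):
    # logp:     log pi(x)      under the current policy (samples come from pi)
    # logp_ref: log pi_ref(x)  under the frozen reference model
    log_ratio = logp_ref - logp            # log r, where r = pi_ref / pi
    k1 = -log_ratio                        # unbiased, high variance, can be negative
    k2 = 0.5 * log_ratio ** 2              # biased, low variance, always >= 0
    k3 = log_ratio.exp() - 1 - log_ratio   # unbiased, low variance, always >= 0
    return k1, k2, k3
```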

ProRL V2 - Prolonged Training Validates RL Scaling Laws

Authors: Jian Hu, Mingjie Liu, Shizhe Diao, Ximing Lu, Xin Dong, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong

TL;DR: ProRLv2 is the latest evolution of our Prolonged Reinforcement Learning (ProRL) regime, specifically designed to test the effects of extended RL training on LLMs.

Best Practices in RL for Reasoning-Capable LLMs: Insights from Mistral’s Magistral Report