👋 Welcome to Jian Hu’s Blog
Email: [email protected]
Homepage: https://hujian.website/
GitHub: https://github.com/hijkzzz
Google Scholar: https://scholar.google.com/citations?user=-xt5vGkAAAAJ
LinkedIn: https://www.linkedin.com/in/jian-hu-060979238
Zhihu: https://www.zhihu.com/people/chu-qi-6-41/posts
I’m an RLer + NLPer / 2 + MLSyser / 2
ProRL V2 - Prolonged Training Validates RL Scaling Laws
Authors: Jian Hu, Mingjie Liu, Shizhe Diao, Ximing Lu, Xin Dong, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
TL;DR: ProRL V2 is the latest evolution of our Prolonged Reinforcement Learning (ProRL) regime, specifically designed to test the effects of extended RL training on LLMs.
Best Practices in RL for Reasoning-Capable LLMs: Insights from Mistral’s Magistral Report
Authors: Jian Hu
TL;DR: Magistral combines PPO-Clip, REINFORCE++-style advantage normalization, and DAPO tricks like Dynamic Sampling into a solid RLHF recipe for reasoning LLMs.
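To make the combination concrete, here is a minimal Python/PyTorch sketch of the three ingredients named above. The function names (`normalize_advantages`, `dynamic_sampling_filter`, `ppo_clip_loss`) and default hyperparameters are illustrative assumptions, not Magistral's actual code.

```python
import torch

def normalize_advantages(advantages, eps=1e-8):
    # REINFORCE++-style normalization: standardize advantages over the
    # whole training batch (hypothetical helper, not Magistral's exact code).
    return (advantages - advantages.mean()) / (advantages.std() + eps)

def dynamic_sampling_filter(group_rewards):
    # DAPO-style Dynamic Sampling: keep only prompt groups whose sampled
    # rewards are not all identical, so every kept group carries a nonzero
    # learning signal (assumed behavior; details vary by implementation).
    return [r for r in group_rewards if r.max() > r.min()]

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Standard PPO-Clip surrogate on per-token log-probabilities.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```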
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
Authors: Jian Hu
TL;DR: In this blog, we introduce the REINFORCE++ algorithm, which integrates various optimization tricks from Proximal Policy Optimization (PPO) into REINFORCE to achieve stable and efficient training in LLM alignment.
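As a rough illustration of the idea, the sketch below combines a REINFORCE-style, critic-free advantage with two PPO tricks: a token-level KL penalty folded into the reward and a clipped surrogate objective. It is a simplified assumption of the recipe described in the post; function names, the `kl_coef` default, and the reward-to-go computation are illustrative, not the paper's exact formulation.

```python
import torch

def reinforce_pp_advantages(rewards, log_probs, ref_log_probs,
                            kl_coef=0.01, eps=1e-8):
    # rewards: (batch, seq_len) with the sequence-level reward on the final token.
    # log_probs / ref_log_probs: per-token log-probs of the policy and reference model.
    kl = log_probs - ref_log_probs                  # per-token approximate KL
    shaped = rewards - kl_coef * kl                 # KL penalty folded into the reward
    returns = shaped.flip(-1).cumsum(-1).flip(-1)   # reward-to-go per token, no critic
    # Global advantage normalization across the whole batch.
    return (returns - returns.mean()) / (returns.std() + eps)

def reinforce_pp_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # PPO-Clip surrogate applied to the REINFORCE advantages, bounding the
    # policy update while allowing mini-batch reuse of rollouts.
    ratio = torch.exp(log_probs - old_log_probs)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages,
    )
    return -surrogate.mean()
```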