👋 Welcome to Jian Hu’s Blog

Email: [email protected]

Homepage: https://hujian.website/

GitHub: https://github.com/hijkzzz

Google Scholar: https://scholar.google.com/citations?user=-xt5vGkAAAAJ

LinkedIn: https://www.linkedin.com/in/jian-hu-060979238

Zhihu: https://www.zhihu.com/people/chu-qi-6-41/posts

I’m an RLer + NLPer / 2 + MLSyser / 2

Reinforcement Learning from Human Feedback

Best Practices in RL for Reasoning-Capable LLMs: Insights from Mistral’s Magistral Report

Authors: Jian Hu

TL;DR: Magistral combines PPO-Clip, REINFORCE++-style advantage normalization, and DAPO tricks like Dynamic Sampling into a solid RLHF recipe for reasoning LLMs.
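The pieces named above fit together in only a few lines. As a rough illustration (not Mistral’s actual implementation; the function names and tensor shapes here are my own choices), the core components might look like this in PyTorch:

```python
# Hypothetical sketch of the components named in the TL;DR:
# (1) REINFORCE++-style advantage normalization: subtract the per-prompt group
#     mean, then normalize by the std over the whole batch;
# (2) a token-level PPO-Clip loss;
# (3) DAPO-style dynamic sampling: drop prompts whose sampled responses all
#     received the same reward (zero learning signal).
import torch

def normalized_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: [num_prompts, samples_per_prompt] scalar reward per response."""
    baseline = rewards.mean(dim=1, keepdim=True)      # group (per-prompt) mean baseline
    adv = rewards - baseline                          # group-relative advantage
    return adv / (adv.std() + eps)                    # normalize across the whole batch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  adv: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective; adv is broadcast over the tokens of each response."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()

def keep_informative_prompts(rewards: torch.Tensor) -> torch.Tensor:
    """Dynamic-sampling filter: boolean mask of prompts with non-identical rewards."""
    return rewards.std(dim=1) > 0
```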

REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

Authors: Jian Hu

TL;DR: In this blog, we introduce the REINFORCE++ algorithm, which integrates optimization tricks from Proximal Policy Optimization (PPO) into the classic REINFORCE algorithm to achieve stable and efficient training for LLM alignment.
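As a quick, hedged sketch of the idea (illustrative only; see the post and OpenRLHF for the actual recipe, and note the function names here are assumptions of mine): fold a per-token KL penalty against a reference model into the reward, use the batch-normalized cumulative return as a critic-free advantage, and plug it into a PPO-style clipped update like the one sketched above.

```python
# Rough PyTorch sketch of the REINFORCE++ idea described in the post (not the
# reference implementation): token-level KL penalty in the reward, plus a
# critic-free advantage normalized over the whole batch.
import torch

def token_level_rewards(seq_reward: torch.Tensor, logp_policy: torch.Tensor,
                        logp_ref: torch.Tensor, kl_coef: float = 0.01) -> torch.Tensor:
    """seq_reward: [B] scalar reward per response; logp_*: [B, T] token log-probs."""
    kl = logp_policy - logp_ref          # per-token KL estimate vs. the reference model
    rewards = -kl_coef * kl              # KL penalty applied at every token
    rewards[:, -1] += seq_reward         # outcome reward added on the final token
    return rewards

def reinforce_pp_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Cumulative future reward per token, normalized over the batch (no value network)."""
    returns = torch.flip(torch.cumsum(torch.flip(rewards, dims=[1]), dim=1), dims=[1])
    return (returns - returns.mean()) / (returns.std() + eps)
```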

Exploring OpenAI O1 Model Replication

Authors: Jian Hu

TL;DR: By diving deep into the relevant research and collaborating with experts, I’ve compiled and hypothesized several potential strategies for replicating O1 models, drawing on the Awesome LLM Strawberry (OpenAI o1) collection on GitHub. This post outlines these findings for further exploration.

Unraveling RLHF and Its Variants: Progress and Practical Engineering Insights