👋 Welcome to Jian Hu’s Blog

Email: [email protected]

Homepage: https://hujian.website/

GitHub: https://github.com/hijkzzz

Google Scholar: https://scholar.google.com/citations?user=-xt5vGkAAAAJ

Zhihu: https://www.zhihu.com/people/chu-qi-6-41/posts

I’m an RLer + NLPer / 2 + MLSyser / 2

Reinforcement Learning from Human Feedback

REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

Authors: Jian Hu

TL;DR: In this post, we introduce the REINFORCE++ algorithm, which integrates various optimization tricks from Proximal Policy Optimization (PPO) into REINFORCE to achieve stable and efficient training for LLM alignment.
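The core idea (borrowing PPO's clipped surrogate while keeping REINFORCE's critic-free advantages) can be sketched as follows. This is a hypothetical minimal illustration under assumed per-token inputs, not the post's actual implementation; function names and the `clip_eps` default are my own choices:

```python
import math

def reinforce_pp_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Sketch: PPO-style clipped surrogate applied to REINFORCE.

    logprobs / old_logprobs: per-token log-probs under the current and
    behavior policy; advantages: per-token advantage estimates (in
    REINFORCE-style methods these come from returns, not a critic).
    Returns the mean negative clipped-surrogate loss over tokens.
    """
    total = 0.0
    for lp, old_lp, adv in zip(logprobs, old_logprobs, advantages):
        ratio = math.exp(lp - old_lp)  # importance ratio pi/pi_old
        unclipped = ratio * adv
        # PPO trick: clip the ratio to [1 - eps, 1 + eps] to limit updates
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * adv
        total += -min(unclipped, clipped)  # pessimistic bound, negated
    return total / len(logprobs)

def normalize_advantages(advs, eps=1e-8):
    """Sketch of a global advantage normalization pass (another common
    PPO-style stabilization trick)."""
    mean = sum(advs) / len(advs)
    var = sum((a - mean) ** 2 for a in advs) / len(advs)
    return [(a - mean) / math.sqrt(var + eps) for a in advs]
```

When the current and old log-probs coincide, every ratio is 1 and the loss reduces to the plain negative mean advantage, which is the sanity check that the clipping only activates once the policy moves.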

Exploring OpenAI O1 Model Replication

Authors: Jian Hu

TL;DR: By diving deep into relevant research and collaborating with experts, I’ve compiled and hypothesized several potential strategies for replicating O1 models, drawing on Awesome LLM Strawberry (OpenAI o1) - GitHub. This post outlines these findings for further exploration.

Unraveling RLHF and Its Variants: Progress and Practical Engineering Insights

Authors: Jian Hu

TL;DR: We review the history of RLHF and its variants, including PPO, DPO, Iterative DPO, REINFORCE, GRPO, and RLOO, and offer analysis and insights from an engineering-implementation perspective, with the aim of weighing the strengths and weaknesses of each algorithm.

A Survey of Reinforcement Learning from Human Feedback (RLHF)

Authors: Jian Hu and Weixun Wang