
Researchers propose a return to basics with REINFORCE, showing it can surpass complex algorithms like PPO in training large language models from human feedback, offering efficiency and ease of use.
In the world of AI alignment, Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone technique for enhancing large language models (LLMs). However, the go-to method, Proximal Policy Optimization (PPO), comes with significant computational overhead and tricky hyperparameter tuning. A new paper by Ahmadian et al., titled "Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs," argues that simpler is often better. They demonstrate that REINFORCE-style optimization can outperform PPO and other recent methods while being more computationally efficient.
The key insight is that many of the complexities introduced by PPO are not as necessary when aligning LLMs with human preferences. Here’s a breakdown:
Simplification of Components: The authors show that several components of PPO, such as trust region constraints and adaptive KL penalties, are less critical in an RLHF context. These components were originally designed to stabilize training in environments where the policy can change drastically from one update to the next. In RLHF, by contrast, fine-tuning starts from a pretrained and supervised fine-tuned LLM, so the policy is already strongly initialized and individual updates tend to be more stable.
REINFORCE Performance: The paper revisits REINFORCE, a simpler reinforcement learning algorithm that directly optimizes the expected reward. They find that with careful tuning, REINFORCE can achieve comparable or even better performance than PPO in RLHF tasks.
Benchmark Results: The authors conduct experiments on several benchmark datasets and show that REINFORCE-style methods outperform PPO as well as "RL-free" methods like DPO (Direct Preference Optimization) and RAFT (Reward rAnked FineTuning).
Computational Efficiency: The computational cost of training with REINFORCE is significantly lower than with PPO. For instance, the authors report that training an LLM with REINFORCE took only 60% of the GPU hours that PPO needed on similar tasks.
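To make the contrast in the breakdown above concrete, here is a minimal sketch of the two objectives on a toy problem. It is not the authors' implementation: the three-action categorical policy, the batch-mean baseline, and the names ppo_clipped_objective, reinforce_gradient and train_toy_policy are all assumptions made for illustration. The first function is the standard PPO clipped surrogate. The rest is a plain REINFORCE update that treats each sampled action as a whole "completion" with a single scalar reward.

```python
import numpy as np


def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Standard PPO surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped_ratio * advantage)


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def reinforce_gradient(logits, actions, rewards, baseline=0.0):
    """Monte Carlo estimate of the gradient of expected reward w.r.t. the logits."""
    probs = softmax(logits)
    grad = np.zeros_like(logits)
    for action, reward in zip(actions, rewards):
        # For a softmax policy, d log pi(action) / d logits = one_hot(action) - probs.
        grad_log_prob = -probs.copy()
        grad_log_prob[action] += 1.0
        grad += (reward - baseline) * grad_log_prob
    return grad / len(actions)


def train_toy_policy(reward_per_action, steps=200, batch_size=16, lr=0.5, seed=0):
    """Gradient ascent on expected reward; hyperparameters are illustrative only."""
    reward_per_action = np.asarray(reward_per_action, dtype=float)
    rng = np.random.default_rng(seed)
    logits = np.zeros(len(reward_per_action))
    for _ in range(steps):
        probs = softmax(logits)
        actions = rng.choice(len(logits), size=batch_size, p=probs)
        rewards = reward_per_action[actions]
        # Assumption: the batch mean serves as a simple variance-reduction baseline.
        logits += lr * reinforce_gradient(logits, actions, rewards, rewards.mean())
    return softmax(logits)
```

When the probability ratio stays inside [1 - eps, 1 + eps], the clipped surrogate equals the unclipped one, so the clipping does nothing; this is the regime the stability argument above points at. The REINFORCE path needs no ratio, no clipping, and no learned value function, only sampled rewards and log-probability gradients.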

The paper provides several implementation notes and tips for practitioners:
Reward Shaping: The authors emphasize the importance of reward shaping in RLHF. They suggest using a combination of immediate rewards (e.g., for grammatical correctness) and delayed rewards (e.g., for coherence over longer sequences).
Batch Size and Learning Rate: They recommend starting with smaller batch sizes and gradually increasing them as the model stabilizes. For learning rates, they advise a range between 1e-5 and 1e-4, depending on the specific task.
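The two tips above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the linear ramp shape, the batch sizes 32 and 256, and the 1000-step ramp are hypothetical and do not come from the paper. Only the 1e-5 to 1e-4 learning-rate range is taken from the text above.

```python
def shaped_returns(immediate_rewards, delayed_reward, gamma=1.0):
    """Per-token return-to-go when a delayed reward is added at the last token.

    Assumes immediate_rewards is a non-empty sequence of per-token rewards.
    """
    rewards = list(immediate_rewards)
    rewards[-1] += delayed_reward  # the delayed reward arrives at the end of the sequence
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def batch_size_at(step, start=32, end=256, ramp_steps=1000):
    """Hypothetical linear ramp from a small batch to a larger one, then hold."""
    frac = min(max(step / ramp_steps, 0.0), 1.0)
    return int(round(start + frac * (end - start)))


# Learning-rate range quoted in the text above.
LR_MIN, LR_MAX = 1e-5, 1e-4


def learning_rate_in_range(lr):
    """True if lr falls inside the suggested 1e-5 to 1e-4 range."""
    return LR_MIN <= lr <= LR_MAX
```

For example, shaped_returns([0.1, 0.0, 0.2], 1.0) credits every token with the delayed reward plus whatever immediate rewards follow it, and batch_size_at(step) can be called once per training step to pick the current batch size.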
For practitioners working on aligning LLMs with human preferences, this paper offers a compelling case for revisiting simpler reinforcement learning algorithms. The computational efficiency of REINFORCE makes it particularly attractive for large-scale applications where resource constraints are a concern. Additionally, the simplicity of the method can lead to faster prototyping and iteration cycles.
The findings by Ahmadian et al. challenge the prevailing notion that complex methods like PPO are necessary for effective RLHF. By revisiting and refining simpler algorithms like REINFORCE, practitioners can achieve high performance with lower computational costs. This work is a valuable addition to the growing body of research on AI alignment and offers practical insights for those working in the field.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
26 February 2024