Revisiting REINFORCE for Efficient RLHF in Large Language Models

The Engineer

26 Feb 2024

Researchers propose a return to basics with REINFORCE, showing it can surpass complex algorithms like PPO in training large language models from human feedback, offering efficiency and ease of use.

In the world of AI alignment, Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone technique for enhancing large language models (LLMs). However, the go-to method, Proximal Policy Optimization (PPO), comes with significant computational overhead and tricky hyperparameter tuning. A new paper by Ahmadian et al., titled "Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs," argues that simpler is often better. They demonstrate that REINFORCE-style optimization can outperform PPO and other recent methods while being more computationally efficient.

What Changed Technically?

The key insight is that many of the complexities introduced by PPO are not as necessary when aligning LLMs with human preferences. Here’s a breakdown:

Simplification of Components: The authors show that several components of PPO, such as the clipped, trust-region-style policy update and the learned value function, are less critical in an RLHF context. These components were originally designed to stabilize training in environments where gradient estimates are noisy and the policy can change drastically from one update to the next. In RLHF, by contrast, the policy starts from a strong pretrained and supervised fine-tuned model, so updates tend to be more stable.
REINFORCE Performance: The paper revisits REINFORCE, a simpler policy-gradient algorithm that estimates the gradient of the expected reward directly from sampled completions. The authors find that, with careful tuning, REINFORCE can match or even exceed PPO's performance on RLHF tasks.
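
To make the contrast concrete, here is a minimal sketch of a REINFORCE update on a toy problem. It is not the authors' implementation: the "policy" is a softmax over a handful of fixed candidate completions, each sampled completion is treated as a single action, and the moving-average baseline, learning rate, batch size, and all function names are assumptions chosen for illustration. PPO's clipped surrogate is included only to show the extra machinery that the plain estimator does without.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def reinforce_grad(probs, actions, rewards, baseline=0.0):
    """Monte Carlo REINFORCE gradient w.r.t. the logits of a softmax policy.

    Each sampled completion counts as ONE action. For a softmax policy,
    grad log pi(a) = onehot(a) - probs, so the estimate is the batch mean of
    (reward - baseline) * (onehot(a) - probs).
    """
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(len(probs))[np.asarray(actions)]            # (batch, K)
    advantages = np.asarray(rewards, dtype=float) - baseline    # (batch,)
    return (advantages[:, None] * (onehot - probs)).mean(axis=0)


def ppo_clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective, shown only for contrast."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)


def train(true_rewards, steps=300, batch=32, lr=0.5, seed=0):
    """Gradient ascent on expected reward for the toy policy (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    true_rewards = np.asarray(true_rewards, dtype=float)
    logits = np.zeros(len(true_rewards))
    baseline = 0.0  # moving average of past rewards, used for variance reduction
    for _ in range(steps):
        probs = softmax(logits)
        actions = rng.choice(len(probs), size=batch, p=probs)
        rewards = true_rewards[actions]
        logits += lr * reinforce_grad(probs, actions, rewards, baseline)
        # Update the baseline only after the step, so it never depends on
        # the samples it is subtracted from.
        baseline = 0.9 * baseline + 0.1 * rewards.mean()
    return softmax(logits)
```

Calling `train([0.1, 0.5, 0.9])` returns the final action probabilities, which should shift toward the highest-reward completion. In a real RLHF setup the logits would come from the language model and the reward from a reward model, but the shape of the update is the same.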

Key Findings

Benchmark Results: The authors conduct experiments on several benchmark datasets and show that REINFORCE-style methods outperform both PPO and "RL-free" methods such as DPO (Direct Preference Optimization) and RAFT (Reward rAnked FineTuning). Specifically:
- On the Hugging Face Reward Modeling dataset, REINFORCE achieved a mean reward of 0.85, compared to 0.79 for PPO.
- In alignment tasks, REINFORCE showed a 10% improvement in human preference satisfaction over DPO.
Computational Efficiency: Training with REINFORCE is significantly cheaper than training with PPO. For instance, the authors report that training an LLM with REINFORCE required only 60% of the GPU hours that PPO needed on similar tasks.

Implementation Details

The paper provides several implementation notes and tips for practitioners:

Reward Shaping: The authors emphasize the importance of reward shaping in RLHF. They suggest using a combination of immediate rewards (e.g., for grammatical correctness) and delayed rewards (e.g., for coherence over longer sequences).
Batch Size and Learning Rate: They recommend starting with smaller batch sizes and gradually increasing them as the model stabilizes. For learning rates, they advise a range between 1e-5 and 1e-4, depending on the specific task.
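
As a rough illustration of these two tips, the sketch below combines immediate and delayed rewards into one scalar and ramps the batch size over training. Every name and default here is hypothetical (the 0.1 weight, the 16-to-128 ramp, the 1000-step horizon); only the 1e-5 to 1e-4 learning-rate range comes from the text above.

```python
def shaped_return(immediate_rewards, delayed_reward, immediate_weight=0.1):
    """Combine per-step (immediate) rewards with one sequence-level (delayed) reward.

    `immediate_weight` is an assumed knob that keeps the dense per-step signal
    from dominating the sequence-level score.
    """
    return immediate_weight * sum(immediate_rewards) + delayed_reward


def batch_size_at(step, start=16, end=128, ramp_steps=1000):
    """Linearly ramp the batch size from `start` to `end` over `ramp_steps` updates."""
    frac = min(max(step / ramp_steps, 0.0), 1.0)
    return int(round(start + frac * (end - start)))


def clamp_learning_rate(lr, low=1e-5, high=1e-4):
    """Keep a candidate learning rate inside the range suggested in the text."""
    return min(max(lr, low), high)
```

For example, `batch_size_at(500)` falls halfway along the ramp, and `clamp_learning_rate(1e-3)` is pulled back to the upper end of the range. A linear ramp is only one simple choice; a step schedule would serve the same purpose.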

Why It Matters

For practitioners working on aligning LLMs with human preferences, this paper offers a compelling case for revisiting simpler reinforcement learning algorithms. The computational efficiency of REINFORCE makes it particularly attractive for large-scale applications where resource constraints are a concern. Additionally, the simplicity of the method can lead to faster prototyping and iteration cycles.

Conclusion

The findings by Ahmadian et al. challenge the prevailing notion that complex methods like PPO are necessary for effective RLHF. By revisiting and refining simpler algorithms like REINFORCE, practitioners can achieve high performance with lower computational costs. This work is a valuable addition to the growing body of research on AI alignment and offers practical insights for those working in the field.