Novel Reward Shaping Technique Enhances RLHF and Mitigates Reward Hacking

Models & Research

The Engineer

3 Mar 2025 · 3 min read

Researchers introduce Preference As Reward, a novel technique that refines reinforcement learning from human feedback, effectively curbing reward hacking and improving model alignment with human values.

Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning large language models (LLMs) with human values. However, one of the significant challenges in RLHF is reward hacking, where the model exploits flaws in the reward function instead of learning the intended behavior. This issue can severely degrade the alignment between the model and human preferences. A recent paper by Fu et al., titled "Reward Shaping to Mitigate Reward Hacking in RLHF," presents a comprehensive study of reward shaping methods and introduces a new approach called Preference As Reward (PAR).

Key Technical Changes and Why They Matter

The authors identify three key design principles for effective reward shaping in RLHF:

Bounded Rewards: The reward function should ideally be bounded to prevent the model from exploiting large, anomalous rewards.
Rapid Initial Growth Followed by Gradual Convergence: The shaped reward should rise quickly early in training and then level off, so that further increases in the raw reward yield diminishing returns.
Centered Reward Function: The RL reward is best formulated as a function of the centered reward, that is, the raw reward measured relative to a reference reward rather than on its own absolute scale.
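
As an illustration, one function that satisfies all three principles is the logistic sigmoid applied to the centered reward. The notation here is an assumption made for this sketch, not taken from the article: r is the raw reward, r_ref is a reference reward, and r_RL is the shaped reward passed to the RL algorithm.

```latex
\[
r_{\mathrm{RL}} = \sigma(r - r_{\mathrm{ref}}), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\]
\[
0 < \sigma(z) < 1, \qquad \sigma(0) = \tfrac{1}{2}, \qquad
\frac{d\sigma}{dz} = \sigma(z)\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4}
\]
```

The output is bounded in (0, 1). The slope is largest at z = 0, where the response scores the same as the reference, and shrinks toward zero as z grows, which gives fast early growth followed by gradual convergence. The function also depends on r only through the centered quantity r - r_ref.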

PAR: Preference As Reward

Guided by these principles, the authors propose PAR, which uses the latent preferences embedded in the reward model itself, rather than its raw scores, as the signal for reinforcement learning. The aim is to leave the policy little to gain from pushing the raw reward to extreme values, which is the behavior reward hacking relies on.

Latent Preferences: PAR takes the preference implied by the reward model, meaning how strongly it favors a response over a reference, and uses that preference as the training signal instead of the raw reward.
Data Efficiency: PAR requires only a single reference reward for optimal performance, making it highly data-efficient.
Robustness: Even after two full epochs of training, PAR maintains robustness against reward hacking.
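
A minimal sketch of such a preference-based reward follows. It assumes the preference is the sigmoid of the reward difference (the Bradley-Terry probability that the response beats the reference) and that preferences are averaged when several reference rewards are available. The names `sigmoid` and `par_reward` are hypothetical and chosen for illustration; this is not the authors' code.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def par_reward(reward: float, reference_rewards: Sequence[float]) -> float:
    """Shaped reward: mean preference of the response over the references.

    Assumption: preference = sigmoid(reward - reference), i.e. the
    Bradley-Terry probability that the response beats the reference.
    """
    if not reference_rewards:
        raise ValueError("at least one reference reward is required")
    prefs = [sigmoid(reward - ref) for ref in reference_rewards]
    return sum(prefs) / len(prefs)


# A response that matches its single reference gets a neutral reward.
print(par_reward(1.0, [1.0]))
# An extreme raw reward is squashed and cannot exceed 1.
print(par_reward(1000.0, [0.0]))
```

With a single reference reward the list has one element, which corresponds to the data-efficient setting described above.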

Experimental Setup and Results

The authors evaluated PAR on two base models: Gemma2-2B and Llama3-8B. They used two datasets for evaluation:

Ultrafeedback-Binarized
HH-RLHF

Key Findings:

Superior Performance: On the AlpacaEval 2.0 benchmark, PAR achieved a win rate at least 5 percentage points higher than competing approaches.
Data Efficiency: PAR reached its best performance with only a single reference reward.
Robustness: PAR remained robust against reward hacking after two full epochs of training.

Implementation Details

The implementation of PAR involves the following steps:

Reward Model Training: Train a reward model to predict human preferences from preference data.
Latent Preference Extraction: Score each policy response, together with a reference, using the trained reward model, and read the difference between the two rewards as a preference for the response over the reference.
Reward Function Formulation: Use that preference, rather than the raw reward, as the reward function for reinforcement learning.
Training and Evaluation: Train the LLM using the PAR reward function and evaluate its performance on various benchmarks.
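
The middle two steps can be sketched as a small pipeline. Everything in it is an assumption made for illustration: `RewardFn` is a hypothetical interface for a trained reward model, `toy_score` is a deliberately flawed stand-in that rewards length, and one reference response is scored per prompt and cached before RL begins.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: (prompt, response) -> scalar reward.
RewardFn = Callable[[str, str], float]


def cache_reference_rewards(score: RewardFn, references: Dict[str, str]) -> Dict[str, float]:
    """Score one reference response per prompt, once, before RL starts."""
    return {prompt: score(prompt, response) for prompt, response in references.items()}


def shape_rewards(score: RewardFn, ref_rewards: Dict[str, float],
                  samples: List[Tuple[str, str]]) -> List[float]:
    """Convert raw rewards for sampled responses into bounded shaped rewards."""
    shaped = []
    for prompt, response in samples:
        centered = score(prompt, response) - ref_rewards[prompt]
        # sigmoid(z) written as 0.5 * (1 + tanh(z / 2)) to avoid overflow.
        shaped.append(0.5 * (1.0 + math.tanh(centered / 2.0)))
    return shaped


def toy_score(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model; it naively rewards longer answers."""
    return 0.1 * len(response.split())


ref_rewards = cache_reference_rewards(toy_score, {"What is RLHF?": "A fine-tuning method."})
samples = [
    ("What is RLHF?", "A fine-tuning method."),  # same reward as the reference
    ("What is RLHF?", "padding " * 5000),        # tries to exploit the length bias
]
print(shape_rewards(toy_score, ref_rewards, samples))
```

The padded response earns a raw reward of about 500 from the toy scorer, but its shaped reward cannot exceed 1, so exploiting the length bias buys the policy very little. In a real setup the shaped values would be handed to whichever RL algorithm performs the policy update.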

Why This Matters to Practitioners

For practitioners working with RLHF, this research offers a framework for mitigating reward hacking, one of the most significant obstacles to aligning models with human values. The three design principles give a checklist for judging any reward shaping function, and PAR is a concrete method built on them. In the reported experiments, using latent preferences as the reward improved win rates over other reward shaping methods and kept training robust against reward hacking through two full epochs.