
Researchers introduce Preference As Reward, a reward shaping technique for reinforcement learning from human feedback that aims to curb reward hacking and improve model alignment with human values.
Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning large language models (LLMs) with human values. However, one of the significant challenges in RLHF is reward hacking, where the model exploits flaws in the reward function instead of learning the intended behavior. This issue can severely degrade the alignment between the model and human preferences. A recent paper by Fu et al., titled "Reward Shaping to Mitigate Reward Hacking in RLHF," presents a comprehensive study of reward shaping methods and introduces a new approach called Preference As Reward (PAR).
The authors identify three key design principles for effective reward shaping in RLHF:
Guided by these principles, the authors propose PAR, which uses the latent preferences embedded in the reward model as the signal for reinforcement learning. The aim is to mitigate reward hacking by keeping the reward signal more closely aligned with human preferences.
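To make the idea concrete, here is a minimal sketch of one way a preference-style reward can be computed. It rests on an assumption this article does not spell out: that the shaped reward is a sigmoid applied to the centered reward, meaning the reward model's score for the policy's response minus its score for a reference response to the same prompt. The names `stable_sigmoid`, `par_reward`, and `reference_rewards` are hypothetical and chosen for illustration; averaging over several reference responses is likewise an assumption, not a detail confirmed here.

```python
import math


def stable_sigmoid(x: float) -> float:
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def par_reward(reward: float, reference_rewards: list[float]) -> float:
    """Hypothetical preference-style shaped reward.

    Assumption: the shaped reward is the sigmoid of the centered reward
    (policy response score minus reference response score), averaged over
    the available reference responses. Under a Bradley-Terry reward model,
    this value can be read as the probability that the policy response is
    preferred over the reference.
    """
    if not reference_rewards:
        raise ValueError("at least one reference reward is required")
    total = sum(stable_sigmoid(reward - ref) for ref in reference_rewards)
    return total / len(reference_rewards)


if __name__ == "__main__":
    # Raw reward-model scores (made-up numbers for illustration only).
    for raw in (-2.0, 0.0, 2.0, 20.0):
        print(raw, par_reward(raw, reference_rewards=[0.0]))
```

Under these assumptions the shaped reward stays between 0 and 1 however large the raw score grows, and it rises fastest when the response is roughly on par with the reference, then flattens out. That limits how much a policy can gain by pushing the raw reward-model score to extreme values.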
The authors evaluated PAR on two base models: Gemma2-2B and Llama3-8B. They used two datasets for evaluation:

The implementation of PAR involves the following steps:
For practitioners working with RLHF, this research offers a framework for mitigating reward hacking, one of the most significant challenges in aligning models with human values. The design principles and the PAR approach can be applied to reduce the risk of reward hacking during training. According to the authors, leveraging latent preferences improves alignment while keeping training stable and efficient.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.