
Researchers introduce Generative Reward Models that use next-token prediction to boost the reasoning abilities of large language models, offering a new method to enhance LLM performance beyond traditional ranking techniques.
In a recent paper titled "Generative Verifiers: Reward Modeling as Next-Token Prediction," researchers propose a new approach to improving the reasoning performance of large language models (LLMs). The team, led by Lunjun Zhang and including Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal, introduces Generative Reward Models (GenRMs): verifiers trained with next-token prediction, the same objective used to train the language models themselves.
Traditionally, verifiers or reward models are used to rank multiple candidate solutions generated by LLMs, with the highest-ranked candidate returned as the answer. These verifiers are typically trained as discriminative classifiers that assign each solution a score, so they do not use the LLM's ability to generate text. This is a limitation in complex reasoning tasks, where judging a solution often requires working through its steps.
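The ranking step described above is often called Best-of-N selection. As a minimal sketch, assuming the verifier is exposed as a plain scoring function: the names `best_of_n` and `toy_score` are hypothetical, and `toy_score` is a hard-coded stand-in for a trained verifier, not anything taken from the paper.

```python
from typing import Callable, Sequence


def best_of_n(question: str, candidates: Sequence[str],
              score: Callable[[str, str], float]) -> str:
    """Return the candidate solution that the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=lambda c: score(question, c))


def toy_score(question: str, solution: str) -> float:
    # Hypothetical stand-in for a trained verifier: it only rewards
    # solutions whose text ends in "4". A real verifier is a learned model.
    return 1.0 if solution.strip().endswith("4") else 0.0


candidates = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
best = best_of_n("What is 2 + 2?", candidates, toy_score)
print(best)
```

Swapping in a better `score` function changes which candidate wins without touching the selection logic, which is why verifier quality matters so much in this setup.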
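The key innovation in GenRMs is that they are trained using next-token prediction, the same objective used to train language models. This lets the verifier generate text and reason about a solution before judging it, rather than only emitting a score. The approach matters for three reasons: GenRMs fit into existing LLM architectures, they can produce chain-of-thought reasoning during verification, and they can make use of additional compute at test time.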
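One way to read "verification as next-token prediction" is to ask the model whether a solution is correct and take the probability it assigns to a "Yes" token as the score. The sketch below works under that assumption, which the article itself does not spell out. It takes raw logits for the "Yes" and "No" tokens as plain numbers, renormalizes over just those two tokens (a simplification), and averages the result over several sampled verification rationales as one possible way to spend extra test-time compute. The function names and the logit values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def p_yes(yes_logit: float, no_logit: float) -> float:
    """Probability of "Yes" after a softmax over the two token logits."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def averaged_score(logit_pairs: Sequence[Tuple[float, float]]) -> float:
    """Average the "Yes" probability over several sampled rationales."""
    if not logit_pairs:
        raise ValueError("need at least one (yes_logit, no_logit) pair")
    probs = [p_yes(y, n) for y, n in logit_pairs]
    return sum(probs) / len(probs)


# Made-up logits, one pair per sampled rationale for the same solution.
score = averaged_score([(2.0, -1.0), (0.5, 0.0), (-1.0, 1.0)])
print(round(score, 3))
```

A score produced this way can be passed to any ranking procedure that expects one number per candidate solution.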
The researchers trained GenRMs on a dataset that includes both solutions and verification rationales.

The performance of GenRMs was evaluated on several benchmark datasets, with notable improvements over existing methods.
Generative Reward Models (GenRMs) represent a significant step forward in enhancing the reasoning capabilities of LLMs. By leveraging next-token prediction, these models can integrate with existing architectures, enable detailed chain-of-thought reasoning, and use additional compute effectively. The benchmark results highlight the potential of GenRMs to improve performance in complex reasoning tasks.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
29 August 2024