Generative Verifiers: Enhancing LLM Performance with Next-Token Prediction

The Engineer

29 Aug 2024

Researchers introduce Generative Reward Models that use next-token prediction to boost the reasoning abilities of large language models, offering a new method to enhance LLM performance beyond traditional ranking techniques.

In a recent paper titled "Generative Verifiers: Reward Modeling as Next-Token Prediction," researchers propose a new way to improve the reasoning performance of large language models (LLMs). The team, led by Lunjun Zhang and including Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal, introduces Generative Reward Models (GenRMs): verifiers trained with the same next-token prediction objective as the language models whose outputs they judge.

What Changed Technically?

Traditionally, verifiers or reward models are used to rank multiple candidate solutions generated by LLMs. These verifiers are typically trained as discriminative classifiers: they assign each solution a numeric score and do not use the text-generation capabilities of the underlying LLM. This is a limitation in complex reasoning tasks, where judging a solution often requires working through its steps.
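
The ranking step described above can be sketched in a few lines. This is only an illustration of picking the top-scored candidate: `best_of_n` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for a trained verifier, not anything from the paper.

```python
from typing import Callable, Sequence


def best_of_n(
    problem: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate solution that the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=lambda solution: score(problem, solution))


def toy_score(problem: str, solution: str) -> float:
    # Hypothetical stand-in for a trained verifier: it only checks
    # whether the solution text ends with the known answer "4".
    return 1.0 if solution.strip().endswith("4") else 0.0


candidates = ["The answer is 5", "The answer is 4", "The answer is 22"]
picked = best_of_n("What is 2 + 2?", candidates, toy_score)
print(picked)
```

Any scoring function with the same signature can be swapped in, which is why the choice of verifier matters so much for the final answer.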

The key innovation in GenRMs is that they are trained using next-token prediction, a fundamental task for language models. This allows the verifier to generate text and reason about solutions in a more integrated manner. Here’s what makes this approach significant:

Seamless Integration with Instruction Tuning: Since GenRMs use the same next-token prediction objective as LLMs, they can be fine-tuned alongside other tasks without additional complexity.
Chain-of-Thought Reasoning: By generating text, GenRMs can produce step-by-step reasoning processes, which is particularly useful for algorithmic and mathematical problems.
Test-Time Compute Utilization: At inference time, GenRMs can use majority voting to improve the accuracy of their predictions. The verifier is run several times on the same solution and the resulting verdicts are aggregated, so spending more compute per solution yields a more reliable score.
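
One way to picture the scoring and voting ideas in this list: treat the verifier's score as the probability it assigns to a "Yes" verdict, and average that probability over several sampled verification passes. The sketch below makes two assumptions that are not from the article: the probability is renormalized over just the two verdict tokens, and the logits are made-up numbers rather than model outputs. The function names are hypothetical.

```python
import math
from statistics import mean
from typing import Iterable, Tuple


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Softmax over the two verdict tokens, returning P("Yes")."""
    # Subtract the max logit for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def voted_score(verdict_logits: Iterable[Tuple[float, float]]) -> float:
    """Average P("Yes") over several sampled verification passes."""
    return mean(yes_probability(y, n) for y, n in verdict_logits)


# Made-up (yes, no) logits from three hypothetical verification passes.
samples = [(2.0, 0.0), (0.0, 0.0), (-2.0, 0.0)]
print(round(voted_score(samples), 3))
```

Averaging probabilities rather than counting hard Yes/No votes keeps the score continuous, which makes it usable for ranking candidates against each other.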

Implementation Details

The researchers trained GenRMs on a dataset that includes both solutions and verification rationales. Here are some key implementation details:

Training Objective: The model is trained to predict the next token in sequences that contain the problem statement, a candidate solution, and a verdict on whether that solution is correct, optionally preceded by a verification rationale.
Dataset: A combination of real and synthetic data was used, with synthetic verification rationales generated by another LLM to provide diverse training examples.
Model Architecture: The GenRM architecture is based on transformer models, the same kind used for standard next-token prediction tasks. This lets the verifier start from a pretrained LLM rather than being trained from scratch.
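
To make the training setup more concrete, here is a minimal sketch of how a verification example might be laid out as a prompt and a target for ordinary next-token training. The template strings, field names, and the `build_example` helper are assumptions chosen for illustration; the article does not specify the exact format.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed wording for the verdict question and the rationale cue.
VERDICT_QUESTION = "Is the answer correct (Yes/No)?"
RATIONALE_CUE = "Let's verify step by step."


@dataclass
class VerificationExample:
    prompt: str  # text the model conditions on
    target: str  # text the model is trained to generate


def build_example(
    problem: str,
    solution: str,
    is_correct: bool,
    rationale: Optional[str] = None,
) -> VerificationExample:
    """Format one (problem, solution, label) triple for next-token training."""
    verdict = "Yes" if is_correct else "No"
    prompt = f"Problem: {problem}\nSolution: {solution}\n"
    if rationale is None:
        # Direct variant: the target is just the verdict token.
        return VerificationExample(prompt + VERDICT_QUESTION, verdict)
    # Rationale variant: the model writes its reasoning, then the verdict.
    target = f"{rationale}\n{VERDICT_QUESTION} {verdict}"
    return VerificationExample(prompt + RATIONALE_CUE, target)


direct = build_example("What is 2 + 2?", "2 + 2 = 5", is_correct=False)
with_rationale = build_example(
    "What is 2 + 2?",
    "2 + 2 = 4",
    is_correct=True,
    rationale="Adding 2 and 2 gives 4, which matches the solution.",
)
print(direct.target)
print(with_rationale.target)
```

Because both variants are plain text pairs, they can be mixed into the same fine-tuning data as any other instruction-following examples.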

Benchmarks and Results

The performance of GenRMs was evaluated on several benchmark datasets, with notable improvements over existing methods:

Algorithmic Tasks: On algorithmic reasoning tasks, GenRM improved performance from 5% to 45.3%.
GSM8K: For the GSM8K dataset, which focuses on multi-step arithmetic problems, GenRM achieved a significant jump from 73% to 93.4% accuracy.
MATH and MMLU Abstract Algebra: In easy-to-hard generalization settings, GenRM improved results from 28% to 44.6% on the MATH dataset and from 37.9% to 53.5% on the MMLU abstract algebra subset.

Additional Findings

Error Detection: The researchers found that training GenRMs with synthetic verification rationales was sufficient to identify subtle errors in math problems, demonstrating the model's ability to reason about complex solutions.
Scalability: GenRM scales well with both model size and test-time compute. Larger models and more verification samples at inference time consistently lead to better performance.

Conclusion

Generative Reward Models (GenRMs) represent a significant step forward in enhancing the reasoning capabilities of LLMs. By leveraging next-token prediction, these models can integrate seamlessly with existing architectures, enable detailed chain-of-thought reasoning, and utilize additional compute resources effectively. The results on various benchmarks highlight the potential of GenRMs to significantly improve performance in complex reasoning tasks.