Sakana AI Leverages LLMs to Discover Novel Preference Optimization Algorithms with LLM²


The Engineer

14 Jun 2024

Sakana AI's LLM² technique uses large language models to discover new preference optimization algorithms. Its first result, DiscoPOP, is a loss function for better aligning AI with human preferences.

At Sakana AI, we're exploring a self-referential approach called LLM² (‘LLM-squared’) to improve the training of Large Language Models (LLMs). The method uses LLMs themselves to discover new algorithms for preference optimization, a critical component in aligning LLMs with human preferences. Our recent report, "Discovering Preference Optimization Algorithms with and for Large Language Models," describes the process and introduces DiscoPOP, a loss function that achieves state-of-the-art performance.

The Evolution of AI Research

Historically, the development of deep learning models has relied heavily on researchers' trial and error and on theoretical insights. This is particularly true for preference optimization algorithms, which are essential for ensuring that LLMs align with human values. Meanwhile, LLMs have become increasingly sophisticated and can now generate hypotheses and write code. This raises an intriguing question: can we use AI to automate the process of AI research and discovery?

The Role of Evolutionary Algorithms

Earlier this year, Sakana AI started using evolutionary algorithms to improve the training of foundation models such as LLMs. These algorithms are inspired by natural selection: they iteratively refine a population of candidate solutions through mutation, crossover, and selection. In a recent paper, we also demonstrated that LLMs can themselves act as effective evolutionary algorithms.
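
To make mutation, crossover, and selection concrete, here is a minimal sketch of a generic evolutionary loop in Python. It is not the method from the paper mentioned above: the bit-string genome, the count-the-ones fitness function, and every name and parameter value are assumptions chosen only for illustration.

```python
import random


def fitness(genome):
    # Toy objective (an assumption for illustration): number of 1-bits.
    return sum(genome)


def mutate(genome, rate, rng):
    # Mutation: flip each bit independently with probability `rate`.
    return [1 - g if rng.random() < rate else g for g in genome]


def crossover(a, b, rng):
    # Single-point crossover: prefix from one parent, suffix from the other.
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(genome_len=20, pop_size=30, generations=40, rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)
    ]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        # Selection: keep the fitter half and carry it over unchanged.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        history.append(fitness(parents[0]))
        # Refill the population with mutated offspring of random parent pairs.
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rate, rng))
        population = parents + children
    return max(population, key=fitness), history


if __name__ == "__main__":
    best, history = evolve()
    print(fitness(best), history[0], history[-1])
```

Because the selected parents are carried over unchanged, the best fitness in this sketch never decreases from one generation to the next.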

The LLM² Process

Given these promising results, we embarked on a project to use LLMs to discover new algorithms for training LLMs. We call this process LLM² (‘LLM-squared’), drawing inspiration from meta-learning techniques. Here’s how it works:

Proposal and Synthesis: We set up an LLM-driven discovery pipeline where the model proposes novel preference optimization algorithms.
Evaluation and Selection: These proposed algorithms are evaluated using a suite of held-out tasks, and the best-performing ones are selected for further refinement.
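
The two steps above can be sketched as a small propose-evaluate-select loop. This is only an illustration under stated assumptions, not the released pipeline: `propose` is a hypothetical stand-in that returns canned source code instead of calling an LLM, `evaluate` is a toy proxy that fits a single scalar instead of fine-tuning and benchmarking a model, and `MARGINS` is made-up data.

```python
import math

# Hypothetical stand-in for LLM output: each proposal is source code for a
# loss over the preference margin `rho`. A real pipeline would prompt an LLM
# with the archive of earlier candidates and their scores.
CANNED_PROPOSALS = [
    "def loss(rho):\n    return math.log1p(math.exp(-rho))\n",  # logistic
    "def loss(rho):\n    return max(0.0, 1.0 - rho)\n",         # hinge
    "def loss(rho):\n    return rho\n",                          # wrong sign
    "def loss(rho) return rho\n",                                # invalid syntax
]

# Made-up preference margins; negative values play the role of noisy labels.
MARGINS = [2.0, 1.5, 1.0, 0.5, 0.8, 1.2, -0.3, -0.6]


def propose(archive):
    # The archive is ignored here; an LLM would use it as context.
    return CANNED_PROPOSALS[len(archive) % len(CANNED_PROPOSALS)]


def compile_candidate(source):
    """Turn proposed source into a callable, or return None if unusable."""
    namespace = {"math": math}
    try:
        exec(source, namespace)  # untrusted code belongs in a sandbox
        loss_fn = namespace["loss"]
        loss_fn(0.0)  # smoke test
        return loss_fn
    except Exception:
        return None


def evaluate(loss_fn, margins, steps=50, lr=0.1, eps=1e-4):
    """Toy proxy for 'train, then benchmark': fit a scalar w, report accuracy."""

    def mean_loss(w):
        return sum(loss_fn(w * d) for d in margins) / len(margins)

    w = 0.0
    for _ in range(steps):
        # Central finite difference, so any proposed loss can be trained.
        grad = (mean_loss(w + eps) - mean_loss(w - eps)) / (2 * eps)
        w -= lr * grad
    return sum(1 for d in margins if w * d > 0) / len(margins)


def discover(margins, generations=4):
    archive = []  # (source, score) pairs, fed back to the proposer
    for _ in range(generations):
        source = propose(archive)
        loss_fn = compile_candidate(source)
        score = evaluate(loss_fn, margins) if loss_fn else float("-inf")
        archive.append((source, score))
    best = max(archive, key=lambda item: item[1])
    return best, archive


if __name__ == "__main__":
    best, archive = discover(MARGINS)
    print(best[1], [score for _, score in archive])
```

Scoring invalid proposals as negative infinity keeps them in the archive, so a real proposer could see which attempts failed, while ensuring they are never selected.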

Discovering DiscoPOP

One of the key outcomes of this process is the discovery of a new loss function called Discovered Preference Optimization (DiscoPOP). Here’s what makes DiscoPOP stand out:

Performance: DiscoPOP achieves state-of-the-art performance across multiple evaluation tasks, outperforming existing methods like Direct Preference Optimization (DPO).
Surprising Features: Our analysis reveals surprising and counterintuitive features of DiscoPOP, suggesting that it captures aspects of preference alignment that were previously overlooked.
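
For readers who want to see the two objectives side by side, the sketch below writes the DPO loss and a DiscoPOP-style loss for a single preference pair in plain Python. It assumes the log-ratio modulated form described in the report, in which a sigmoid gate blends a logistic term with an exponential term; the function names and the default values of `beta` and `tau` are assumptions here and should be checked against the released objective functions.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid(x):
    return math.exp(log_sigmoid(x))


def scaled_log_ratio(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.05):
    # rho = beta * [(policy log-ratio) - (reference log-ratio)].
    policy_diff = policy_chosen_logp - policy_rejected_logp
    ref_diff = ref_chosen_logp - ref_rejected_logp
    return beta * (policy_diff - ref_diff)


def dpo_loss(rho):
    # Direct Preference Optimization: logistic loss on the scaled margin.
    return -log_sigmoid(rho)


def discopop_loss(rho, tau=0.05):
    # Assumed form: a gate sigmoid(rho / tau) mixes the logistic (DPO) term
    # with an exponential term. The default tau is an assumption.
    gate = sigmoid(rho / tau)
    logistic_term = -log_sigmoid(rho)
    exponential_term = math.exp(-rho)
    return (1.0 - gate) * logistic_term + gate * exponential_term


if __name__ == "__main__":
    rho = scaled_log_ratio(-1.0, -2.0, -1.5, -1.5, beta=0.1)
    print(rho, dpo_loss(rho), discopop_loss(rho))
```

In this form the gate is close to 0 for clearly negative margins, where the loss follows the logistic term, and close to 1 for clearly positive margins, where it follows the exponential term.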

Technical Details

To give you a deeper understanding, here are some technical details:

Pipeline Architecture:
- Proposal Module: An LLM generates candidate algorithms.
- Evaluation Module: A set of evaluation tasks assesses the performance of these candidates.
- Selection Module: The best-performing algorithms are selected for further refinement.
Benchmarks:
- DiscoPOP consistently outperforms DPO and other existing methods across a variety of held-out tasks, demonstrating its robustness and effectiveness.

Open-Sourcing

We are committed to transparency and collaboration. Therefore, we open-source the following:

Model Checkpoints: Tuned model checkpoints for DiscoPOP.
Objective Functions: The discovered objective functions used in the preference optimization process.
Codebase: The complete codebase for running the discovery pipeline is available on GitHub and Hugging Face.

We are also proud to have collaborated with the University of Oxford and the University of Cambridge on this project.

Future Implications

The potential of our method is considerable. By automating the discovery of new optimization algorithms, we can reduce the need for extensive computational resources and search a wider space of candidate loss functions. This could improve the capabilities of LLMs across applications and make AI research itself more efficient and effective.

Ultimately, we envision a future where LLM² becomes a standard tool in the AI researcher's toolkit, enabling faster and more innovative developments in the field.