Emulating Fine-Tuning with Small Models to Enhance Large Language Models

The Engineer

30 Oct 2023

Researchers at Stanford University and Google unveil Emulated Fine-Tuning, a technique that isolates the effects of pre-training and fine-tuning in large language models, enhancing their accuracy without full-scale retraining.

In a recent paper, researchers from Stanford University and Google introduce a technique called "Emulated Fine-Tuning" (EFT) that decouples the knowledge gained during the pre-training stage of large language models (LLMs) from the behavior learned during the fine-tuning stage. This makes it possible to study how each stage contributes to model performance, particularly in terms of helpfulness and factuality.

What Changed Technically

Traditionally, LLMs are built using a two-stage pipeline: pre-training on vast amounts of diverse text data and fine-tuning (or alignment) on targeted examples to refine specific behaviors. The assumption has been that pre-training imparts broad knowledge and skills, while fine-tuning filters and refines this knowledge. However, this hypothesis hasn't been thoroughly tested.

To address this gap, the researchers developed EFT, a method for sampling from a distribution that approximates the result of pre-training at one model scale and fine-tuning at another. The method is grounded in a reinforcement learning (RL) framework, but it is applied when sampling from the models rather than through a new training run. This allows a direct comparison of how scaling each stage up or down affects model performance.

Key Findings

Scaling Fine-Tuning Improves Helpfulness: The study found that scaling up the fine-tuning stage (that is, the model scale at which fine-tuning is emulated) tends to improve helpfulness, making the model better at following instructions and providing useful responses.
Scaling Pre-Training Enhances Factuality: Conversely, scaling up the pre-training stage (the scale of the pre-trained model) tends to improve factuality, meaning the information in responses is more often accurate.

How It Works

EFT builds an emulator that approximates the behavior of a large LLM fine-tuned on a specific task, without actually fine-tuning the large model. It rests on two ingredients:

RL Framework: The researchers build on an RL-based framework inspired by recent advancements in learning from human preferences. In this view, what a model learns during fine-tuning can be described by how its token log-probabilities shift relative to its own pre-trained base.
Distribution Sampling: EFT samples from a distribution that combines the pre-trained knowledge of a model at one scale with the fine-tuning shift of a model at another scale. This decouples the contribution of each stage and makes their individual effects measurable.
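
The combination described above can be sketched for a single next-token distribution. This is a minimal illustration under stated assumptions, not the authors' code: it assumes all three models share a vocabulary, and the function name eft_next_token_logprobs and the toy numbers are hypothetical. It adds the small model's fine-tuning shift (fine-tuned minus pre-trained log-probabilities) to the large pre-trained model's log-probabilities and renormalizes.

```python
import math


def log_softmax(scores):
    """Turn unnormalized log-scores into log-probabilities (numerically stable)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def eft_next_token_logprobs(large_base, small_ft, small_base):
    """Emulate fine-tuning the large model for one decoding step.

    large_base: next-token log-probs from the large pre-trained model
    small_ft:   next-token log-probs from the small fine-tuned model
    small_base: next-token log-probs from the small pre-trained model
    All three lists are assumed to be indexed by the same vocabulary.
    """
    combined = [
        lb + (sf - sb)  # large model's knowledge plus the small model's shift
        for lb, sf, sb in zip(large_base, small_ft, small_base)
    ]
    return log_softmax(combined)


# Toy three-token vocabulary with made-up scores (illustrative only).
large_base = log_softmax([2.0, 1.0, 0.0])
small_base = log_softmax([1.0, 1.0, 0.0])
small_ft = log_softmax([0.0, 2.0, 0.0])

emulated = eft_next_token_logprobs(large_base, small_ft, small_base)
probs = [math.exp(lp) for lp in emulated]
print([round(p, 3) for p in probs])
```

If the small fine-tuned model is identical to its base, the shift is zero and the result is just the large pre-trained model's distribution, which is a quick sanity check for any implementation along these lines.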

Practical Applications

A notable application of EFT is "LM up-scaling," a special case in which a small fine-tuned model is ensembled with a large pre-trained model to emulate the result of fine-tuning the large model. This approach offers several benefits:

Resource Efficiency: Avoids the resource-intensive process of fine-tuning large pre-trained models.
Performance Improvement: Consistently improves both helpfulness and factuality in instruction-following models across different LLM families, including Llama, Llama-2, and Falcon.
Flexibility: Enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
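
One way to picture the test-time adjustment mentioned above is as a dial between two fine-tuning shifts. The sketch below is an assumption-laden illustration rather than the paper's exact procedure: it assumes each trait is represented by its own log-probability shift over a shared vocabulary and that the two shifts are mixed linearly with a weight lam. The function name blend_traits, the mixing rule, and the toy numbers are all hypothetical.

```python
import math


def log_softmax(scores):
    """Turn unnormalized log-scores into log-probabilities (numerically stable)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def blend_traits(base, delta_helpful, delta_harmless, lam):
    """Mix two fine-tuning shifts on top of a pre-trained distribution.

    base:           next-token log-probs from the pre-trained model
    delta_helpful:  log-prob shift attributed to helpfulness tuning (assumed given)
    delta_harmless: log-prob shift attributed to harmlessness tuning (assumed given)
    lam:            1.0 uses only the helpful shift, 0.0 only the harmless one
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    combined = [
        b + lam * dh + (1.0 - lam) * ds
        for b, dh, ds in zip(base, delta_helpful, delta_harmless)
    ]
    return log_softmax(combined)


# Toy example: token 0 is favored by the helpful shift, token 2 by the harmless one.
base = log_softmax([1.0, 1.0, 1.0])
delta_helpful = [2.0, 0.0, 0.0]
delta_harmless = [0.0, 0.0, 2.0]

for lam in (0.0, 0.5, 1.0):
    step_probs = [math.exp(lp) for lp in blend_traits(base, delta_helpful, delta_harmless, lam)]
    print(lam, [round(p, 3) for p in step_probs])
```

Because lam is only read at sampling time, changing it requires no retraining, which is the property the Flexibility point refers to.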

Implementation Details

Ensembling Models: Up-scaling combines the next-token predictions of a large pre-trained model with those of a small fine-tuned model at decoding time. The small model's fine-tuning shift, measured against its own pre-trained base, is applied on top of the large model's log-probabilities, so no task-specific weighting has to be chosen.
Hyperparameter-Free: Unlike traditional fine-tuning, which often requires extensive hyperparameter tuning, LM up-scaling requires no additional hyperparameters or training.
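
To make the decoding-time ensembling concrete, here is a small sketch of an up-scaled generation loop. It is illustrative only: the three "models" are hypothetical stand-in functions that map a token prefix to next-token log-probabilities, the name upscaled_greedy_decode is made up for this example, and greedy decoding is used purely to keep the output deterministic (sampling from the combined distribution would work the same way).

```python
import math
from typing import Callable, List, Sequence

# A "model" here is any function from a token prefix to next-token log-probs.
LogProbFn = Callable[[Sequence[int]], List[float]]


def log_softmax(scores):
    """Turn unnormalized log-scores into log-probabilities (numerically stable)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def upscaled_greedy_decode(
    large_base: LogProbFn,
    small_ft: LogProbFn,
    small_base: LogProbFn,
    prompt: Sequence[int],
    eos_id: int,
    max_new_tokens: int = 16,
) -> List[int]:
    """Greedy decoding from the up-scaled (emulated) distribution."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        lb, sf, sb = large_base(tokens), small_ft(tokens), small_base(tokens)
        # Large pre-trained log-probs plus the small model's fine-tuning shift.
        scores = log_softmax([l + (f - s) for l, f, s in zip(lb, sf, sb)])
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens[len(prompt):]


# Hypothetical toy models over a 3-token vocabulary (token 0 is end-of-sequence).
def toy_large_base(prefix):
    return log_softmax([0.0, 2.0, 1.0])  # prefers token 1, rarely stops


def toy_small_base(prefix):
    return log_softmax([0.0, 1.0, 1.0])


def toy_small_ft(prefix):
    # Pretend fine-tuning taught the small model to stop after a short answer.
    if len(prefix) >= 3:
        return log_softmax([4.0, 1.0, 1.0])
    return log_softmax([0.0, 1.0, 1.0])


generated = upscaled_greedy_decode(
    toy_large_base, toy_small_ft, toy_small_base, prompt=[2], eos_id=0
)
print(generated)
```

In this toy setup the large model chooses the content tokens while the small fine-tuned model contributes the learned habit of stopping. Note that every decoding step queries all three models.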

Why It Matters to Practitioners

For practitioners working with LLMs, EFT and LM up-scaling offer several practical advantages:

Insight into Model Behavior: By decoupling pre-training and fine-tuning, practitioners can better understand how each stage contributes to model performance.
Efficient Fine-Tuning: Emulating the result of fine-tuning a large model with smaller models avoids the compute and time of fine-tuning the large model directly.
Customizable Models: Test-time adjustment of behavioral traits allows a model's behavior to be tuned at sampling time, without additional training.

Conclusion

The introduction of Emulated Fine-Tuning (EFT) marks a significant step forward in understanding and optimizing the behavior of large language models. By decoupling pre-training and fine-tuning, researchers can gain deeper insights into how these stages contribute to model performance, leading to more efficient and effective LLMs.