Large-Scale Model Merging: Insights and Best Practices for Combining Expert Models

The Engineer

9 Oct 2024 · 4 min read

Researchers study how model merging behaves at large scale, showing how factors such as base model quality, model size, and expert count influence the merged model's performance, a question that earlier work on small models left largely open.

The paper "What Matters for Model Merging at Scale?" by Prateek Yadav, Tu Vu, Jonathan Lai, Alexandra Chronopoulou, Manaal Faruqui, Mohit Bansal, and Tsendsuren Munkhdalai examines how base model quality, model size, the number of expert models, and the choice of merging method affect the performance of merged models. The work addresses a gap in the literature: most previous studies merged only a few small models, leaving open how merging behaves as models and expert counts scale up.

Key Findings

Base Model Quality: Merging is more effective when experts are created from strong base models, that is, models that already have good zero-shot performance.
Model Size: Larger models facilitate easier and more effective merging. This suggests that the benefits of model merging become more pronounced as you scale up.
Generalization: Merging consistently improves generalization capabilities, especially when combining multiple large expert models.
Expert Model Count: Larger base models allow more expert models to be merged effectively.
Merging Methods: Different methods (Averaging, Task Arithmetic, Dare, and TIES) behave similarly at larger scales, indicating that the choice of method becomes less critical as you scale.

Experimental Setup

The researchers conducted experiments by merging fully fine-tuned models using four popular merging methods:

Averaging: A simple method where the parameters of the expert models are averaged.
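Task Arithmetic: Computes a task vector for each expert (its fine-tuned weights minus the base model's weights), then adds a scaled sum of these task vectors to the base model.
Dare: Randomly drops a fraction of each expert's task-vector entries and rescales the remaining ones to compensate, then combines the sparsified task vectors.
TIES: Trims low-magnitude entries from each task vector, elects a sign for each parameter, and merges only the entries that agree with the elected sign.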
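To make the four methods concrete, here is a minimal NumPy sketch that treats each model as a flat parameter vector. It is an illustration under stated assumptions, not the paper's implementation: the function names, the hyperparameters (lam, drop_prob, keep_frac), and their default values are hypothetical, and the Dare sketch combines the sparsified task vectors by simple summation, which is only one possible choice.

```python
import numpy as np

def average(experts):
    """Averaging: element-wise mean of the expert parameter vectors."""
    return np.mean(experts, axis=0)

def task_arithmetic(base, experts, lam=1.0):
    """Task Arithmetic: add a scaled sum of task vectors (expert - base) to the base."""
    task_vectors = [e - base for e in experts]
    return base + lam * np.sum(task_vectors, axis=0)

def dare(base, experts, drop_prob=0.5, lam=1.0, seed=0):
    """Dare: randomly drop task-vector entries and rescale survivors by 1 / (1 - p).

    Assumption: the sparsified task vectors are summed as in task arithmetic.
    """
    rng = np.random.default_rng(seed)
    merged_delta = np.zeros_like(base)
    for e in experts:
        delta = e - base
        keep = rng.random(delta.shape) >= drop_prob  # True where the entry survives
        merged_delta += keep * delta / (1.0 - drop_prob)
    return base + lam * merged_delta

def ties(base, experts, keep_frac=0.2, lam=1.0):
    """TIES: trim small entries, elect a sign per parameter, average agreeing entries."""
    deltas = np.stack([e - base for e in experts])
    k = max(1, int(round(keep_frac * deltas.shape[1])))
    # Trim: keep only the k largest-magnitude entries of each task vector.
    trimmed = np.zeros_like(deltas)
    for i, d in enumerate(deltas):
        top = np.argsort(np.abs(d))[-k:]
        trimmed[i, top] = d[top]
    # Elect: the sign of the summed trimmed values wins for each parameter.
    elected = np.sign(trimmed.sum(axis=0))
    # Merge: average only the nonzero entries whose sign matches the elected sign.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_delta = (trimmed * agree).sum(axis=0) / counts
    return base + lam * merged_delta
```

In this sketch, task arithmetic with lam set to 1 divided by the number of experts reproduces plain averaging, and Dare with drop_prob=0 reduces to task arithmetic, which shows how closely related the methods are.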

They experimented with model sizes ranging from 1B to 64B parameters and merged up to 8 different expert models. The evaluation was done on both held-in tasks (tasks the experts were trained on) and zero-shot generalization to unseen held-out tasks.
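The held-in versus held-out split can be sketched as a small helper that averages per-task scores separately for the two groups. This is only an illustration: the function name summarize and the task names in the usage note are hypothetical, and the paper's actual metrics and any normalization are not reproduced here.

```python
from statistics import mean

def summarize(scores, held_in_tasks):
    """Average per-task scores separately for held-in and held-out tasks.

    scores: dict mapping task name -> score of the merged model on that task.
    held_in_tasks: set of task names the experts were fine-tuned on;
    every other task in `scores` is treated as held-out (zero-shot).
    Both groups must be non-empty, otherwise statistics.mean raises an error.
    """
    held_in = [s for task, s in scores.items() if task in held_in_tasks]
    held_out = [s for task, s in scores.items() if task not in held_in_tasks]
    return {"held_in": mean(held_in), "held_out": mean(held_out)}
```

For example, summarize({"task_a": 0.5, "task_b": 0.7, "task_c": 0.6}, {"task_a", "task_b"}) averages task_a and task_b as held-in and reports task_c as held-out; the scores here are placeholders, not results from the study.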

Results

Base Model Quality: Strong base models, those with good zero-shot performance, led to more effective merging. This is crucial because it suggests that starting with a robust foundation can significantly enhance the merged model's capabilities.
Model Size: Larger models (64B parameters) were easier to merge and resulted in better performance compared to smaller models (1B parameters). This finding is particularly important for organizations looking to scale their merging efforts.
Generalization: Merging consistently improved zero-shot generalization to held-out tasks. Notably, when 8 large expert models were merged, the merged model often generalized better than a multitask-trained model. This is a significant advantage in scenarios where zero-shot performance on unseen tasks is critical.
Expert Model Count: More expert models could be merged effectively when using larger base models. Larger models tolerated the merging of many experts better than smaller ones, so merging as many as 8 large expert models remained feasible.
Merging Methods: At larger scales (64B parameters), the performance of different merging methods became more consistent. This implies that while method choice is important for small models, it becomes less critical as you scale up.

Implications

This study provides valuable insights into the best practices for model merging at scale. For practitioners, the key takeaways are:

Start with a Strong Base Model: Ensure your base model has good zero-shot performance to maximize the benefits of merging.
Scale Up When Possible: Larger models not only facilitate easier merging but also lead to better generalization and overall performance.
Combine Multiple Experts: Merging more expert models can enhance the merged model's capabilities, especially when using larger base models.
Choose Any Merging Method for Large Models: At larger scales, different methods perform similarly, giving you flexibility in method selection.

Conclusion

The study clarifies how model merging behaves at scale: strong base models, larger model sizes, and more experts all interact to determine how well merging works, while the choice of merging method matters less as models grow. These findings give researchers and practitioners practical guidance for merging multiple expert models and a reference point for future work.