Mixtral 8x22B: A Sparse Mixture-of-Experts Model with Unmatched Efficiency and Multilingual Capabilities

Models & Research

The Engineer

18 Apr 2024 · 3 min read

Mistral AI’s new Mixtral 8x22B cuts inference costs through sparse activation: it uses just 39 billion of its 141 billion parameters at a time while delivering strong multilingual performance.

Mistral AI has just released Mixtral 8x22B, its latest open-source model, which sets a new standard for performance and efficiency. This sparse Mixture-of-Experts (SMoE) model uses only 39 billion active parameters out of a total of 141 billion, making it highly cost-efficient for its size while delivering top-tier performance.

Technical Breakdown

Sparse Activation Patterns

Mixtral 8x22B leverages sparse activation patterns, which means that only a subset of the model's parameters (39 billion of 141 billion) is activated during inference. This approach significantly reduces the computation needed and makes the model faster than dense 70-billion-parameter models, even though its total parameter count is larger. The key benefits include:

Cost Efficiency: Lower hardware and energy costs
Performance-to-Cost Ratio: Best in class for its size
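
To make the idea of sparse activation concrete, here is a minimal sketch of expert routing for a single token. It is illustrative only: the expert count is read off the "8x" in the model name, the choice of running two experts per token is an assumption, and the experts are toy scalar functions rather than real feed-forward blocks.

```python
import math

NUM_EXPERTS = 8  # assumption: inferred from the "8x" in the model name
TOP_K = 2        # assumption: number of experts run per token


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and renormalise their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(token, router_logits, experts):
    """Weighted sum of the chosen experts' outputs; the other experts never run."""
    output = 0.0
    for index, weight in route(router_logits):
        output += weight * experts[index](token)
    return output


# Toy experts: expert k simply multiplies its input by (k + 1).
experts = [lambda x, k=k: x * (k + 1) for k in range(NUM_EXPERTS)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]

selected = route(logits)
result = moe_layer(3.0, logits, experts)
print(selected, result)
```

Only the selected experts are evaluated for the token, which is where the saving comes from: the parameters of the remaining experts sit idle for that step.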

Multilingual Fluency

The model is fluent in multiple languages, including English, French, Italian, German, and Spanish. This multilingual capability is crucial for applications that need to handle diverse user bases or content.

Function Calling

One of the standout features of Mixtral 8x22B is its native ability to call functions. Combined with the constrained output mode on la Plateforme, this enables developers to build sophisticated applications and modernize tech stacks more efficiently.
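
As a rough illustration of what function calling involves on the application side, the sketch below describes one tool and dispatches a tool call that a model might return. Everything here is hypothetical: the `get_weather` tool, the field names in the tool description, and the JSON shape of the model output are assumptions for illustration, not the format used by la Plateforme.

```python
import json

# Hypothetical tool description in JSON Schema style; field names are illustrative.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Return the current temperature for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def get_weather(city):
    """Toy implementation that returns canned data."""
    return {"city": city, "temperature_c": 18}


# Map each tool name to its description and the local function that implements it.
TOOLS = {"get_weather": (WEATHER_TOOL, get_weather)}


def dispatch(tool_call_json):
    """Parse a model-produced tool call and run the matching local function."""
    call = json.loads(tool_call_json)
    if call["name"] not in TOOLS:
        raise ValueError(f"unknown tool: {call['name']}")
    schema, function = TOOLS[call["name"]]
    arguments = call.get("arguments", {})
    missing = [key for key in schema["parameters"]["required"]
               if key not in arguments]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return function(**arguments)


# Assumed shape of a tool call emitted by the model.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
reply = dispatch(model_output)
print(reply)
```

Validating the arguments before calling the function matters because the call originates from model output; a constrained output mode can reduce malformed calls, but the application still decides what actually runs.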

Large Context Window

With a context window of up to 64,000 tokens, Mixtral 8x22B can process and recall information from large documents with high precision. This is particularly useful for tasks that require understanding long-form content, such as summarization or question-answering.
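
In practice, a long document still has to be checked against the window before it is sent. The sketch below shows one way to do that. The `count_tokens` helper is a hypothetical stand-in that counts whitespace-separated words; a real application would use the model's own tokenizer, which generally yields more tokens than words.

```python
CONTEXT_WINDOW = 64_000  # tokens, the figure quoted above


def count_tokens(text):
    """Hypothetical stand-in: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def fits(document, prompt, reserve_for_answer=1_000):
    """True if the prompt plus the document leave room for the reply."""
    used = count_tokens(prompt) + count_tokens(document)
    return used + reserve_for_answer <= CONTEXT_WINDOW


def split_to_fit(document, budget):
    """Split a document into chunks of at most `budget` pseudo-tokens."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    words = document.split()
    return [" ".join(words[i:i + budget])
            for i in range(0, len(words), budget)]


prompt = "Summarize the following report."
document = "word " * 70_000  # deliberately longer than the window

if fits(document, prompt):
    chunks = [document]
else:
    chunks = split_to_fit(document, budget=60_000)
print(len(chunks))
```

The reserve for the answer is an arbitrary choice here; the point is that the prompt, the document, and the generated reply all share the same window.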

Truly Open

Mistral AI believes in the power of openness to drive innovation and collaboration. Therefore, Mixtral 8x22B is released under the Apache 2.0 license, which is one of the most permissive open-source licenses. This allows anyone to use, modify, and distribute the model, including commercially, provided the license's attribution and notice conditions are met.

Efficiency at Its Finest

Mistral AI has a track record of building models that offer unmatched cost efficiency for their sizes. Mixtral 8x22B continues this tradition by outperforming dense models while maintaining a lower computational footprint. Here’s how it stacks up:

Faster than Dense Models: Runs faster than dense 70-billion-parameter models
More Capable than Other Open Models: Superior performance compared to both permissively and restrictively licensed open models
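
A back-of-the-envelope calculation helps explain the speed claim. The sketch below uses the parameter counts quoted in this article together with two assumptions that are not from the article: a common rule of thumb of about 2 FLOPs per active parameter per token, and 16-bit storage for the weights. It ignores attention cost and memory bandwidth, so treat the numbers as rough.

```python
ACTIVE_PARAMS = 39e9   # active parameters, from the article
TOTAL_PARAMS = 141e9   # total parameters, from the article
DENSE_PARAMS = 70e9    # dense comparison size, from the article


def flops_per_token(active_params):
    """Rule of thumb (assumption): about 2 FLOPs per active parameter per token."""
    return 2 * active_params


def weight_memory_gb(params, bytes_per_param=2):
    """Memory for the weights alone, assuming 16-bit storage by default."""
    return params * bytes_per_param / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
compute_ratio = flops_per_token(ACTIVE_PARAMS) / flops_per_token(DENSE_PARAMS)
memory_gb = weight_memory_gb(TOTAL_PARAMS)

print(f"active fraction: {active_fraction:.1%}")
print(f"compute vs dense 70B: {compute_ratio:.2f}x")
print(f"weights at 16-bit: {memory_gb:.0f} GB")
```

Under these assumptions, roughly 28% of the parameters are active per token and per-token compute is a little over half that of a dense 70-billion-parameter model. The saving is in compute rather than memory: all 141 billion parameters still have to be stored, since any expert may be selected.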

Unmatched Open Performance

Reasoning and Knowledge

Mixtral 8x22B excels in reasoning tasks, as demonstrated by its performance on various benchmarks:

MMLU (Massive Multitask Language Understanding)
HellaSwag (10-shot)
WinoGrande (5-shot)
ARC Challenge (5-shot and 25-shot)
TriviaQA (5-shot)
Natural Questions (5-shot)
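
The "k-shot" labels above refer to how many solved examples are placed in the prompt before the question being scored. The sketch below shows the general idea; the `Question:`/`Answer:` template and the sample data are assumptions for illustration, since the article does not specify the prompts used for the reported results.

```python
def build_few_shot_prompt(examples, question, shots):
    """Prepend `shots` solved examples to the question being evaluated."""
    if shots > len(examples):
        raise ValueError("not enough examples for the requested number of shots")
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:shots]]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Illustrative examples only; real benchmarks draw these from their own data.
examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Italy?", "Rome"),
    ("What is the capital of Germany?", "Berlin"),
    ("What is the capital of Portugal?", "Lisbon"),
    ("What is the capital of Austria?", "Vienna"),
    ("What is the capital of Belgium?", "Brussels"),
]

prompt = build_few_shot_prompt(examples, "What is the capital of Spain?", shots=5)
print(prompt)
```

A 5-shot prompt therefore contains five worked examples plus the unanswered question, and the model's continuation after the final `Answer:` is what gets scored.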

Multilingual Capabilities

The model's multilingual capabilities are particularly strong, outperforming LLaMA 2 70B on several benchmarks:

HellaSwag
ARC Challenge
MMLU

These benchmarks cover French, German, Spanish, and Italian.

Maths & Coding

In addition to its linguistic prowess, Mixtral 8x22B excels in coding and mathematics tasks. It outperforms other open models on popular benchmarks for these domains.

Conclusion

Mixtral 8x22B is a significant step forward in the development of efficient and capable AI models. Its sparse activation patterns, multilingual fluency, native function calling, and large context window make it a versatile tool for a wide range of applications. The model's open-source nature under Apache 2.0 ensures that it can be widely adopted and adapted by developers and researchers alike.