MiniMax Teases M3 Model with Sparse Attention for 15.6x Faster Long-Context Decoding

The Engineer

3 Jun 2026

Chinese AI company MiniMax has previewed its upcoming M3 model, which pairs a custom sub-quadratic attention framework with a claimed speed-up of up to 15.6x in long-context decoding.

Among the many Chinese AI companies vying for global market share, MiniMax stands out for shipping models across several modalities. The company's Hailuo series, for instance, excels in video generation under permissive open-source licenses. Now, MiniMax has published a detailed technical report on its M2 series of language models, along with a preview of the upcoming M3 model.

The M2 series, which includes the popular M2, M2.5, and M2.7 models, has consistently posted top benchmark scores among open-source models. Although MiniMax has since been eclipsed by other Chinese labs such as DeepSeek and Xiaomi, its new report offers valuable insight into its engineering and design choices.

Sparse Attention and Sub-Quadratic Framework

The core of the M3 model is a novel sparse attention mechanism that significantly accelerates decoding for long contexts. According to MiniMax, this approach can boost response speed by up to 15.6 times at one million tokens, making ultra-long-context AI agent deployment economically viable.
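The article does not describe the M3 mechanism in enough detail to reproduce it, so the following is only a generic sketch of one well-known family of sparse attention (block selection) for a single decode step, written in plain Python. The function names, the mean-key block summary, and the `block_size` and `top_blocks` parameters are all illustrative assumptions, not MiniMax's design.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _weighted_sum(scores, values):
    """Softmax over scores, then a weighted average of the value vectors."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]

def dense_decode_step(q, keys, values):
    """Standard attention for one new token: scores every cached key."""
    scale = math.sqrt(len(q))
    scores = [_dot(q, k) / scale for k in keys]
    return _weighted_sum(scores, values)

def sparse_decode_step(q, keys, values, block_size, top_blocks):
    """Hypothetical block-sparse variant: score one mean key per block,
    then attend only to the tokens inside the best-scoring blocks."""
    summaries = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        summaries.append([sum(col) / len(block) for col in zip(*block)])
    block_scores = [_dot(q, s) for s in summaries]
    ranked = sorted(range(len(summaries)), key=lambda i: block_scores[i], reverse=True)
    chosen = sorted(ranked[:top_blocks])
    idx = [j for b in chosen
           for j in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = math.sqrt(len(q))
    scores = [_dot(q, keys[j]) / scale for j in idx]
    return _weighted_sum(scores, [values[j] for j in idx])

def keys_touched(n_tokens, block_size, top_blocks):
    """Key vectors read per decode step: one summary per block plus the selected tokens."""
    n_blocks = math.ceil(n_tokens / block_size)
    return n_blocks + min(top_blocks, n_blocks) * block_size
```

With these hypothetical settings, `keys_touched(1_000_000, 1000, 16)` counts 17,000 key vectors per decode step instead of the 1,000,000 that dense attention reads. The figure only illustrates why sparse decoding can be cheaper at long contexts; it is not a reconstruction of the 15.6x claim.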

Key Technical Details:

Sparse Attention Mechanism: The M3 model uses a custom sub-quadratic framework, meaning the cost of attention grows more slowly than the square of the sequence length. Standard attention scales quadratically, which is what makes very long sequences expensive to process.
Parameter Activation: The foundational backbone has 229.9 billion total parameters, but only 9.8 billion (about 4.3%) are activated per token. This keeps per-token compute far below that of a dense model of the same size.
Mixture-of-Experts (MoE) Decoder: The M2 series relies on a sparse MoE decoder-only Transformer layout, which is a common architecture in state-of-the-art large language models (LLMs).
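The parameter figures above can be checked with simple arithmetic, and the idea of activating only some experts per token can be shown with a minimal top-k router. This is a sketch under stated assumptions: the router function, the number of experts, and the value of `k` are hypothetical and are not taken from MiniMax's report; only the two parameter counts come from the article.

```python
import math

# Figures quoted in the article (billions of parameters)
TOTAL_PARAMS_B = 229.9
ACTIVE_PARAMS_B = 9.8

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"Active per token: {active_fraction:.1%}")  # Active per token: 4.3%

def route_top_k(router_logits, k):
    """Hypothetical top-k MoE routing for one token.

    Picks the k experts with the highest router logits and returns
    (expert_index, weight) pairs, with weights given by a softmax
    over the selected logits only.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

# Example: 4 experts (an assumed count), 2 active per token
picks = route_top_k([0.1, 2.0, -1.0, 2.0], k=2)
```

Only the experts returned by the router run for that token, which is why the active parameter count can be a small fraction of the total.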

Adina Yakup of Hugging Face noted on X, "Beyond the benchmarks, they’ve done some really solid work on MoE efficiency and agent-oriented design. Excited to see where M3 goes next!"

Key Takeaways

Performance Boost: According to MiniMax, the M3 model's sparse attention mechanism can speed up decoding by up to 15.6 times at one million tokens, which suits applications that process very long text sequences.
Efficiency and Scalability: By activating only 9.8 billion of its 229.9 billion parameters per token, the architecture keeps per-token compute low despite the large total parameter count.
Open Source Insights: The technical report on the M2 series details MiniMax's work on MoE efficiency and agent-oriented design, which enterprises can draw on when building their own models.

MiniMax is positioning the upcoming M3 model as a step forward in long-context decoding. If the reported speed-up holds in practice, its sparse attention mechanism and sparse parameter activation would make it a compelling option for businesses looking to deploy AI agents at scale.

Original Sources

MiniMax teases M3 model with new sparse attention mechanism, 15.6X long-context response speed boost

venturebeat.com · @venturebeat · 27 May 2026

https://venturebeat.com/technology/minimax-teases-upcoming-m3-model-with-new-sparse-attention-mechanism-and-15-6x-response-speed-boost

About the author

Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.