Jakiro: Enhancing Speculative Decoding with Decoupled Multi-Head via MoE for Faster and More Accurate Inference


The Engineer

14 Feb 2025 · 3 min read

Jakiro targets a weakness of speculative decoding, correlated draft candidates, by decoupling its multiple prediction heads with a Mixture of Experts (MoE), aiming to raise both prediction accuracy and inference speed for large models.

Speculative decoding (SD) is a technique for accelerating the inference of large language models. A smaller, faster draft model predicts several tokens ahead, and the larger target model then verifies them. This can cut inference time significantly, but the draft model's limited capacity often makes tree-based sampling, in which several candidates are generated at each step, necessary to keep prediction accuracy high.
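To make the draft-then-verify loop concrete, here is a minimal sketch of greedy speculative decoding. It is not Jakiro's implementation: the names `speculative_step`, `generate`, `greedy_decode`, `toy_target` and `toy_draft` are hypothetical, the "models" are toy functions over integer tokens, and the per-position verification loop stands in for what would be a single batched pass of the target model.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """Draft k tokens, keep the longest run the target agrees with, then add one target token."""
    # Draft phase: the cheap model proposes k tokens one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Verify phase: a real system scores all k positions in one batched target pass;
    # this toy version calls the target once per position.
    accepted: List[int] = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(token)
    accepted.append(target(prefix + accepted))  # every draft accepted: one bonus token
    return accepted


def generate(prefix: List[int], draft: Model, target: Model, k: int, max_new: int) -> List[int]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]


def greedy_decode(prefix: List[int], target: Model, max_new: int) -> List[int]:
    """Reference: plain greedy decoding with the target model only."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target(out))
    return out


def toy_target(ctx: List[int]) -> int:
    # Hypothetical "large" model: a deterministic rule over the context.
    return (sum(ctx) + len(ctx)) % 7


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except when len(ctx) % 4 == 0.
    guess = toy_target(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 7
```

Because every kept token is one the target would have chosen itself, `generate([1, 2, 3], toy_draft, toy_target, k=4, max_new=12)` returns the same sequence as `greedy_decode([1, 2, 3], toy_target, 12)`. The draft model only changes how many target passes are needed, not the output.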

Key Limitation: Correlated Candidates

One significant limitation of current speculative decoding methods is that the multiple candidates generated at each step are all derived from the same representation. The candidates are therefore correlated and lack diversity, which reduces overall effectiveness. To address this, the authors introduce Jakiro, a novel approach that uses a Mixture of Experts (MoE) to generate diverse, independent predictions.

Introducing Jakiro

Jakiro introduces several key innovations to enhance speculative decoding:

Decoupled Multi-Head via MoE: Instead of relying on a single draft model, Jakiro uses multiple independent experts within the MoE framework. Each expert generates its own predictions, which decouples the correlations among candidates and increases their diversity.
Hybrid Inference Strategy: The method combines autoregressive decoding for the initial tokens with parallel decoding for subsequent stages. The initial tokens are generated one at a time for accuracy, while later tokens are generated in parallel for speed.
Contrastive Mechanism in Features: To further improve accuracy, Jakiro applies a contrastive mechanism during the parallel decoding phase. It refines predictions by contrasting different feature representations, yielding more accurate and more diverse token sequences.
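The difference between reading several candidates off one representation and letting separate experts each propose their own can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the class `ToyMoEDraftHead`, its random weights, the `tanh` expert transform and the top-k router are hypothetical and do not reproduce Jakiro's actual architecture.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def random_matrix(rng: random.Random, rows: int, cols: int) -> Matrix:
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def top_k(scores: Vector, k: int) -> List[int]:
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


class ToyMoEDraftHead:
    """Hypothetical draft head with random weights, for illustration only."""

    def __init__(self, hidden: int, vocab: int, num_experts: int, seed: int = 0):
        rng = random.Random(seed)
        self.router = random_matrix(rng, num_experts, hidden)  # scores experts for a context
        self.experts = [random_matrix(rng, hidden, hidden) for _ in range(num_experts)]
        self.lm_head = random_matrix(rng, vocab, hidden)  # shared projection to vocabulary

    def shared_candidates(self, h: Vector, k: int) -> List[int]:
        """Baseline: all k candidates come from the same representation h."""
        return top_k(matvec(self.lm_head, h), k)

    def expert_candidates(self, h: Vector, k: int) -> Tuple[List[int], List[int]]:
        """Route to k experts; each builds its own representation and proposes one token."""
        chosen = top_k(matvec(self.router, h), k)
        candidates = []
        for e in chosen:
            h_e = [math.tanh(x) for x in matvec(self.experts[e], h)]  # expert-specific features
            candidates.append(top_k(matvec(self.lm_head, h_e), 1)[0])
        return chosen, candidates
```

In `shared_candidates` the k tokens are simply the k highest logits of one vector, so they rise and fall together. In `expert_candidates` each token comes from a different transformed representation. Two experts can still propose the same token, so a draft-tree builder using this sketch would need to deduplicate.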

Implementation Details

The architecture of Jakiro is designed to be modular and flexible:

Mixture of Experts (MoE): Each expert in the MoE framework is a smaller model trained to generate token predictions. Experts are selected dynamically according to their suitability for the current context, so that the most appropriate ones are used at each step.
Hybrid Decoding: The initial tokens are generated with autoregressive decoding, which favors accuracy for the first few tokens. Subsequent tokens are then generated in parallel, using the parallelism of modern hardware to speed up the process.
Contrastive Feature Refinement: During the parallel decoding phase, the model applies a contrastive mechanism to refine predictions. Feature representations from different experts are compared and adjusted to improve overall accuracy.
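The hybrid schedule and the contrastive step can be sketched together as follows. This is a rough illustration under stated assumptions, not the authors' method: the article does not give the contrast formula, so the extrapolation `(1 + alpha) * primary - alpha * secondary` is borrowed from contrastive-decoding-style scoring, and `hybrid_draft`, `contrast_features`, `toy_ar_next`, `toy_expert_features` and `toy_parallel_next` are hypothetical names with toy behavior.

```python
import math
from typing import Callable, List

Vector = List[float]
VOCAB = 5  # toy vocabulary size


def contrast_features(primary: Vector, secondary: Vector, alpha: float) -> Vector:
    """Assumed contrast rule: push `primary` away from `secondary`."""
    return [(1.0 + alpha) * p - alpha * s for p, s in zip(primary, secondary)]


def argmax(scores: Vector) -> int:
    return max(range(len(scores)), key=scores.__getitem__)


def hybrid_draft(
    prefix: List[int],
    ar_next: Callable[[List[int]], int],
    parallel_next: Callable[[List[int], int], List[int]],
    n_ar: int,
    n_parallel: int,
) -> List[int]:
    """Draft n_ar tokens one by one, then n_parallel more tokens from a single call."""
    drafted: List[int] = []
    for _ in range(n_ar):  # autoregressive phase: each token sees the previous ones
        drafted.append(ar_next(prefix + drafted))
    # Parallel phase: all remaining positions are predicted from the same context.
    drafted.extend(parallel_next(prefix + drafted, n_parallel))
    return drafted


def toy_ar_next(ctx: List[int]) -> int:
    # Hypothetical autoregressive draft step.
    return (sum(ctx) + 1) % VOCAB


def toy_expert_features(ctx: List[int], position: int, expert: int) -> Vector:
    # Hypothetical expert features, sized VOCAB so they can double as logits.
    base = sum(ctx) + position
    return [math.cos(base + expert + i) for i in range(VOCAB)]


def toy_parallel_next(ctx: List[int], n: int, alpha: float = 0.5) -> List[int]:
    tokens = []
    for pos in range(n):  # positions do not depend on each other, so they could be batched
        a = toy_expert_features(ctx, pos, expert=0)
        b = toy_expert_features(ctx, pos, expert=1)
        tokens.append(argmax(contrast_features(a, b, alpha)))
    return tokens
```

Calling `hybrid_draft([1, 2], toy_ar_next, toy_parallel_next, n_ar=2, n_parallel=3)` returns five drafted tokens: the first two depend on each other, while the last three are produced from one shared context. With `alpha=0` the contrast step leaves the primary features unchanged.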

Experimental Results

The researchers ran extensive experiments across a range of large language models to validate Jakiro. They report gains on two fronts:

Prediction Accuracy: Jakiro significantly boosts prediction accuracy compared to traditional speculative decoding methods, especially in scenarios where diverse predictions are crucial.
Inference Speedups: The hybrid inference strategy and parallel decoding lead to substantial speedups, which makes Jakiro a candidate for real-world deployment.
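Why better draft accuracy translates into speed can be seen from a standard back-of-the-envelope model of speculative decoding. It is not a Jakiro result. It assumes that each drafted token is accepted independently with the same probability `alpha` and that `k` tokens are drafted per step. The function name `expected_tokens_per_pass` is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model pass: (1 - alpha**(k + 1)) / (1 - alpha).

    Counts the accepted draft tokens plus the one token the target always contributes.
    Assumes independent acceptance with probability alpha for each of the k drafts.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # limit of the geometric sum when every draft is accepted
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

Under these assumptions, raising the acceptance probability increases the number of tokens obtained per expensive target pass, which is the lever that more accurate and more diverse drafts pull on. The model ignores the cost of running the draft itself.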

Conclusion

Jakiro advances speculative decoding by addressing its key limitation, correlated candidates, with a Mixture of Experts (MoE). By combining autoregressive and parallel decoding with contrastive feature refinement, it achieves both higher prediction accuracy and faster inference. This makes it a promising approach for accelerating large language model inference.