SpecBundle & SpecForge v0.2: Advancing Speculative Decoding for Production-Grade LLMs

The Engineer

24 Dec 2025

SpecBundle and SpecForge v0.2 introduce production-ready checkpoints and enhanced speculative decoding techniques, boosting the performance and practicality of large language models in real-world applications.

The SpecForge team, in collaboration with Ant Group, Meituan, Nex-AGI, and EigenAI, has released SpecBundle (Phase 1) and SpecForge v0.2 to address key challenges in speculative decoding for large language models (LLMs). These updates aim to improve the availability, performance, and production readiness of speculative decoding techniques.

What Changed Technically?

SpecBundle (Phase 1)

Production-Grade Checkpoints: SpecBundle includes a collection of EAGLE3 draft-model checkpoints trained on large-scale datasets and intended for real-world deployment rather than research demos.
Instruct-Tuned Models: Phase 1 focuses on instruct-tuned models, the variants typically deployed for chat and instruction-following workloads.

SpecForge v0.2

Extensive Refactoring: The framework has undergone significant refactoring to enhance usability and maintainability.
Multi-Backend Support: SpecForge now supports multiple execution backends, including CPU, GPU, and specialized hardware accelerators, improving scalability and performance across different environments.
Enhanced Usability: User-friendly APIs and better documentation make it easier for developers to integrate speculative decoding into their workflows.

Why It Matters

Technical Advancements

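Latency Reduction: Speculative decoding uses a lightweight draft model to propose several tokens ahead, which a stronger target model then verifies together rather than generating them one at a time. Because only tokens consistent with the target model's own predictions are kept, this approach can significantly reduce decoding latency without sacrificing output quality.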
Token Acceptance Rate: EAGLE3, the state-of-the-art (SOTA) method used in SpecBundle, combines strong theoretical guarantees with empirical gains in token acceptance rate and end-to-end speedup.
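
The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of greedy speculative decoding in general, not SpecForge's or EAGLE3's implementation: the names `speculative_decode`, `toy_target`, `toy_draft`, and `greedy_decode` are hypothetical, the "models" are toy functions that map a token prefix to a next token, and verification is done position by position where a real system would score all drafted positions in one batched forward pass of the target model.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, num_draft: int = 4) -> List[int]:
    """Greedy speculative decoding; output equals plain greedy decoding with `target`."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. Draft: the cheap model proposes a short run of tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(num_draft, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. Verify: compare each drafted token with the target's own choice.
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if tok != expected:
                # First mismatch: keep the accepted prefix plus the target's token,
                # and discard the remaining drafted tokens.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every drafted token was accepted; the target adds one bonus token
            # if there is still room.
            tokens.extend(proposal)
            if len(tokens) < goal:
                tokens.append(target(tokens))
    return tokens


def toy_target(prefix: List[int]) -> int:
    # Toy "strong" model: a deterministic function of the last token.
    return (prefix[-1] * 3 + 1) % 7


def toy_draft(prefix: List[int]) -> int:
    # Toy "weak" model: agrees with the target except after token 5.
    guess = toy_target(prefix)
    return guess if prefix[-1] != 5 else (guess + 1) % 7


def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    # Baseline: one model call per generated token.
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [2], 12))
    print(greedy_decode(toy_target, [2], 12))
```

Both calls print the same sequence, which is the point: the draft model only changes how many target passes are needed, not what is generated. With sampling instead of greedy decoding, the equality check is replaced by a probabilistic accept/reject rule that preserves the target model's output distribution.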

Industry Collaboration

Diverse Contributions: Collaboration with industry partners helps ensure that SpecBundle and SpecForge are optimized for a wide range of use cases and deployment environments.
Community Support: By making these tools open-source, the project encourages further contributions and innovations from the broader community.

Existing Challenges

Despite the potential benefits, speculative decoding has faced several challenges:

Lack of Production-Ready Tooling:
- Most existing implementations are research prototypes that lack robust system-level optimization.
- These tools often struggle to support diverse model architectures and scales used in modern LLMs.
Limited Availability of High-Quality Draft Models:
- Effective speculative decoding relies on strong draft models, which are scarce in the open community.
- Publicly available EAGLE3 checkpoints are primarily from the original authors, limiting broader adoption.

Implementation Details

EAGLE3 Model Architecture: EAGLE3 is designed to balance computational efficiency and accuracy. As in speculative decoding generally, a lightweight draft model generates token proposals, which a more powerful target model then verifies.
Training Process: The instruct-tuned models in SpecBundle are trained on large-scale datasets so that they handle a wide range of tasks and contexts.
Performance Benchmarks:
- Latency: Initial benchmarks show up to a 50% reduction in decoding latency compared with standard autoregressive decoding.
- Token Acceptance Rate: EAGLE3 models achieve an average token acceptance rate of 85%, significantly higher than earlier speculative decoding techniques.
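
To see how an acceptance rate relates to latency, a back-of-the-envelope model helps. The sketch below uses the standard analysis of chain-style speculative decoding, which assumes each drafted token is accepted independently with probability `alpha`. That independence assumption, the function names, and the `draft_cost` parameter are illustrative assumptions, not figures or APIs from SpecBundle; EAGLE3 drafts a tree of candidates and benchmarks may define acceptance rate differently, so the result is a rough intuition rather than a prediction.

```python
def expected_tokens_per_step(alpha: float, num_draft: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes each drafted token is accepted independently with probability
    `alpha`, and that the target always contributes one token of its own
    (either the correction at the first rejection or a bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if num_draft < 0:
        raise ValueError("num_draft must be non-negative")
    if alpha == 1.0:
        return float(num_draft + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^num_draft
    return (1.0 - alpha ** (num_draft + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, num_draft: int, draft_cost: float) -> float:
    """Rough speedup over plain autoregressive decoding.

    `draft_cost` is the (hypothetical) cost of one draft step relative to one
    target forward pass; each iteration costs num_draft draft steps plus one
    target pass.
    """
    cost_per_step = num_draft * draft_cost + 1.0
    return expected_tokens_per_step(alpha, num_draft) / cost_per_step


if __name__ == "__main__":
    print(expected_tokens_per_step(0.85, 4))
    print(estimated_speedup(0.85, 4, draft_cost=0.05))
```

Under these assumptions, an acceptance probability of 0.85 with four drafted tokens yields roughly 3.7 tokens per target pass. The end-to-end gain is smaller once the draft model's own cost is included, which is why real speedups depend on both acceptance rate and draft overhead.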

Conclusion

SpecBundle and SpecForge v0.2 represent significant strides in making speculative decoding a viable and practical solution for production-grade LLMs. By addressing the key challenges of tooling and model availability, these updates pave the way for broader adoption and innovation in the field.