Prefill-as-a-Service: Cross-Datacenter KVCache Transport for Hybrid Attention Models

The Engineer

20 Apr 2026

PrfaaS extends prefill-decode (PD) disaggregation across datacenters: the smaller KVCache of hybrid attention models can be shipped over commodity Ethernet, so prefill and decode no longer have to share a high-bandwidth network.

In large language model (LLM) serving, prefill-decode (PD) disaggregation has become standard practice. It separates the compute-intensive prefill phase from the memory-bandwidth-intensive decode phase so that each can be optimized independently. Traditional PD architectures, however, need a high-bandwidth network to move large KVCache data between the two phases, which confines a deployment to a single datacenter.

A team of researchers from Moonshot AI and Tsinghua University has introduced Prefill-as-a-Service (PrfaaS), a novel architecture that enables cross-datacenter PD serving. PrfaaS selectively offloads long-context prefill tasks to specialized compute-dense clusters and transfers the resulting KVCache over commodity Ethernet to local decode clusters. This design addresses several challenges, including bursty workloads, skewed request lengths, uneven prefix cache distribution, and fluctuating inter-cluster bandwidth.

Key Technical Changes and Their Implications

- Hybrid Attention Models: Recent advancements in hybrid attention architectures have significantly reduced the size of KVCache data. Unlike conventional dense-attention models, which generate large amounts of KVCache traffic during prefill, hybrid models produce much smaller KVCache outputs. This reduction is crucial because it makes cross-datacenter transport feasible.
- Selective Offloading: PrfaaS does not simply externalize all prefill tasks to remote clusters. Instead, it selectively offloads long-context prefill tasks, which are more compute-intensive and generate larger KVCache data. Shorter contexts can be handled locally, reducing the overall bandwidth requirements and improving efficiency.
- Bandwidth-Aware Scheduling: To manage the variability in inter-cluster bandwidth, PrfaaS employs a bandwidth-aware scheduling mechanism. This ensures that prefill tasks are assigned to clusters based on current network conditions, minimizing congestion and maximizing resource utilization.
- Cache-Aware Request Placement: The system also considers the distribution of prefix caches when placing requests. By intelligently routing requests to clusters with relevant cache data, PrfaaS reduces the need for redundant KVCache transfers and improves overall performance.
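
To make the three placement ideas above concrete, here is a minimal Python sketch of how they might combine into a single routing decision. It is not the PrfaaS scheduler. The `PrefillCluster` fields, the `choose_prefill_cluster` function, the 32,000-token threshold, and the 2-second transfer budget are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PrefillCluster:
    name: str
    is_remote: bool
    cached_prefix_tokens: int  # tokens of this request's prefix already cached here
    link_gbps: float           # currently measured bandwidth to the decode cluster
    link_backlog_bytes: int    # KVCache bytes already queued on that link


def transfer_seconds(cluster: PrefillCluster, kv_bytes: int) -> float:
    """Estimated time to ship kv_bytes to the decode cluster, including queued traffic."""
    if not cluster.is_remote:
        return 0.0  # local prefill needs no cross-datacenter transfer
    if cluster.link_gbps <= 0:
        return float("inf")  # link is down or unmeasured
    bytes_per_second = cluster.link_gbps * 1e9 / 8
    return (cluster.link_backlog_bytes + kv_bytes) / bytes_per_second


def choose_prefill_cluster(prompt_tokens, kv_bytes, local, remotes,
                           offload_min_tokens=32_000, max_transfer_s=2.0):
    """Pick a prefill cluster; thresholds are illustrative assumptions."""
    # Selective offloading: if little uncached work remains, prefill locally.
    uncached_locally = prompt_tokens - local.cached_prefix_tokens
    if uncached_locally < offload_min_tokens:
        return local

    # Bandwidth-aware: keep only remote clusters whose link can deliver
    # the KVCache within the transfer budget under current conditions.
    candidates = [c for c in remotes
                  if transfer_seconds(c, kv_bytes) <= max_transfer_s]
    if not candidates:
        return local

    # Cache-aware: prefer the cluster holding the longest cached prefix,
    # breaking ties in favour of the faster transfer.
    return max(candidates,
               key=lambda c: (c.cached_prefix_tokens,
                              -transfer_seconds(c, kv_bytes)))
```

In this sketch a short prompt never leaves the local cluster, a long prompt goes to the remote cluster with the best prefix reuse among those with a healthy link, and the request falls back to local prefill when every link is congested. A production scheduler would presumably also weigh queueing delay and compute load, which are left out here.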

Implementation Details

Architecture:
- Prefill Clusters: These are compute-dense clusters optimized for handling long-context prefill tasks.
- Decode Clusters: These clusters focus on the decode phase and are connected to local storage for KVCache data.
- Cross-Datacenter Network: Commodity Ethernet is used to transfer KVCache data between prefill and decode clusters, eliminating the need for expensive high-bandwidth networks.
KVCache Efficiency:
- Hybrid attention models reduce the size of KVCache data by up to several orders of magnitude compared to dense-attention models.
- This efficiency is achieved through a combination of sparse attention mechanisms and optimized cache management techniques.
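
A back-of-the-envelope calculation shows how the cache shrinks when fewer layers keep a full-length cache. The sketch below assumes a hybrid model in which only some layers keep a cache that grows with context length, and it ignores any fixed-size state held by the remaining layers. The layer counts, head counts, and context length are hypothetical and do not describe the model in the case study.

```python
def kv_cache_bytes(tokens, full_attention_layers, kv_heads, head_dim,
                   bytes_per_value=2):
    """Size of a full-length KVCache for one request.

    The factor of 2 covers keys and values; bytes_per_value=2 assumes
    16-bit storage. Only layers that cache every token are counted.
    """
    return 2 * tokens * full_attention_layers * kv_heads * head_dim * bytes_per_value


TOKENS = 128_000  # hypothetical long-context prompt

# Hypothetical dense model: all 64 layers keep a full-length cache.
dense = kv_cache_bytes(TOKENS, full_attention_layers=64, kv_heads=8, head_dim=128)

# Hypothetical hybrid model: only 16 of the 64 layers keep one.
hybrid = kv_cache_bytes(TOKENS, full_attention_layers=16, kv_heads=8, head_dim=128)

print(f"dense:  {dense / 1e9:.2f} GB")
print(f"hybrid: {hybrid / 1e9:.2f} GB")
print(f"reduction: {dense / hybrid:.0f}x")
```

With these made-up numbers the reduction simply equals the ratio of caching layers. Real hybrid models may shrink the cache further through other techniques, so treat the output as an illustration of the scaling, not as a measurement.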
Performance Benchmarks:
- In a case study using an internal 1T-parameter hybrid model, the heterogeneous PrfaaS deployment achieved 54% higher serving throughput than a homogeneous PD setup.
- It also outperformed a naive heterogeneous baseline by 32%, while consuming only modest cross-datacenter bandwidth.
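
Whether cross-datacenter bandwidth stays "modest" depends on how many requests are offloaded and how large each KVCache is. The following sketch shows that arithmetic under stated assumptions. The function names are hypothetical, and the request rate, cache size, and link speed are invented inputs, not figures from the case study.

```python
def sustained_gbps(offloaded_requests_per_second, kv_bytes_per_request):
    """Average cross-datacenter bandwidth needed to keep up with offloaded prefills."""
    return offloaded_requests_per_second * kv_bytes_per_request * 8 / 1e9


def single_transfer_seconds(kv_bytes, link_gbps):
    """Time to move one request's KVCache over an otherwise idle link."""
    return kv_bytes * 8 / (link_gbps * 1e9)


# Hypothetical workload: one offloaded long-context request every two seconds,
# each producing a 2 GB KVCache, over a 100 Gbps Ethernet link.
needed = sustained_gbps(0.5, 2_000_000_000)
latency = single_transfer_seconds(2_000_000_000, 100)

print(f"sustained bandwidth: {needed:.1f} Gbps")
print(f"per-request transfer: {latency:.2f} s")
```

The same two functions can be rerun with a dense-attention cache size to see how quickly the required bandwidth outgrows a commodity link.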

Why This Matters for Practitioners

For practitioners and system architects, the introduction of PrfaaS offers several key benefits:

- Resource Elasticity: By decoupling prefill and decode phases, organizations can independently scale these components across different datacenters. This flexibility is crucial for managing varying workloads and optimizing resource utilization.
- Cost Efficiency: Using commodity Ethernet for cross-datacenter communication reduces the need for specialized high-bandwidth networks, lowering infrastructure costs.
- Improved Throughput: The combination of model-side KV efficiency and system-side optimizations results in higher serving throughput, which can translate to better user experiences and more efficient use of computational resources.

Conclusion

PrfaaS shows that PD disaggregation need not stop at the datacenter boundary. By pairing the smaller KVCache of hybrid attention models with selective offloading and bandwidth- and cache-aware scheduling, it lets prefill and decode capacity sit in different datacenters connected by commodity Ethernet. For organizations looking to scale their LLM serving capabilities, it is a promising option.