
PrfaaS makes cross-datacenter KVCache transport efficient, so hybrid-attention models can be served with prefill and decode in separate datacenters without a high-bandwidth network between them.
Prefill-decode (PD) disaggregation has become standard practice in large language model (LLM) serving. It separates the compute-intensive prefill phase from the memory-bandwidth-intensive decode phase so that each can be optimized independently. Traditional PD architectures, however, need a high-bandwidth network to move large KVCache data between the two phases, which confines them to a single datacenter.
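To make the bandwidth constraint concrete, here is a rough sizing sketch. It uses the standard KVCache size formula for dense attention (keys plus values, per layer, per KV head). The model shape and link speeds are hypothetical values chosen only for illustration; they are not figures from the PrfaaS work.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of one request's KVCache under standard (dense) attention.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit storage (an assumption, e.g. FP16/BF16).
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def transfer_seconds(size_bytes, link_gbps):
    """Time to move size_bytes over a link of link_gbps gigabits per second."""
    return size_bytes * 8 / (link_gbps * 1e9)


if __name__ == "__main__":
    # Hypothetical model shape and context length, for illustration only.
    size = kv_cache_bytes(num_tokens=128_000, num_layers=64, num_kv_heads=8, head_dim=128)
    print(f"KVCache size: {size / 2**30:.2f} GiB")

    # Hypothetical link speeds: a fast in-datacenter fabric vs. a slower inter-cluster link.
    for name, gbps in [("in-datacenter fabric", 400), ("inter-cluster Ethernet", 10)]:
        print(f"{name} ({gbps} Gbps): {transfer_seconds(size, gbps):.2f} s")
```

The point of the sketch is the scaling: transfer time grows linearly with KVCache size and inversely with link speed, so any reduction in KVCache size per token translates directly into a lower bandwidth requirement.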
A team of researchers from Moonshot AI and Tsinghua University has introduced Prefill-as-a-Service (PrfaaS), a novel architecture that enables cross-datacenter PD serving. PrfaaS selectively offloads long-context prefill tasks to specialized compute-dense clusters and transfers the resulting KVCache over commodity Ethernet to local decode clusters. This design addresses several challenges, including bursty workloads, skewed request lengths, uneven prefix cache distribution, and fluctuating inter-cluster bandwidth.
Hybrid Attention Models: Recent advancements in hybrid attention architectures have significantly reduced the size of KVCache data. Unlike conventional dense-attention models, which generate large amounts of KVCache traffic during prefill, hybrid models produce much smaller KVCache outputs. This reduction is crucial because it makes cross-datacenter transport feasible.
Selective Offloading: PrfaaS does not simply externalize all prefill tasks to remote clusters. Instead, it selectively offloads long-context prefill tasks, which are more compute-intensive and generate larger KVCache data. Shorter contexts can be handled locally, reducing the overall bandwidth requirements and improving efficiency.
Bandwidth-Aware Scheduling: To cope with fluctuating inter-cluster bandwidth, PrfaaS uses a bandwidth-aware scheduling mechanism. Prefill tasks are assigned to clusters according to current network conditions, which limits congestion on the inter-cluster link and keeps resources utilized.
Cache-Aware Request Placement: The system also considers the distribution of prefix caches when placing requests. By intelligently routing requests to clusters with relevant cache data, PrfaaS reduces the need for redundant KVCache transfers and improves overall performance.
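The selective offloading, bandwidth-aware scheduling, and cache-aware placement ideas above can be combined into a single routing decision. The following is a minimal sketch under stated assumptions, not the PrfaaS scheduler itself: the `Request` fields, the function `choose_prefill_site`, and every threshold and parameter are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int        # total prompt length in tokens
    local_cached_tokens: int  # prefix length already cached in the local decode cluster


def choose_prefill_site(req, *, offload_threshold_tokens, kv_bytes_per_token,
                        link_bytes_per_s, queued_bytes, max_transfer_s):
    """Return "remote" or "local" as the prefill site for one request.

    All parameters are illustrative assumptions:
      offload_threshold_tokens: minimum uncached length worth offloading
      kv_bytes_per_token:       KVCache bytes produced per prompt token
      link_bytes_per_s:         currently measured inter-cluster bandwidth
      queued_bytes:             KVCache bytes already waiting on the link
      max_transfer_s:           transfer-time budget for this request
    """
    # Cache-aware: only tokens missing from the local prefix cache need
    # to be prefilled, and only their KVCache would need to be shipped.
    uncached = max(req.prompt_tokens - req.local_cached_tokens, 0)

    # Selective offloading: short remaining work stays local.
    if uncached < offload_threshold_tokens:
        return "local"

    # Bandwidth-aware: estimate when this transfer would finish given
    # the current link speed and the backlog already queued on it.
    transfer_bytes = uncached * kv_bytes_per_token
    estimated_s = (queued_bytes + transfer_bytes) / link_bytes_per_s
    return "remote" if estimated_s <= max_transfer_s else "local"


if __name__ == "__main__":
    settings = dict(offload_threshold_tokens=8_000, kv_bytes_per_token=10_000,
                    link_bytes_per_s=1.25e9, max_transfer_s=2.0)
    long_req = Request(prompt_tokens=100_000, local_cached_tokens=0)
    print(choose_prefill_site(long_req, queued_bytes=0, **settings))
    print(choose_prefill_site(long_req, queued_bytes=5e9, **settings))
```

A fixed threshold and a single transfer-time budget are the simplest possible policy; a production scheduler would presumably also weigh prefill queueing delay on each cluster, which this sketch leaves out.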

Architecture:
KVCache Efficiency:
Performance Benchmarks:
For practitioners and system architects, the introduction of PrfaaS offers several key benefits:
PrfaaS removes the main constraint of traditional PD architectures, the need to keep prefill and decode inside one datacenter. By pairing the smaller KVCache of hybrid attention models with selective offloading and bandwidth- and cache-aware scheduling, it makes cross-datacenter deployments more flexible and cost-effective. Organizations looking to scale their LLM serving capacity may find it a promising option.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
20 April 2026