
Tensor R-Fork slashes large model weight loading times by enabling zero-copy transfers between nodes, boosting efficiency without disrupting ongoing inference tasks.
By The Engineer, Dec 10, 2025
We introduce Tensor R-Fork, a novel weight loading methodology that uses the efficient inter-node device-to-device interconnect to load tensors from a running SGLang instance into a new instance with zero-copy. This approach provides three key advantages: faster weight loading, lower local storage usage, and no disruption to running inference.
For instance, when applied to the DeepSeek-R1 model, loading time drops from several minutes to mere seconds, local disk and/or DRAM storage usage falls by ~600GB, and inference service quality remains stable during model transfers.
As large language models (LLMs) grow in size and complexity, the cold-start time for SGLang instances has become a critical bottleneck in production efficiency. Among the various phases of the cold-start process, weight loading is the most time-consuming task.
For example, loading weights from local disk typically takes several minutes, while loading from remote storage systems can take tens of minutes. As models keep growing, initialization and data transfer times will grow with them.
The most straightforward approach to optimize weight loading performance is to maximize the bottleneck bandwidth in the weight data flow. Here's a breakdown of commonly used model loading approaches and their associated bottlenecks:
Remote Storage Center:
Local Disk:

Can we exploit higher-bandwidth data flows for transferring tensors? Yes: inter-node device-to-device interconnects offer hundreds of gigabytes per second of throughput. The critical question is how to fully leverage this bandwidth for efficient weight loading in SGLang.
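The bottleneck argument above can be made concrete with a rough back-of-the-envelope sketch. It treats each loading path as a chain of hops and estimates a lower bound on load time as model size divided by the slowest hop's bandwidth. The `Hop` type, the hop names, and every bandwidth figure are illustrative assumptions rather than measurements from the article; only the ~600GB size echoes the figure quoted earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One stage in a weight-loading data path (hypothetical model)."""
    name: str
    gb_per_s: float  # assumed sustained bandwidth, gigabytes per second


def bottleneck(path: list[Hop]) -> Hop:
    """Return the slowest hop; it caps end-to-end throughput."""
    return min(path, key=lambda hop: hop.gb_per_s)


def load_seconds(size_gb: float, path: list[Hop]) -> float:
    """Lower bound on load time: size divided by bottleneck bandwidth."""
    return size_gb / bottleneck(path).gb_per_s


# Assumption: roughly the weight footprint implied by the ~600GB figure.
MODEL_GB = 600.0

# Assumption: placeholder bandwidths chosen only to illustrate the reasoning.
PATHS = {
    "remote storage": [
        Hop("storage network", 1.0),
        Hop("host DRAM", 50.0),
        Hop("PCIe to GPU", 25.0),
    ],
    "local disk": [
        Hop("NVMe read", 5.0),
        Hop("host DRAM", 50.0),
        Hop("PCIe to GPU", 25.0),
    ],
    "device-to-device": [
        Hop("inter-node interconnect", 200.0),
    ],
}

for name, path in PATHS.items():
    slowest = bottleneck(path)
    seconds = load_seconds(MODEL_GB, path)
    print(f"{name}: bottleneck = {slowest.name}, at least {seconds:.0f} s")
```

With these placeholder numbers, the storage-backed paths are limited by their slowest hop no matter how fast the remaining hops are, which is why a direct device-to-device path changes the estimate by orders of magnitude.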
The core concept of Tensor R-Fork is to use GPUDirect RDMA (Remote Direct Memory Access) to build a peer-to-peer (P2P) weight storage architecture. This approach addresses the limitations of traditional methods by:
Tensor R-Fork is implemented as a framework within SGLang, enabling seamless integration with existing workflows. Here are some key implementation details:
When applied to the DeepSeek-R1 model:
Tensor R-Fork optimizes weight loading for large models. By leveraging high-bandwidth inter-node device-to-device interconnects and zero-copy transfers, it reduces cold-start times and storage overhead, which makes it well suited to model deployment in production environments.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.