Accelerating Large Model Weight Loading with Tensor R-Fork

The Engineer

12 Dec 2025 · 4 min read

Tensor R-Fork slashes large model weight loading times by enabling zero-copy transfers between nodes, boosting efficiency without disrupting ongoing inference tasks.

Let Tensors Fly: Accelerating Large Model Weight Loading with R-Fork

By The Engineer, Dec 10, 2025

TL;DR

We introduce Tensor R-Fork, a novel weight-loading method that uses the efficient inter-node device-to-device interconnect to load tensors from a running SGLang instance into a new instance with zero copies. This approach provides three key advantages:

- Significantly faster weight loading
- No redundant copy of the model weights on local disk and/or in DRAM
- No disruption to running inference services

For instance, when applied to the DeepSeek-R1 model, loading time drops from several minutes to a few seconds, and local disk and/or DRAM usage drops by ~600GB. Inference service quality remains stable while the weights are being transferred.

Background

As large language models (LLMs) grow in size and complexity, the cold-start time for SGLang instances has become a critical bottleneck in production efficiency. Among the various phases of the cold-start process, weight loading is the most time-consuming task.

For example, loading weights from local disk typically takes several minutes, while loading from remote storage systems can take tens of minutes. As model sizes continue to grow, initialization and data-transfer times are expected to get longer still.

Optimizing Weight Loading Performance

The most straightforward approach to optimize weight loading performance is to maximize the bottleneck bandwidth in the weight data flow. Here's a breakdown of commonly used model loading approaches and their associated bottlenecks:

Remote Storage Center:
- Data Flow: Remote storage -> remote Ethernet NIC -> Ethernet -> local Ethernet NIC -> local DRAM -> local GPU memory
- Bottleneck: NVMe/Ethernet NIC

Local Disk:
- Data Flow: Disk -> DRAM -> GPU memory
- Bottleneck: NVMe

Local DRAM:
- Data Flow: DRAM -> GPU memory
- Bottleneck: PCIe

Can we exploit higher-bandwidth data flows for transferring tensors? Yes: inter-node device-to-device interconnects offer hundreds of gigabytes per second of throughput. The critical question is how to fully use that bandwidth for efficient weight loading in SGLang.
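To make the comparison concrete, here is a back-of-the-envelope sketch that treats each data flow as a chain of links and estimates load time from the slowest one. Every bandwidth figure below is an illustrative assumption rather than a measurement from this post, and the link names are made up for the example; plug in numbers from your own hardware.

```python
# Rough load-time estimate: a transfer can go no faster than its slowest hop.
# All bandwidths are ASSUMED, illustrative values in GB/s, not measurements.
ASSUMED_GB_PER_S = {
    "nvme": 3.0,                   # assumed effective NVMe read throughput
    "ethernet_nic": 12.5,          # assumed 100 Gb/s Ethernet NIC
    "pcie": 64.0,                  # assumed PCIe Gen5 x16 host-to-device link
    "device_interconnect": 200.0,  # assumed inter-node device-to-device fabric
}

# Links traversed by each loading approach (simplified from the data flows above).
PATHS = {
    "remote storage": ["nvme", "ethernet_nic", "pcie"],
    "local disk": ["nvme", "pcie"],
    "local DRAM": ["pcie"],
    "device-to-device": ["device_interconnect"],
}


def bottleneck(path):
    """Return the name of the slowest link on the path."""
    return min(path, key=lambda link: ASSUMED_GB_PER_S[link])


def load_seconds(size_gb, path):
    """Estimate transfer time in seconds, ignoring latency and protocol overhead."""
    return size_gb / ASSUMED_GB_PER_S[bottleneck(path)]


if __name__ == "__main__":
    model_size_gb = 600  # roughly the weight footprint this post cites
    for name, path in PATHS.items():
        seconds = load_seconds(model_size_gb, path)
        print(f"{name:>16}: bottleneck={bottleneck(path)}, ~{seconds:.0f} s")
```

The model is deliberately crude: it ignores parallel links (for example, several NICs or GPUs per node) and any overlap between stages, so treat the output as an order-of-magnitude guide only.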

Design of Tensor R-Fork

The core concept of Tensor R-Fork is to use GPUDirect RDMA (Remote Direct Memory Access) to build a peer-to-peer (P2P) weight storage architecture. This approach addresses the limitations of traditional methods by:

Eliminating Bottlenecks: By using high-bandwidth inter-node device-to-device interconnects, Tensor R-Fork bypasses the lower-bandwidth links (NVMe, Ethernet, PCIe) in the conventional data flows.
Zero-Copy Transfer: Tensors are transferred directly from the source GPU to the target GPU without intermediate storage in DRAM or local disk, reducing latency and storage overhead.
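The zero-copy idea can be illustrated with a CPU-side analogy. The sketch below is not RDMA and not SGLang code; it only uses Python's standard `socket` module to contrast a staged receive (data lands in temporary objects and is then copied into place) with a direct receive into a pre-allocated destination buffer. The function names are hypothetical.

```python
import socket


def recv_staged(sock, nbytes):
    """Staged path: bytes arrive in temporary chunks, then get copied into a new buffer."""
    chunks = []
    remaining = nbytes
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before all bytes arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return bytearray(b"".join(chunks))  # extra copy into the final buffer


def recv_direct(sock, dest):
    """Direct path: bytes are written straight into the caller's pre-allocated buffer."""
    view = memoryview(dest)
    received = 0
    while received < len(view):
        n = sock.recv_into(view[received:])
        if n == 0:
            raise ConnectionError("peer closed before all bytes arrived")
        received += n


def demo(payload):
    """Send the payload twice over a local socket pair and receive it both ways."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(payload)
        staged = recv_staged(receiver, len(payload))
        destination = bytearray(len(payload))  # stands in for the target's memory
        sender.sendall(payload)
        recv_direct(receiver, destination)
        return staged, destination
    finally:
        sender.close()
        receiver.close()
```

With GPUDirect RDMA the same principle applies one level down: the destination buffer is GPU memory, and the data does not pass through host DRAM on the way.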

Implementation Details

Tensor R-Fork is implemented as a framework within SGLang, enabling seamless integration with existing workflows. Here are some key implementation details:

P2P Interconnect: Utilizes high-speed interconnects such as InfiniBand or NVIDIA NVLink to achieve low-latency and high-bandwidth data transfer.
GPUDirect RDMA: Data moves directly between GPU memory regions on different nodes without being staged through the CPU or host memory, reducing overhead and improving performance.
Zero-Copy Mechanism: The framework ensures that tensors are loaded directly into the target GPU's memory, eliminating the need for intermediate storage.
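One way to picture how these pieces could fit together is the simulation below. It is a minimal sketch under stated assumptions, not SGLang's actual implementation: `TensorRegion`, `SourceInstance`, `read_into`, and `fork_weights` are hypothetical names, plain `bytearray` objects stand in for GPU memory, and an in-process copy stands in for a one-sided RDMA read.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TensorRegion:
    """Hypothetical descriptor a source instance publishes for one tensor."""
    name: str
    offset: int  # byte offset inside the source's registered memory
    nbytes: int


class SourceInstance:
    """Stands in for a running instance whose weights are already resident."""

    def __init__(self, weights: dict[str, bytes]):
        self.memory = bytearray()  # simulated registered GPU memory
        self.manifest: list[TensorRegion] = []
        for name, data in weights.items():
            self.manifest.append(TensorRegion(name, len(self.memory), len(data)))
            self.memory += data

    def read_into(self, region: TensorRegion, dest: memoryview) -> None:
        """Simulated one-sided read: bytes land directly in the caller's buffer."""
        src = memoryview(self.memory)[region.offset:region.offset + region.nbytes]
        dest[:] = src


def fork_weights(source: SourceInstance) -> dict[str, memoryview]:
    """Pull every tensor from the source into one pre-allocated destination buffer."""
    dest_memory = bytearray(len(source.memory))  # simulated memory on the new instance
    view = memoryview(dest_memory)
    tensors: dict[str, memoryview] = {}
    cursor = 0
    for region in source.manifest:
        target = view[cursor:cursor + region.nbytes]
        source.read_into(region, target)  # no intermediate disk or DRAM staging
        tensors[region.name] = target
        cursor += region.nbytes
    return tensors
```

For example, `fork_weights(SourceInstance({"w1": b"\x01\x02", "w2": b"\x03"}))` returns views over the new instance's buffer holding the same bytes as the source. In a real deployment the manifest exchange and the reads would go through an RDMA-capable transport, and the source would keep serving requests while the reads proceed.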

Performance Benchmarks

When applied to the DeepSeek-R1 model:

Loading Time: Reduced from several minutes to mere seconds.
Storage Usage: Local disk and/or DRAM storage usage reduced by ~600GB.
Inference Service Quality: Remains stable during weight transfers, so the source instance keeps serving requests undisturbed.

Conclusion

Tensor R-Fork is a significant advance in weight-loading performance for large models. By combining high-bandwidth inter-node device-to-device interconnects with zero-copy transfers, it cuts both cold-start time and storage overhead, making it a valuable tool for efficient model deployment in production environments.