DeepNVMe Enhancements for I/O Scaling in Deep Learning Applications

Tools & Engineering

The Engineer

19 Jun 2025

This summer's DeepNVMe upgrade slashes I/O bottlenecks further, adding support for new applications like FastPersist and SGLang while boosting performance with PCIe Gen5 SSDs and enhancing accessibility across different computing setups.

Introduction

In summer 2024, we introduced DeepNVMe as a suite of optimizations aimed at tackling I/O bottlenecks in deep learning (DL). By leveraging local NVMe SSDs, NVIDIA Magnum IO™ GPUDirect® Storage (GDS), and Linux Asynchronous I/O (AIO), DeepNVMe delivers significant speedups for I/O-bound DL workloads. This update brings improvements in three areas: application coverage now extends to FastPersist model checkpointing and SGLang inference, I/O performance scales with PCIe Gen5 NVMe SSDs, and usability improves through support for CPU-only environments, offset-based I/O operations, and tensor data type casting. These enhancements are available in DeepSpeed versions >= 0.17.1.
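
As a quick way to confirm that an environment meets the version requirement above, the following is a minimal sketch that uses only the Python standard library. The helper names (`parse_release`, `meets_minimum`, `installed_deepspeed_ok`) are hypothetical and not part of DeepSpeed; the only assumption about DeepSpeed itself is that it is installed under the distribution name `deepspeed`.

```python
import re
from importlib import metadata

MIN_DEEPSPEED = (0, 17, 1)  # minimum version stated in this post


def parse_release(version: str) -> tuple:
    """Return the leading numeric release components, e.g. '0.17.1+cu126' -> (0, 17, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version: str, minimum: tuple = MIN_DEEPSPEED) -> bool:
    """True if `version` is at least `minimum` (pre-release and local tags are ignored)."""
    release = parse_release(version)
    # Pad short versions so that "0.18" compares as (0, 18, 0).
    padded = release + (0,) * (len(minimum) - len(release))
    return padded >= minimum


def installed_deepspeed_ok() -> bool:
    """Check the installed DeepSpeed distribution; False if it is not installed."""
    try:
        return meets_minimum(metadata.version("deepspeed"))
    except metadata.PackageNotFoundError:
        return False
```

Calling `installed_deepspeed_ok()` at the top of a launch script gives an early, explicit failure point instead of a missing-feature error later on.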

Evaluation Environments

Our experiments were conducted on Azure ND-H200-v5 VMs with the following key software configurations:

Ubuntu: 24.04.2
PyTorch: 2.6.0
CUDA: 12.6
SGLang: 0.4.4.post4

Addressing I/O Bottlenecks in Deep Learning

We used DeepNVMe to develop two key features: FastPersist for model checkpointing and ZeRO-Inference for inference tasks, both targeting I/O bottlenecks in DL training and inference.

FastPersist: Faster Model Checkpoint Creation

Saving model checkpoints is crucial for model training but can be a significant bottleneck due to inefficiencies in existing approaches. FastPersist addresses these performance challenges through three key techniques:

DeepNVMe: Utilizes optimized I/O operations.
Data Parallelism: Distributes the workload across multiple devices.
Overlapping I/O and Computation: Ensures that I/O operations do not block computation.
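
The third technique, overlapping I/O with computation, can be illustrated with a toy loop. This is a sketch of the general idea using a standard-library background thread, not FastPersist's implementation: the function names, the integer "model state", and the one-write-in-flight policy are all assumptions made for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_blob(path: str, payload: bytes) -> int:
    """Blocking write; runs on the background I/O thread."""
    with open(path, "wb") as f:
        written = f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    return written


def train_with_async_checkpoints(num_steps: int, ckpt_dir: str) -> list:
    """Toy loop: each step's checkpoint is written while the next step computes."""
    paths = []
    pending = None  # future for the checkpoint write currently in flight
    state = 0
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        for step in range(num_steps):
            state += step  # stand-in for forward/backward/optimizer work
            snapshot = state.to_bytes(8, "little")  # immutable copy of the state
            if pending is not None:
                pending.result()  # keep one write in flight; surfaces I/O errors
            path = os.path.join(ckpt_dir, f"step{step}.bin")
            pending = io_pool.submit(write_blob, path, snapshot)
            paths.append(path)
        if pending is not None:
            pending.result()  # drain the final write before returning
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(len(train_with_async_checkpoints(4, tmp)), "checkpoints written")
```

The key design point is taking the snapshot before handing it to the writer, so the training loop can mutate its state while the previous checkpoint is still being flushed.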

Our goal was to demonstrate the impact of DeepNVMe in FastPersist. We conducted experiments on a single VM, combining available NVMe SSDs into a single RAID-0 volume to leverage aggregate read and write bandwidths. We tested both CPU bounce buffers (AIO) and NVIDIA GPUDirect Storage (GDS) for offloading tensors.

Performance Enhancements

Upgrading to PCIe Gen5 NVMe SSDs

One of the key improvements is the upgrade from PCIe Gen4 to Gen5 NVMe SSDs. This change significantly boosts I/O performance, which is critical for large-scale DL applications. Here are some of the benefits:

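Higher Bandwidth: PCIe Gen5 runs at 32 GT/s per lane, double Gen4's 16 GT/s, so a typical x4 NVMe SSD link offers roughly 16 GB/s of theoretical bandwidth per direction instead of roughly 8 GB/s.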
Lower Latency: Improved controller and interface designs reduce latency, making them more suitable for real-time applications.
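
The bandwidth difference between the two generations follows from the per-lane signaling rates in the PCIe specifications. The short calculation below is a sketch of that arithmetic; the x4 link width is an assumption (it is the common width for NVMe SSDs), and the results are theoretical ceilings before protocol overhead, not measured drive throughput.

```python
# Per-lane raw signaling rates in GT/s from the PCIe specifications.
TRANSFER_RATE_GT_S = {"gen4": 16.0, "gen5": 32.0}
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding used by Gen3 through Gen5


def link_bandwidth_gb_s(generation: str, lanes: int = 4) -> float:
    """Theoretical one-direction link bandwidth in GB/s, before protocol overhead."""
    gigabits_per_s = TRANSFER_RATE_GT_S[generation] * ENCODING_EFFICIENCY * lanes
    return gigabits_per_s / 8  # bits -> bytes


if __name__ == "__main__":
    for gen in ("gen4", "gen5"):
        print(f"{gen} x4: {link_bandwidth_gb_s(gen):.2f} GB/s")
```

For an x4 link this works out to about 7.88 GB/s for Gen4 and about 15.75 GB/s for Gen5, which is the ceiling any single SSD can approach on each generation.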

Expanded Application Coverage

DeepNVMe now supports a wider range of applications:

FastPersist Model Checkpointing: Enables faster and more efficient checkpoint creation.
SGLang Inference: Enhances inference performance by optimizing I/O operations.

Usability Improvements

To make DeepNVMe more accessible, we have expanded its usability in several ways:

CPU-Only Environments: Supports environments without GPUs, making it versatile for a broader range of use cases.
Offset-Based I/O Operations: Allows fine-grained control over I/O operations, which is useful for large datasets.
Tensor Data Type Casting: Enables seamless data type conversions, improving flexibility and performance.
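
To show what offset-based I/O means in practice, here is a minimal sketch using the POSIX-style `os.pwrite` and `os.pread` calls from the Python standard library. It is an analogue of the concept only and does not use or mirror the DeepNVMe API; `write_at` and `read_at` are hypothetical helper names.

```python
import os
import tempfile


def write_at(path: str, payload: bytes, offset: int) -> int:
    """Write `payload` at byte `offset`; returns the number of bytes written."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # pwrite takes an explicit offset, so no seek or shared file position is involved.
        return os.pwrite(fd, payload, offset)
    finally:
        os.close(fd)


def read_at(path: str, length: int, offset: int) -> bytes:
    """Read up to `length` bytes starting at byte `offset`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "packed.bin")
        write_at(path, b"AAAA", 0)      # first record at the start of the file
        write_at(path, b"BBBB", 4096)   # second record at a 4 KiB-aligned offset
        print(read_at(path, 4, 4096))   # b'BBBB'
```

Because each call carries its own offset, several records (for example, multiple tensors) can share one file and be read or written independently, without loading or rewriting the whole file.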

Benchmarks

To quantify the improvements, we conducted benchmarks on various DL workloads. Here are some key findings:

Checkpointing Time Reduction: FastPersist with DeepNVMe reduced checkpointing time by up to 70% compared to traditional methods.
Inference Latency Improvement: SGLang inference latency was reduced by 35%, making real-time applications more feasible.
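
Percentage reductions and speedup factors are easy to confuse, so it can help to convert between them. The helper below is a small illustrative sketch (the function name is hypothetical): a fractional time reduction r corresponds to a speedup of 1 / (1 - r).

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (0.70 for 70%) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    for label, r in (("checkpointing", 0.70), ("inference latency", 0.35)):
        print(f"{label}: {r:.0%} less time = {speedup_from_reduction(r):.2f}x faster")
```

Under this conversion, a 70% reduction in checkpointing time is about a 3.3x speedup, and a 35% reduction in latency is about a 1.5x speedup.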

Conclusion

This update extends DeepNVMe to FastPersist model checkpointing and SGLang inference, scales I/O performance with PCIe Gen5 NVMe SSDs, and improves usability through support for CPU-only environments, offset-based I/O operations, and tensor data type casting. Together, these changes address I/O bottlenecks in both training and inference workflows. These enhancements are available in DeepSpeed versions >= 0.17.1.