
This summer's DeepNVMe upgrade slashes I/O bottlenecks further, adding support for new applications like FastPersist and SGLang while boosting performance with PCIe Gen5 SSDs and enhancing accessibility across different computing setups.
In summer 2024, we introduced DeepNVMe as a suite of optimizations aimed at tackling I/O bottlenecks in deep learning (DL). By leveraging local NVMe SSDs, NVIDIA Magnum IO™ GPUDirect® Storage (GDS), and Linux Asynchronous I/O (AIO), DeepNVMe delivers significant speedups for I/O-bound DL workloads. This update brings several improvements: expanded application coverage with FastPersist model checkpointing and SGLang inference, higher I/O performance with PCIe Gen5 NVMe SSDs, and improved usability through support for CPU-only environments, offset-based I/O operations, and tensor data type casting. These enhancements are available in DeepSpeed versions >= 0.17.1.
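To make the idea of offset-based I/O concrete, here is a minimal sketch that packs several small "tensors" into one file and reads a single one back by byte offset. It uses Python's standard `os.pwrite` and `os.pread` on a temporary file, not DeepNVMe's own API; the tensor names, values, and layout are hypothetical and chosen only for illustration.

```python
import os
import struct
import tempfile

# Hypothetical data: three float32 "tensors" stored back to back in one file.
tensors = {
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, 5.0],
    "c": [6.0, 7.0, 8.0, 9.0],
}
FLOAT32_BYTES = 4


def pack(values):
    """Serialize a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack(raw):
    """Deserialize little-endian float32 bytes into a list of floats."""
    return list(struct.unpack(f"<{len(raw) // FLOAT32_BYTES}f", raw))


# Build an index of (byte offset, byte length) for each tensor.
offsets = {}
cursor = 0
for name, values in tensors.items():
    nbytes = len(values) * FLOAT32_BYTES
    offsets[name] = (cursor, nbytes)
    cursor += nbytes

fd, path = tempfile.mkstemp()
try:
    # Write each tensor at its own offset; no seek and no shared file cursor.
    for name, values in tensors.items():
        start, _ = offsets[name]
        os.pwrite(fd, pack(values), start)

    # Read back only tensor "b" without touching the rest of the file.
    start, nbytes = offsets["b"]
    restored_b = unpack(os.pread(fd, nbytes, start))
finally:
    os.close(fd)
    os.remove(path)

print(restored_b)
```

Because every read and write carries its own offset, many tensors can share one file and be accessed independently, which is the general benefit offset-based operations provide.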
Our experiments were conducted on Azure ND-H200-v5 VMs with the following key software configurations:
We used DeepNVMe to develop two key features: FastPersist for model checkpointing and ZeRO-Inference for inference tasks, both targeting I/O bottlenecks in DL training and inference.
Saving model checkpoints is crucial for model training but can be a significant bottleneck due to inefficiencies in existing approaches. FastPersist addresses these performance challenges through three key techniques:
Our goal was to demonstrate the impact of DeepNVMe in FastPersist. We conducted experiments on a single VM, combining available NVMe SSDs into a single RAID-0 volume to leverage aggregate read and write bandwidths. We tested both CPU bounce buffers (AIO) and NVIDIA GPUDirect Storage (GDS) for offloading tensors.
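A RAID-0 volume exposes aggregate bandwidth because consecutive stripes of the volume land on different drives. The sketch below shows the standard striping arithmetic; the drive count and stripe size are hypothetical parameters, not the configuration used in these experiments.

```python
def raid0_locate(offset, num_drives, stripe_bytes):
    """Map a byte offset in a RAID-0 volume to (drive index, offset on that drive)."""
    stripe_index = offset // stripe_bytes
    drive = stripe_index % num_drives
    # Each drive holds every num_drives-th stripe, stored contiguously.
    drive_offset = (stripe_index // num_drives) * stripe_bytes + offset % stripe_bytes
    return drive, drive_offset


def drives_touched(offset, length, num_drives, stripe_bytes):
    """Return the sorted drive indices that a request of `length` bytes spans."""
    first = offset // stripe_bytes
    last = (offset + length - 1) // stripe_bytes
    return sorted({stripe % num_drives for stripe in range(first, last + 1)})


# Hypothetical volume: 4 drives, 64 KiB stripes.
NUM_DRIVES = 4
STRIPE = 64 * 1024

print(raid0_locate(STRIPE, NUM_DRIVES, STRIPE))
print(drives_touched(0, 100, NUM_DRIVES, STRIPE))
print(drives_touched(0, 4 * STRIPE, NUM_DRIVES, STRIPE))
```

A small request touches a single drive, while a request spanning at least one full stripe per drive can be serviced by all drives at once. This is why large sequential transfers such as checkpoint writes are well placed to benefit from striping.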

One of the key improvements is the upgrade from PCIe Gen4 to Gen5 NVMe SSDs. This change significantly boosts I/O performance, which is critical for large-scale DL applications. Here are some of the benefits:
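As a rough guide to the headroom involved, the theoretical link bandwidth of each PCIe generation can be computed from its per-lane transfer rate (16 GT/s for Gen4, 32 GT/s for Gen5) and the 128b/130b line encoding both generations use. The sketch below assumes a x4 link, which is typical for NVMe SSDs but is not stated in this article. The figures are link ceilings, not measured results.

```python
# Per-lane raw transfer rates in transfers per second (one bit per transfer).
TRANSFERS_PER_SECOND = {"gen4": 16e9, "gen5": 32e9}

# PCIe Gen3 and later use 128b/130b encoding: 128 payload bits per 130 line bits.
ENCODING_EFFICIENCY = 128 / 130


def link_bandwidth_gb_per_s(generation, lanes=4):
    """Theoretical one-direction link bandwidth in GB/s (1 GB = 1e9 bytes).

    `lanes=4` is an assumption: most NVMe SSDs use a x4 link.
    """
    bits_per_second = TRANSFERS_PER_SECOND[generation] * ENCODING_EFFICIENCY * lanes
    return bits_per_second / 8 / 1e9


gen4 = link_bandwidth_gb_per_s("gen4")
gen5 = link_bandwidth_gb_per_s("gen5")
print(f"Gen4 x4: {gen4:.2f} GB/s, Gen5 x4: {gen5:.2f} GB/s, ratio: {gen5 / gen4:.1f}x")
```

Real drives deliver less than the link ceiling because of protocol overhead and flash limits, so this doubling is an upper bound on the per-drive gain rather than a prediction.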
DeepNVMe now supports a wider range of applications:
To make DeepNVMe more accessible, we have expanded its usability in several ways:
To quantify the improvements, we conducted benchmarks on various DL workloads. Here are some key findings:
This DeepNVMe update targets I/O bottlenecks in both training and inference: it extends coverage to FastPersist checkpointing and SGLang inference, takes advantage of PCIe Gen5 NVMe SSDs, and broadens usability with CPU-only support, offset-based I/O operations, and tensor data type casting. The enhancements are available in DeepSpeed versions >= 0.17.1.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
19 June 2025