
FlashPack slashes the time PyTorch models take to load onto GPUs, boosting performance for those without GPUDirect Storage and keeping hardware utilization at peak efficiency.
When it comes to deploying machine learning models in production, performance isn’t just about how efficiently your GPUs can crunch numbers; it’s also about how quickly you can get your model up and running. Every second spent waiting for a checkpoint to load is a second your GPUs sit idle instead of delivering value to users.
That's why the team at Fal.ai introduced FlashPack, a new high-throughput file format and loading mechanism for PyTorch that significantly accelerates model checkpoint I/O, even on systems without access to GPUDirect Storage (GDS).
If you’ve ever waited 30 seconds or more for a large model to load, you’re familiar with the pain of slow model I/O. Most checkpoints, whether in .pt or .safetensors format, store each weight tensor as a distinct object. Loading such a checkpoint means working through those tensors one at a time.
This stop-and-go pipeline is full of synchronization points and unnecessary overhead. While .safetensors improved this by reducing load times, FlashPack takes things even further.
FlashPack rethinks checkpoint loading from the ground up. It’s built on a few key observations:
FlashPack takes the model’s entire state_dict and flattens it into a single, contiguous stream of bytes. At the end of the file, it stores a compact weight map that records the key, shape, and offset of every parameter and buffer. The result is like a single, perfectly indexed file instead of thousands of tiny ones.
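To make the layout concrete, here is a minimal sketch of the idea: one contiguous payload followed by an index. It is not FlashPack's actual on-disk format. The JSON index, the 8-byte length footer, and the `pack`/`unpack` helpers are all assumptions made for illustration, and plain `array` objects stand in for tensors so the example runs without PyTorch.

```python
import json
import struct
from array import array


def pack(state):
    """Flatten {key: (shape, values)} into one blob with a trailing index.

    Hypothetical layout: [payload bytes][JSON index][8-byte index length].
    """
    payload = bytearray()
    index = {}
    for key, (shape, values) in state.items():
        raw = values.tobytes()
        index[key] = {
            "shape": list(shape),
            "typecode": values.typecode,
            "offset": len(payload),  # where this entry starts in the payload
            "nbytes": len(raw),
        }
        payload += raw
    index_bytes = json.dumps(index).encode("utf-8")
    # The fixed-size footer lets a reader find the index from the end of the file.
    return bytes(payload) + index_bytes + struct.pack("<Q", len(index_bytes))


def unpack(blob):
    """Read the footer, then the index, then slice each entry out of the payload."""
    (index_len,) = struct.unpack("<Q", blob[-8:])
    index = json.loads(blob[-8 - index_len:-8].decode("utf-8"))
    out = {}
    for key, meta in index.items():
        values = array(meta["typecode"])
        values.frombytes(blob[meta["offset"]:meta["offset"] + meta["nbytes"]])
        out[key] = (tuple(meta["shape"]), values)
    return out


# Stand-in "state_dict": two small float32 entries.
state = {
    "linear.weight": ((2, 3), array("f", [1, 2, 3, 4, 5, 6])),
    "linear.bias": ((2,), array("f", [0.5, -0.5])),
}
blob = pack(state)
restored = unpack(blob)
```

Because every entry is addressed by offset into one payload, a reader can stream the payload in large sequential reads and only consult the index to carve it back into named tensors. A real format would likely also handle alignment and mixed dtypes, which this sketch ignores.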

When it’s time to load, FlashPack doesn’t do a slow read() loop. Instead, it memory-maps the file and divides it into a few mid-sized CPU buffers (≤64MB each). These buffers are loaded in a round-robin pattern, keeping disk reads continuous and efficient.
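The chunked, round-robin read described above might look roughly like the following. This is a sketch under stated assumptions, not FlashPack's code: `read_in_chunks` and its parameters are hypothetical names, and ordinary `bytearray` objects stand in for the CPU staging buffers.

```python
import mmap
import os
import tempfile

CHUNK_BYTES = 64 * 1024 * 1024  # upper bound per staging buffer, per the article


def read_in_chunks(path, num_buffers=4, chunk_bytes=CHUNK_BYTES):
    """Yield (slot, view) pairs, filling a fixed pool of buffers round-robin.

    Each yielded view is only valid until its slot is reused.
    The file must be non-empty, since an empty file cannot be memory-mapped.
    """
    buffers = [bytearray(chunk_bytes) for _ in range(num_buffers)]
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for i, start in enumerate(range(0, len(mm), chunk_bytes)):
            slot = i % num_buffers                 # round-robin buffer choice
            n = min(chunk_bytes, len(mm) - start)  # last chunk may be short
            buffers[slot][:n] = mm[start:start + n]
            yield slot, memoryview(buffers[slot])[:n]


# Demo with tiny sizes so the slot reuse is visible.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(10)))

slots, parts = [], []
for slot, view in read_in_chunks(tmp.name, num_buffers=2, chunk_bytes=4):
    slots.append(slot)
    parts.append(bytes(view))  # copy out before the slot is overwritten
os.remove(tmp.name)
```

With two buffers and 4-byte chunks, a 10-byte file is read as three chunks landing in slots 0, 1, and 0 again, which is the reuse pattern that keeps memory bounded while reads stay sequential.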
Each CPU buffer is paired with a dedicated CUDA stream. As soon as one buffer is filled, it’s flushed asynchronously to the GPU with no waiting. While one stream writes to VRAM, another buffer is already being loaded from disk. This overlap ensures that by the time all data is in memory, your model is ready to go.
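The overlap itself can be illustrated without a GPU. The sketch below is a CPU-only analogy, not FlashPack's implementation: two threads and a pair of queues stand in for the disk reader and the CUDA streams, and a `bytearray` stands in for VRAM. A real loader would presumably use pinned host buffers and non-blocking host-to-device copies on separate `torch.cuda.Stream` objects. The `pipelined_copy` name and its parameters are hypothetical.

```python
import queue
import threading


def pipelined_copy(source, num_buffers=2, chunk_bytes=4):
    """Overlap 'disk' reads with 'device' writes using a small buffer pool."""
    free = queue.Queue()    # buffers ready to be refilled
    filled = queue.Queue()  # (buffer, length) pairs ready to upload
    for _ in range(num_buffers):
        free.put(bytearray(chunk_bytes))
    device = bytearray()    # stand-in for GPU memory

    def reader():
        for start in range(0, len(source), chunk_bytes):
            buf = free.get()  # blocks only when every buffer is in flight
            chunk = source[start:start + chunk_bytes]
            buf[:len(chunk)] = chunk
            filled.put((buf, len(chunk)))
        filled.put(None)      # sentinel: no more data

    def uploader():
        while (item := filled.get()) is not None:
            buf, n = item
            device.extend(buf[:n])  # stand-in for an async host-to-device copy
            free.put(buf)           # hand the buffer back for reuse

    threads = [threading.Thread(target=reader), threading.Thread(target=uploader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(device)


result = pipelined_copy(bytes(range(10)))
```

The reader only stalls when all buffers are waiting to be uploaded, and the uploader only stalls when nothing has been read yet, so the two stages run concurrently for most of the transfer. A single reader and a single uploader over a FIFO queue keep the chunks in order.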
With FlashPack, loading a model can be 3–6× faster than with current state-of-the-art methods like accelerate or the standard load_state_dict() and to() flow. This speedup comes without sacrificing ease of use: FlashPack is a lightweight, pure-Python package that works anywhere.
FlashPack represents a significant step forward in PyTorch model loading. By flattening the state_dict, memory-mapping reads, and overlapping CPU and GPU operations, it achieves remarkable performance gains. Whether you’re deploying models on edge devices or in large-scale data centers, FlashPack can help you get your models running faster and more efficiently.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
27 October 2025