FlashPack: Revolutionizing PyTorch Model Loading for Faster GPU Performance

The Engineer

27 Oct 2025

FlashPack cuts the time PyTorch models take to load onto GPUs, including on systems without GPUDirect Storage, so hardware spends less time sitting idle.

When it comes to deploying machine learning models in production, performance isn’t just about how efficiently your GPUs can crunch numbers; it’s also about how quickly you can get your model up and running. Every second spent waiting for a checkpoint to load is a second your GPUs sit idle instead of delivering value to users.

That's why the team at Fal.ai introduced FlashPack, a new high-throughput file format and loading mechanism for PyTorch that significantly accelerates model checkpoint I/O, even on systems without access to GPUDirect Storage (GDS).

The Current Landscape

If you’ve ever waited 30 seconds or more for a large model to load, you’re familiar with the pain of slow model I/O. Most checkpoints, whether in .pt or .safetensors format, store each weight tensor as a distinct object in the file. When loading these models:

The CPU reads a chunk from disk.
Data is moved into RAM.
The CPU sends it to GPU memory.
This process repeats thousands of times in series.

This stop-and-go pipeline is full of synchronization points and unnecessary overhead. While .safetensors improved this by reducing load times, FlashPack takes things even further.

How FlashPack Works

FlashPack rethinks checkpoint loading from the ground up. It’s built on a few key observations:

Uniform Data Types: Most models use the same data type throughout (e.g., float16 or bfloat16).
Efficient Tensor Reshaping: Reshaping a contiguous tensor is an O(1) operation; it changes only the shape metadata and doesn’t copy data.
Parallel Processing: The CPU and GPU can work in parallel if given the right structure.
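
The reshaping observation can be illustrated without PyTorch at all. The sketch below is a stand-in, not FlashPack code: it assumes a tiny float32 buffer and uses Python's built-in memoryview to reinterpret it as a 2×3 matrix, then shows that both names refer to the same bytes. In PyTorch, Tensor.view() plays the analogous role.

```python
from array import array

# Flat buffer of six float32 values, standing in for a flattened checkpoint.
flat = array("f", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# memoryview.cast() needs a byte format on one side, so go through "B" first.
# Neither cast copies the underlying data.
raw = memoryview(flat).cast("B")
matrix = raw.cast("f", shape=[2, 3])

# A write through the flat buffer is visible through the reshaped view,
# which shows that the two share memory.
flat[5] = 50.0
last = matrix[1, 2]
```

Because only the shape metadata changes, the cost of producing the view does not depend on how many elements the buffer holds.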

1. Flatten Everything into One Block

FlashPack takes the model’s entire state_dict and flattens it into a single, contiguous stream of bytes. At the end of the file, it stores a compact weight map that records where every parameter and buffer lives: its key, shape, and offset. This is like creating a single, perfectly indexed file instead of thousands of tiny ones.
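
A minimal sketch of this layout follows. It is not FlashPack's real on-disk format: the JSON index, the 8-byte length footer, and the names pack and read_index are all assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
import json
import os
import struct
import tempfile
from array import array
from math import prod


def pack(tensors, path):
    """Write {name: (shape, flat_values)} as one float32 payload plus a trailing index.

    Hypothetical layout: [payload][JSON index][8-byte little-endian index length].
    """
    index = {}
    offset = 0
    with open(path, "wb") as f:
        for name, (shape, values) in tensors.items():
            if len(values) != prod(shape):
                raise ValueError(f"{name}: shape {shape} does not match {len(values)} values")
            data = array("f", values).tobytes()  # native-endian float32
            f.write(data)
            index[name] = {"shape": list(shape), "offset": offset, "nbytes": len(data)}
            offset += len(data)
        index_bytes = json.dumps(index).encode("utf-8")
        f.write(index_bytes)
        f.write(struct.pack("<Q", len(index_bytes)))
    return index


def read_index(path):
    """Recover the weight map by reading the footer, then the index before it."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        (index_len,) = struct.unpack("<Q", f.read(8))
        f.seek(-(8 + index_len), os.SEEK_END)
        return json.loads(f.read(index_len))


demo = {
    "linear.weight": ((2, 3), [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]),
    "linear.bias": ((2,), [0.5, -0.5]),
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pack")
    written = pack(demo, demo_path)
    recovered = read_index(demo_path)
```

Putting the index at the end lets the writer stream tensors out in one pass, since offsets are only known once each tensor has been written.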

2. Stream Smartly with Memory-Mapped Reads

When it’s time to load, FlashPack doesn’t run a slow read() loop. Instead, it memory-maps the file and copies it through a few mid-sized CPU buffers (≤64 MB each). These buffers are filled in a round-robin pattern, keeping disk reads continuous and efficient.
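
The round-robin pattern can be sketched with Python's mmap module. This is an illustration under stated assumptions rather than FlashPack's loader: stream_chunks is a hypothetical helper, and the demo uses a 4-byte chunk size on a 10-byte file so the buffer rotation is easy to follow.

```python
import mmap
import os
import tempfile


def stream_chunks(path, chunk_size, num_buffers=2):
    """Yield (slot, view) pairs, filling a fixed set of buffers in rotation.

    Each yielded view is only valid until its slot is reused, so the
    consumer must finish with it before asking for later chunks.
    """
    buffers = [memoryview(bytearray(chunk_size)) for _ in range(num_buffers)]
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        total = len(mm)
        for i, start in enumerate(range(0, total, chunk_size)):
            end = min(start + chunk_size, total)
            slot = i % num_buffers            # round-robin buffer choice
            n = end - start
            buffers[slot][:n] = mm[start:end]  # copy from the mapping into the buffer
            yield slot, buffers[slot][:n]


payload = bytes(range(10))
with tempfile.TemporaryDirectory() as tmp:
    chunk_path = os.path.join(tmp, "payload.bin")
    with open(chunk_path, "wb") as f:
        f.write(payload)
    slots = []
    pieces = []
    for slot, view in stream_chunks(chunk_path, chunk_size=4, num_buffers=2):
        slots.append(slot)
        pieces.append(bytes(view))  # consume the chunk before its slot is reused
    rebuilt = b"".join(pieces)
```

Reusing a small, fixed set of buffers keeps host memory use bounded no matter how large the checkpoint is.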

3. Overlap Disk, CPU, and GPU with CUDA Streams

Each CPU buffer is paired with a dedicated CUDA stream. As soon as one buffer is filled, it’s flushed asynchronously to the GPU without waiting for the others. While one stream writes to VRAM, another buffer is already being loaded from disk. This overlap means that by the time the last chunk has been read from disk, most of the model is already on the GPU.
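
The scheduling logic behind this overlap can be sketched without a GPU. In the simulation below, worker threads stand in for CUDA streams and a bytearray stands in for VRAM; load_overlapped is a hypothetical name, and none of this is FlashPack's actual code. In PyTorch, the corresponding pieces would be pinned host buffers, torch.cuda.Stream, and copy_(..., non_blocking=True).

```python
from concurrent.futures import ThreadPoolExecutor


def load_overlapped(source, chunk_size, num_buffers=2):
    """Copy `source` to a simulated device through rotating host buffers."""
    device = bytearray(len(source))                             # stand-in for GPU memory
    host = [bytearray(chunk_size) for _ in range(num_buffers)]  # stand-in for pinned buffers
    pending = [None] * num_buffers                              # in-flight transfer per buffer

    def transfer(slot, offset, nbytes):
        # Stand-in for an asynchronous host-to-device copy on this slot's stream.
        device[offset:offset + nbytes] = host[slot][:nbytes]

    with ThreadPoolExecutor(max_workers=num_buffers) as pool:
        for i, offset in enumerate(range(0, len(source), chunk_size)):
            slot = i % num_buffers
            # A buffer may only be refilled once its previous transfer is done.
            if pending[slot] is not None:
                pending[slot].result()
            chunk = source[offset:offset + chunk_size]          # stand-in for a disk read
            host[slot][:len(chunk)] = chunk
            # Hand the filled buffer off and move on to the next read immediately.
            pending[slot] = pool.submit(transfer, slot, offset, len(chunk))
        # Wait for the transfers that are still in flight.
        for future in pending:
            if future is not None:
                future.result()
    return bytes(device)


payload = bytes(range(256)) * 4
loaded = load_overlapped(payload, chunk_size=100, num_buffers=3)
```

The only synchronization point is the wait before a buffer is reused, which is what lets reads and transfers for different buffers proceed at the same time.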

Performance Gains

With FlashPack, loading a model can be 3–6× faster than with existing approaches such as accelerate or the standard load_state_dict() followed by to(). This speedup does not come at the cost of ease of use: FlashPack is a lightweight, pure-Python package that works anywhere.

Implementation Details

File Format: The FlashPack file format is designed to be compact and efficient. It stores all model weights in a single contiguous block, followed by a weight map holding each entry's key, shape, and offset.
Memory Mapping: Memory-mapped reads load data from disk without the overhead of many small read() calls.
CUDA Streams: By pairing each CPU buffer with its own CUDA stream, FlashPack overlaps disk reads with host-to-device transfers, so the GPU spends less time waiting for weights.

Conclusion

FlashPack represents a significant step forward in PyTorch model loading. By flattening the state_dict, memory-mapping reads, and overlapping CPU and GPU operations, it achieves remarkable performance gains. Whether you’re deploying models on edge devices or in large-scale data centers, FlashPack can help you get your models running faster and more efficiently.