FlashDrive: Optimizing Vision-Language-Action Inference for Real-Time Autonomous Driving

The Engineer

21 Apr 2026

FlashDrive cuts the time an autonomous vehicle's vision-language-action model needs to make a decision from 716ms to 159ms per step, a 4.5× speedup that brings it much closer to the real-time performance that safety and efficiency on the road demand.

Traditional autonomous driving systems often separate perception from planning, which can make them brittle in rare, complex scenarios. Vision-Language-Action (VLA) models address this by integrating chain-of-thought reasoning, which helps them handle novel situations more effectively. NVIDIA's Alpamayo 1.5 is a notable example, but at 716ms per step (about 1.4 Hz) on an NVIDIA RTX PRO 6000, it falls short of the real-time requirements of safe driving. This is where FlashDrive comes in: a framework designed to reduce end-to-end latency by optimizing all four stages of VLA inference, namely encode, prefill, decode, and action generation.

The Bottleneck Is Everywhere

A typical VLA model's inference pipeline consists of the following stages:

Vision Encoding: Converting raw sensor data into a meaningful representation.
Prompt Prefilling: Preparing the context for reasoning.
Reasoning Token Decoding: Generating step-by-step reasoning and trajectory predictions.
Action Generation: Producing control commands via flow matching.

When we profiled Alpamayo 1.5, we found that latency is distributed across all four stages:

Encode: 88ms
Prefill: 177.2ms
Decode: 263.8ms
Action: 187.4ms

Total: 716ms

The decode and action stages together account for about 63% of the total latency, but encode and prefill make up the remaining 37%, so optimizing any single stage is not enough to meet real-time requirements.
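To make the point concrete, the short sketch below recomputes each stage's share from the profile above and asks a hypothetical question: what would the total be if one stage cost nothing at all? The zero-cost case is an unreachable upper bound used only for illustration, and the helper names are ours, not part of FlashDrive.

```python
# Per-stage latencies in milliseconds, taken from the profile above.
baseline_ms = {"encode": 88.0, "prefill": 177.2, "decode": 263.8, "action": 187.4}

total_ms = sum(baseline_ms.values())


def share(stage):
    """Fraction of the total latency spent in one stage."""
    return baseline_ms[stage] / total_ms


def best_case_without(stage):
    """Total latency if one stage were free (a hypothetical, unreachable bound)."""
    return total_ms - baseline_ms[stage]


for stage in baseline_ms:
    remaining = best_case_without(stage)
    print(f"{stage:8s} share={share(stage):5.1%}  "
          f"total if free={remaining:6.1f} ms ({1000 / remaining:.2f} Hz)")
```

Even if decode, the largest stage, were eliminated entirely, a step would still take about 452.6ms, or roughly 2.2 Hz.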

Streaming Inference for Continuous Video Streams

Unlike a chatbot VLM that processes a single image per request, a driving VLA must handle a continuous multi-camera video stream. At each step, the model processes a sliding window of temporal frames from multiple camera views (e.g., 4 frames × 4 views). However, consecutive time steps overlap by 75%: three out of four frames are identical. Re-encoding the full window from scratch every step is computationally wasteful.
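The 75% figure follows directly from the window geometry. The sketch below works it out for the 4 frames × 4 views example, assuming the window advances by one frame per step (the stride is our assumption; it is the value consistent with "three out of four frames are identical").

```python
def window_overlap(window_frames: int, stride_frames: int) -> float:
    """Fraction of a sliding window that is shared with the previous step."""
    if not 0 < stride_frames <= window_frames:
        raise ValueError("stride must be between 1 and the window length")
    return (window_frames - stride_frames) / window_frames


WINDOW_FRAMES = 4  # temporal frames per view, from the example above
NUM_VIEWS = 4      # camera views, from the example above
STRIDE_FRAMES = 1  # assumption: the window slides by one frame per step

overlap = window_overlap(WINDOW_FRAMES, STRIDE_FRAMES)
images_per_step_full = WINDOW_FRAMES * NUM_VIEWS  # re-encode everything
images_per_step_new = STRIDE_FRAMES * NUM_VIEWS   # encode only what changed

print(overlap, images_per_step_full, images_per_step_new)
```

Under these assumptions, a full re-encode touches 16 images per step, of which only 4 are new.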

FlashDrive introduces a streaming inference strategy that processes only the new frames and reuses the encoded features of the overlapping frames. This approach significantly reduces redundant computation:

Streaming Vision Encoding: Instead of encoding all frames, FlashDrive encodes only the new frames and updates the feature maps incrementally.
Streaming Prefill: The context is updated with the new frame's features without recomputing the entire prefill.
Efficient Decode: By maintaining a rolling buffer of encoded features, the model can decode reasoning tokens more efficiently.
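The rolling-buffer idea behind these three points can be sketched in a few lines. This is a minimal illustration, not FlashDrive's implementation: `StreamingFeatureCache` and `toy_encoder` are hypothetical names, the encoder is a stand-in that only tags its input, and the sketch assumes each frame can be encoded independently of the others in the window.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence


class StreamingFeatureCache:
    """Rolling buffer of per-timestep features; only new timesteps are encoded."""

    def __init__(self, encode_fn: Callable[[str], str], window_frames: int = 4):
        self.encode_fn = encode_fn
        self.buffer: Deque[List[str]] = deque(maxlen=window_frames)
        self.encode_calls = 0

    def step(self, new_views: Sequence[str]) -> List[List[str]]:
        """Encode one new timestep (all camera views) and return the window."""
        features = [self.encode_fn(view) for view in new_views]
        self.encode_calls += len(new_views)
        self.buffer.append(features)  # the deque drops the oldest timestep when full
        return list(self.buffer)


def toy_encoder(image: str) -> str:
    # Hypothetical stand-in for a vision encoder.
    return f"feat({image})"


cache = StreamingFeatureCache(toy_encoder, window_frames=4)
for t in range(10):
    views = [f"t{t}_cam{c}" for c in range(4)]
    window = cache.step(views)

# Cost of re-encoding the whole (possibly still filling) window at every step.
naive_calls = sum(min(t + 1, 4) * 4 for t in range(10))
print(cache.encode_calls, naive_calls)
```

Over ten steps the cached version runs the encoder 40 times against 136 for a full re-encode. A real system needs more care than this: if the encoder attends across frames, or if cached prefill state depends on token positions, reused entries may have to be adjusted rather than copied as-is.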

Key Optimizations in FlashDrive

To achieve a 4.5× speedup with negligible accuracy loss, FlashDrive employs several key optimizations:

Memory Management: Efficiently managing GPU memory to reduce data transfer times.
Parallel Processing: Leveraging parallelism across multiple GPUs to distribute the workload.
Algorithmic Improvements: Refining the algorithms used in each stage to minimize computational overhead.

Performance Benchmarks

On an NVIDIA RTX PRO 6000, FlashDrive reduces the end-to-end latency from 716ms to 159ms:

Encode: 24ms
Prefill: 38.2ms
Decode: 63.8ms
Action: 33.4ms

Total: 159ms

This 4.5× speedup raises the step rate from about 1.4 Hz to about 6.3 Hz, bringing the system much closer to the real-time performance that safe and reliable autonomous driving requires.
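The per-stage gains are not uniform, which the following sketch makes visible by dividing the baseline figures by the optimized ones. It uses only the numbers reported in this article; the variable and function names are ours.

```python
# Latencies in milliseconds, as reported in the two breakdowns in this article.
baseline_ms = {"encode": 88.0, "prefill": 177.2, "decode": 263.8, "action": 187.4}
optimized_ms = {"encode": 24.0, "prefill": 38.2, "decode": 63.8, "action": 33.4}


def speedup(before: float, after: float) -> float:
    """How many times faster the optimized figure is than the baseline."""
    return before / after


per_stage = {s: speedup(baseline_ms[s], optimized_ms[s]) for s in baseline_ms}
total_before = sum(baseline_ms.values())
total_after = sum(optimized_ms.values())
overall = speedup(total_before, total_after)
rate_hz = 1000.0 / total_after  # steps per second after optimization

for stage, factor in per_stage.items():
    print(f"{stage:8s} {factor:.2f}x")
print(f"overall  {overall:.2f}x  ({rate_hz:.2f} Hz)")
```

The action stage improves the most (about 5.6×) and encode the least (about 3.7×), while the overall ratio comes out at roughly 4.5×.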

Conclusion

FlashDrive is a significant step forward in optimizing VLA models for real-time autonomous driving. By addressing bottlenecks across all four stages of inference and introducing streaming strategies, it cuts end-to-end latency from 716ms to 159ms with negligible accuracy loss. As the field continues to evolve, frameworks like FlashDrive will be crucial in making autonomous systems more robust and responsive.