
FlashDrive cuts the time a vision-language-action driving model needs to make each decision from 716ms to 159ms, a 4.5× speedup that brings it much closer to the real-time performance that safety and efficiency on the road demand.
In autonomous driving, traditional systems often separate perception and planning, which can make them brittle when dealing with rare, complex scenarios. Vision-Language-Action (VLA) models instead integrate chain-of-thought reasoning to handle novel situations more effectively. NVIDIA's Alpamayo 1.5 is a notable example, but its latency of 716ms per step on an NVIDIA RTX PRO 6000 (about 1.4 Hz) falls short of real-time requirements for safe driving. This is where FlashDrive comes in: a framework designed to reduce end-to-end latency by optimizing all four stages of VLA inference: encode, prefill, decode, and action generation.
A typical VLA model's inference pipeline consists of four stages: encode, prefill, decode, and action generation.
When we profiled Alpamayo 1.5, we found that latency is spread across all four stages, for a total of 716ms per step.
The decode and action stages together account for nearly two-thirds of the total latency, but the encode and prefill stages are significant enough that optimizing just one stage isn't sufficient to meet real-time requirements.
Unlike a chatbot VLM that processes a single image per request, a driving VLA must handle a continuous multi-camera video stream. At each step, the model processes a sliding window of temporal frames from multiple camera views (e.g., 4 frames × 4 views). However, consecutive time steps overlap by 75%: three out of four frames are identical. Re-encoding the full window from scratch every step is computationally wasteful.
FlashDrive introduces a streaming inference strategy that encodes only the new frames at each step and reuses the encoded features of the overlapping frames. With a 75% overlap, this removes most of the redundant encoding work.
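The reuse idea can be sketched in a few lines of Python. This is a minimal illustration, not FlashDrive's actual implementation: the class `StreamingWindowEncoder`, the stand-in `toy_encode` function, and the frame names are all hypothetical, and the sketch assumes the vision encoder treats each frame independently, so cached features stay valid as the window slides.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence


class StreamingWindowEncoder:
    """Keeps per-timestep features so each step encodes only the newest frames."""

    def __init__(self, encode_fn: Callable[[str], str], window: int = 4) -> None:
        self.encode_fn = encode_fn
        # deque(maxlen=...) drops the oldest timestep automatically when full.
        self.cache: Deque[List[str]] = deque(maxlen=window)
        self.encode_calls = 0

    def step(self, new_views: Sequence[str]) -> List[str]:
        """Encode one new frame per camera and return features for the whole window."""
        newest = []
        for frame in new_views:
            newest.append(self.encode_fn(frame))
            self.encode_calls += 1
        self.cache.append(newest)
        # Flatten oldest-to-newest timestep, then camera order within a timestep.
        return [feat for timestep in self.cache for feat in timestep]


def toy_encode(frame: str) -> str:
    # Hypothetical stand-in for an expensive vision encoder.
    return f"feat({frame})"


enc = StreamingWindowEncoder(toy_encode, window=4)
for t in range(6):
    views = [f"t{t}_cam{c}" for c in range(4)]  # 4 camera views per timestep
    window_feats = enc.step(views)

print(len(window_feats))   # 16 features: 4 timesteps x 4 views
print(enc.encode_calls)    # 24 encoder calls over 6 steps: 4 per step
```

Once the window is full, each step costs 4 encoder calls instead of the 16 that re-encoding the whole 4 × 4 window would need, which matches the 75% overlap described above.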

FlashDrive combines several optimizations across the four inference stages to achieve a 4.5× speedup with negligible accuracy loss.
On an NVIDIA RTX PRO 6000, FlashDrive reduces the total end-to-end latency from 716ms to 159ms per step.
This 4.5× speedup brings the system much closer to real-time performance, making it suitable for safe and reliable autonomous driving.
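As a quick check on these figures, the snippet below converts the two reported latencies into step rates and recomputes the speedup. The latencies are the ones quoted in this article; the helper name `step_rate_hz` is ours.

```python
def step_rate_hz(latency_ms: float) -> float:
    """Steps per second for a given per-step latency in milliseconds."""
    return 1000.0 / latency_ms


baseline_ms = 716.0   # Alpamayo 1.5 as profiled
optimized_ms = 159.0  # with FlashDrive

speedup = baseline_ms / optimized_ms

print(f"baseline:  {step_rate_hz(baseline_ms):.1f} Hz")   # 1.4 Hz
print(f"optimized: {step_rate_hz(optimized_ms):.1f} Hz")  # 6.3 Hz
print(f"speedup:   {speedup:.1f}x")                       # 4.5x
```

In other words, the model goes from roughly 1.4 decisions per second to roughly 6.3.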
FlashDrive represents a significant step forward in optimizing VLA models for real-time autonomous driving. By addressing bottlenecks across all four stages of inference and introducing streaming inference, it cuts latency by 4.5× with negligible accuracy loss. As the field continues to evolve, frameworks like FlashDrive will be crucial in making autonomous systems more robust and responsive.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
21 April 2026