
SRAM-centric chips are challenging GPUs in AI inference by promising faster token generation and lower latency, and deals such as NVIDIA's $20B Groq license show how seriously the industry takes them. They also come with distinct tradeoffs that traditional performance metrics do not capture.
SRAM-centric accelerators from Cerebras, Groq, and d-Matrix are drawing serious attention in AI inference. Recent developments, such as NVIDIA's $20B licensing deal for Groq's IP and Cerebras's 750 MW deal with OpenAI, highlight the growing importance of these architectures. These chips promise substantial improvements in latency and throughput over traditional GPUs, but practitioners need to understand their characteristics and tradeoffs before relying on them.
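At the heart of the difference between SRAM-centric accelerators and GPUs lies the choice of memory technology. SRAM (static random-access memory) and HBM (high-bandwidth memory) both hold a model's data, but their properties differ in ways that affect performance. SRAM is built on the compute die itself, which makes it very fast but limits its capacity. HBM is stacked DRAM packaged beside the die, which offers far more capacity at lower bandwidth and higher latency.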
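The primary architectural difference between SRAM-centric accelerators and GPUs is the placement of memory relative to compute cores. SRAM-centric designs keep working data in memory distributed next to the cores, while GPUs fetch most of it from HBM that sits farther from compute.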
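One practical consequence of near-compute memory is limited capacity per chip. The following is a minimal sketch of the arithmetic, assuming a model's weights must fit entirely in on-chip SRAM and ignoring activations and KV cache. The function name `chips_needed` and all the numbers are hypothetical illustrations, not figures for any real accelerator.

```python
def chips_needed(num_params: int, bytes_per_param: int, sram_bytes_per_chip: int) -> int:
    """Return how many chips are required to hold the weights in on-chip SRAM.

    Assumes weights are split evenly across chips and that nothing else
    (activations, KV cache) competes for the same SRAM.
    """
    total_bytes = num_params * bytes_per_param
    # Ceiling division on integers, so a partial chip counts as a whole one.
    return -(-total_bytes // sram_bytes_per_chip)


# Hypothetical example: a 70B-parameter model on chips with 256 MiB of SRAM each.
HYPOTHETICAL_SRAM_PER_CHIP = 256 * 1024**2
chips_8bit = chips_needed(70_000_000_000, 1, HYPOTHETICAL_SRAM_PER_CHIP)
chips_16bit = chips_needed(70_000_000_000, 2, HYPOTHETICAL_SRAM_PER_CHIP)
print(f"8-bit weights: {chips_8bit} chips, 16-bit weights: {chips_16bit} chips")
```

Under these assumed numbers the model spans hundreds of chips, whereas the same weights would fit in the HBM of a handful of GPUs. That gap is the capacity side of the tradeoff.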
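Arithmetic intensity, defined as the ratio of compute operations to bytes moved to and from memory, plays a critical role in determining which architecture performs better. Workloads with low arithmetic intensity are limited by memory bandwidth, while workloads with high arithmetic intensity are limited by peak compute.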
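To make arithmetic intensity concrete, here is a minimal sketch that computes it for a dense matrix multiply and applies the standard roofline bound. It assumes each operand is read or written exactly once (no cache reuse) and 16-bit elements. The function names and the peak-compute and bandwidth figures are illustrative assumptions, not measurements of any particular chip.

```python
def gemm_arithmetic_intensity(m: int, k: int, n: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matrix multiply.

    Assumes both inputs are read once and the output is written once.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, mem_bandwidth: float) -> float:
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * mem_bandwidth)


# A single-token step multiplies one activation row by a weight matrix.
single_token = gemm_arithmetic_intensity(1, 4096, 4096)
# A large batch of tokens reuses the same weights across many rows.
many_tokens = gemm_arithmetic_intensity(4096, 4096, 4096)

# Hypothetical device: 1e15 FLOP/s peak compute, 3e12 bytes/s memory bandwidth.
for label, ai in [("single token", single_token), ("4096 tokens", many_tokens)]:
    bound = attainable_flops(ai, peak_flops=1e15, mem_bandwidth=3e12)
    print(f"{label}: {ai:.2f} FLOP/byte, attainable {bound:.3e} FLOP/s")
```

With these assumed figures, the single-token case sits at roughly one FLOP per byte and is bounded by bandwidth, so a faster memory system helps it directly. The large-batch case reaches the compute ceiling, so extra bandwidth no longer helps.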

At Gimlet, we operate a multi-silicon inference cloud that uses both traditional GPUs and SRAM-centric accelerators. Our software dynamically slices inference workloads and maps each slice to the most suitable hardware based on its characteristics. Operating both kinds of hardware side by side has given us a practical view of these tradeoffs.
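The idea of mapping workload slices to hardware can be illustrated with a toy placement rule. This is a sketch under stated assumptions, not a description of Gimlet's scheduler: the `Device` and `Stage` types, the `pick_device` function, and every number below are hypothetical. The rule estimates each stage's runtime with a roofline-style bound and picks the fastest device whose memory can hold the stage's working set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    peak_flops: float     # FLOP/s
    mem_bandwidth: float  # bytes/s
    mem_capacity: float   # bytes


@dataclass(frozen=True)
class Stage:
    name: str
    flops: float        # total floating-point operations
    bytes_moved: float  # total memory traffic in bytes
    working_set: float  # bytes that must be resident on the device


def estimated_time(stage: Stage, device: Device) -> float:
    """Roofline-style estimate: the slower of compute time and memory time."""
    return max(stage.flops / device.peak_flops, stage.bytes_moved / device.mem_bandwidth)


def pick_device(stage: Stage, devices: list[Device]) -> Device:
    """Choose the fastest device that has enough memory for the stage."""
    candidates = [d for d in devices if d.mem_capacity >= stage.working_set]
    if not candidates:
        raise ValueError(f"no device can hold stage {stage.name!r}")
    return min(candidates, key=lambda d: estimated_time(stage, d))


# Hypothetical hardware: a high-capacity GPU and a high-bandwidth SRAM chip.
gpu = Device("gpu", peak_flops=1e15, mem_bandwidth=3e12, mem_capacity=80e9)
sram = Device("sram", peak_flops=5e14, mem_bandwidth=5e13, mem_capacity=1e9)

# Hypothetical stages: one memory-bound, one compute-bound.
memory_bound = Stage("memory_bound", flops=2e9, bytes_moved=1e9, working_set=5e8)
compute_bound = Stage("compute_bound", flops=1e13, bytes_moved=1e9, working_set=5e8)

for stage in (memory_bound, compute_bound):
    print(stage.name, "->", pick_device(stage, [gpu, sram]).name)
```

A real scheduler would also have to account for interconnect transfers between devices, batching, and queueing, none of which this sketch models.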
As the industry continues to evolve, we expect new memory designs to emerge that bridge some of the gaps between SRAM and HBM. These innovations could lead to hybrid architectures that offer the best of both worlds, combining the speed of near-compute memory with the capacity of far-compute memory.
SRAM-centric accelerators are gaining traction in AI inference because they deliver low latency and high throughput for tasks with low arithmetic intensity, where memory bandwidth rather than compute is the bottleneck. GPUs remain a strong option for workloads with higher arithmetic intensity and for models that need more memory capacity, so understanding the tradeoffs between these architectures is essential for making informed decisions. As new memory technologies develop, we anticipate further convergence and innovation in this space.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
9 March 2026