
Researchers examine hardware approaches such as High Bandwidth Flash and Processing-Near-Memory to overcome the memory and interconnect bottlenecks of large language model inference.
In a recent paper, "Challenges and Research Directions for Large Language Model Inference Hardware," authors Xiaoyu Ma and David Patterson examine the challenges specific to running large language models (LLMs) for inference. Unlike training, where compute is often the bottleneck, inference is limited primarily by memory and interconnect. The paper highlights four architectural research opportunities to address these limits: High Bandwidth Flash, Processing-Near-Memory, 3D memory-logic stacking, and low-latency interconnects.
LLM inference is fundamentally different from training due to the autoregressive decode phase of the Transformer model. This phase requires sequential processing, where each token prediction depends on the previous tokens, making it hard to parallelize effectively. As a result, the primary bottlenecks are memory capacity and bandwidth, as well as communication latency between components.
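A back-of-the-envelope estimate shows why decode tends to be limited by memory rather than compute. The sketch below assumes a dense model whose weights are all read once per generated token, and it ignores the KV cache and any overlap of compute with memory traffic. The function name and every hardware figure are hypothetical and chosen only for illustration; none of them come from the paper.

```python
def decode_step_times(n_params, bytes_per_param, batch_size,
                      mem_bw_bytes_s, peak_flops):
    """Rough lower bounds, in seconds, for one decode step of a dense model.

    Assumptions: every weight is read once per step, each weight contributes
    one multiply-add (2 FLOPs) per sequence in the batch, KV cache is ignored.
    """
    weight_bytes = n_params * bytes_per_param
    flops = 2 * n_params * batch_size
    t_memory = weight_bytes / mem_bw_bytes_s   # time to stream the weights
    t_compute = flops / peak_flops             # time to do the arithmetic
    return t_memory, t_compute


# Hypothetical accelerator and model, for illustration only.
t_mem, t_cmp = decode_step_times(
    n_params=70e9,          # 70 billion parameters
    bytes_per_param=2,      # 16-bit weights
    batch_size=1,
    mem_bw_bytes_s=3e12,    # 3 TB/s of memory bandwidth
    peak_flops=1e15,        # 1 PFLOP/s of peak compute
)
bound = "memory" if t_mem > t_cmp else "compute"
print(f"memory: {t_mem * 1e3:.1f} ms, compute: {t_cmp * 1e3:.2f} ms -> {bound}-bound")
```

Under these assumptions the time to stream the weights exceeds the arithmetic time by a wide margin at small batch sizes. Raising `batch_size` increases only the compute term, which is one reason batching helps throughput without helping per-token latency.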
One proposed direction is High Bandwidth Flash (HBF), which aims to provide 10 times the memory capacity of current solutions with bandwidth comparable to High Bandwidth Memory (HBM). Greater capacity at that bandwidth could let more of a model's data stay in fast, local memory, reducing accesses to slower or more distant memory and, with them, latency and power consumption.
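To make the capacity argument concrete, the sketch below counts how many devices are needed just to hold a model's weights, before and after a 10x capacity increase. The model size and the per-device capacity are hypothetical figures picked for illustration, and the calculation ignores the KV cache, activations, and any replication.

```python
import math


def devices_needed(weight_bytes, capacity_bytes_per_device):
    """Minimum device count whose combined memory holds the weights alone."""
    return math.ceil(weight_bytes / capacity_bytes_per_device)


GB = 10**9
weight_bytes = 1000 * GB                 # hypothetical: 1e12 parameters at 1 byte each
baseline_capacity = 192 * GB             # hypothetical per-device memory capacity
hbf_capacity = 10 * baseline_capacity    # the 10x capacity target described above

baseline_devices = devices_needed(weight_bytes, baseline_capacity)
hbf_devices = devices_needed(weight_bytes, hbf_capacity)
print(f"baseline: {baseline_devices} devices, with 10x capacity: {hbf_devices}")
```

Fewer devices per model also means fewer hops between devices on each decode step, which ties the capacity question to the interconnect question.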
Processing-Near-Memory (PNM) architectures place compute close to where data is stored, so less data has to travel between memory and processor. This can raise the memory bandwidth available to compute and cut access latency, which makes it well suited to LLM inference.

3D memory-logic stacking vertically integrates logic and memory layers into a single, more compact package. Shortening the path between memory and logic in this way can increase memory bandwidth and reduce latency, which is particularly useful for LLM inference, where large volumes of data must be read from memory for every generated token.
Finally, a low-latency interconnect is crucial for speeding up communication between different components of the system. This is especially important in distributed inference setups where multiple nodes need to communicate efficiently.
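A simple cost model helps show when latency, rather than bandwidth, dominates a transfer. The sketch below uses the common approximation that transfer time equals a fixed latency plus message size divided by bandwidth. The link figures and message sizes are hypothetical, and the premise that per-step messages in distributed inference are small is an assumption of this example.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_s):
    """Approximate time to send one message: fixed latency plus serialization."""
    return latency_s + message_bytes / bandwidth_bytes_s


def latency_fraction(message_bytes, latency_s, bandwidth_bytes_s):
    """Share of the total transfer time spent on fixed latency."""
    total = transfer_time(message_bytes, latency_s, bandwidth_bytes_s)
    return latency_s / total


# Hypothetical link: 2 microseconds of latency, 100 GB/s of bandwidth.
LATENCY_S = 2e-6
BANDWIDTH = 100e9

small_message = 16 * 1024    # 16 KiB, an assumed per-step message size
large_message = 2**30        # 1 GiB, a bulk transfer for comparison

frac_small = latency_fraction(small_message, LATENCY_S, BANDWIDTH)
frac_large = latency_fraction(large_message, LATENCY_S, BANDWIDTH)
print(f"latency share, small message: {frac_small:.0%}")
print(f"latency share, large message: {frac_large:.2%}")
```

With these numbers, nearly all of the time for the small message is fixed latency, so adding bandwidth would barely help, while the bulk transfer is dominated by bandwidth instead.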
While the focus of this research is on datacenter AI, these architectural advancements could also benefit mobile devices. For instance, PNM and 3D stacking can be adapted to create more efficient and powerful mobile processors, enabling advanced LLM capabilities on edge devices.
The challenges of LLM inference are significant, but emerging hardware architectures like High Bandwidth Flash, Processing-Near-Memory, 3D memory-logic stacking, and low-latency interconnects offer promising solutions. By addressing the primary bottlenecks of memory capacity, bandwidth, and communication latency, these innovations can pave the way for more efficient and powerful LLM inference systems.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
26 January 2026