
As LLM inference engines take on more complex agentic tasks, traditional benchmarking methods fall short: they do not account for the multi-turn interactions and context management that these applications require.
Large language model (LLM) inference engines have traditionally been benchmarked using simple, single-turn workloads. These benchmarks typically send a prompt of P tokens and generate D tokens, measuring metrics such as time-to-first-token, tokens-per-second, and completion throughput. Common configurations, written as P/D, include 1k/8k for decode-heavy, 8k/1k for prefill-heavy, and 1k/1k for balanced workloads. However, the rise of agentic applications has introduced challenges that these benchmarks do not capture.
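The three metrics named above can be derived from per-token arrival timestamps. The following is a minimal sketch, not taken from any particular benchmarking tool: `RequestTrace` and its fields are hypothetical names, and the throughput definition used here (all generated tokens divided by the wall-clock span of the run) is one common convention, assumed for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one single-turn request (times in seconds)."""
    prompt_tokens: int        # P: number of prompt tokens sent
    send_time: float          # when the request was sent
    token_times: List[float]  # arrival time of each of the D generated tokens


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the first token."""
    return trace.token_times[0] - trace.send_time


def decode_tokens_per_second(trace: RequestTrace) -> float:
    """Per-request decode rate, measured after the first token arrives."""
    n = len(trace.token_times)
    span = trace.token_times[-1] - trace.token_times[0]
    if n < 2 or span <= 0:
        return 0.0
    return (n - 1) / span


def completion_throughput(traces: List[RequestTrace]) -> float:
    """Generated tokens per second across all requests in the run."""
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.send_time for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)


# Illustrative, made-up timestamps for two overlapping requests.
trace_a = RequestTrace(1000, 0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
trace_b = RequestTrace(1000, 0.5, [1.0, 2.0])
```

Note that time-to-first-token is dominated by prefill, while the decode rate excludes it, which is why the P/D ratio shifts which metric a workload stresses.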
Agentic applications, such as multi-turn chatbots, tool-using agents, and interactive coding assistants, operate under a fundamentally different workload pattern. These applications involve multiple turns of interaction, where the model generates responses, calls external tools, receives outputs, and continues generating based on new context. This loop can repeat dozens or even hundreds of times until the agent completes its task.
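To make the contrast with single-turn workloads concrete, the sketch below tracks how the prompt grows across the turns of one agent session. The turn sizes are made-up illustrative numbers, `simulate_session` is a hypothetical helper, and the assumption that the engine keeps the KV cache for the entire previous context (prefix caching) belongs to this sketch, not to any specific engine.

```python
from typing import Dict, List, Tuple


def simulate_session(
    system_prompt_tokens: int,
    turns: List[Tuple[int, int]],
) -> List[Dict[str, int]]:
    """Model one agent session.

    Each turn is (new_input_tokens, generated_tokens), where new_input_tokens
    is the user message or tool output appended before the model responds.
    """
    context = system_prompt_tokens
    cached = 0  # tokens whose KV cache is assumed to be reusable
    rows = []
    for new_input, generated in turns:
        context += new_input  # tool output or message joins the context
        rows.append({
            "prompt": context,                # full prompt length this turn
            "cached_prefix": cached,          # reusable from earlier turns
            "new_prefill": context - cached,  # tokens that still need prefill
            "decode": generated,
        })
        context += generated  # the model's response joins the context
        cached = context      # assumption: whole context stays cached
    return rows


# Illustrative numbers only: a 1,000-token system prompt and two turns.
rows = simulate_session(1000, [(200, 100), (300, 50)])
prefill_without_cache = sum(r["prompt"] for r in rows)
prefill_with_cache = sum(r["new_prefill"] for r in rows)
```

Under these assumptions, every turn after the first resends a long prompt but only a small suffix is new, a pattern that a fixed P/D single-turn benchmark does not exercise.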
To address these challenges, Applied Compute has released three distinct workload profiles and an open-source benchmarking harness. These tools aim to provide a more accurate representation of modern agentic workloads, helping developers optimize inference engines and hardware accelerators for real-world performance.

The open-source benchmarking harness allows developers to replay these workload profiles in their own environments. This enables more realistic load testing and performance optimization, ensuring that inference engines can handle the complexities of agentic applications.
Understanding the right metrics is crucial for optimizing inference engines in various deployment contexts:
Over 100 production multi-turn post-training runs sampled from different deployments, Applied Compute observed a mean:
The shift towards agentic workloads has introduced new complexities in LLM inference that traditional single-turn benchmarks do not capture. With the new workload profiles and benchmarking harness, developers can measure and optimize their inference engines against workloads that resemble real-world applications, which should in turn support more efficient, responsive, and reliable agentic systems.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
The Engineer, This Week's Edition, 23 April 2026