
Exploring why even strict deterministic methods like greedy sampling fail to ensure consistency in LLM responses, revealing deeper issues within these complex systems.
Reproducibility is a cornerstone of scientific progress, but achieving it with large language models (LLMs) can be surprisingly challenging. Even when you ask an LLM like ChatGPT the same question multiple times, the results can vary. This isn't just due to the probabilistic nature of sampling; even deterministic methods like greedy sampling (where the model always picks the highest-probability token) don't guarantee consistent outputs.
At first glance, setting the temperature to 0 should make the LLM deterministic, because it forces the model to choose the most probable token at each step. In practice, this isn't the case. Whether you're calling an LLM API or running inference on your own hardware with open-source libraries like vLLM or SGLang, the same prompt can still produce different outputs from one run to the next.
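To make the temperature claim concrete, here is a minimal sketch in plain Python of how temperature scaling behaves as it approaches 0. The logits are made up for illustration, and the helper names `softmax_with_temperature` and `greedy_token` are hypothetical, not taken from any particular library. It assumes the common convention that temperature 0 is handled as an argmax, since dividing by zero is undefined.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for the logits at a positive temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits):
    """Greedy decoding: pick the index of the largest logit (argmax)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits for a three-token vocabulary.
logits = [2.0, 1.0, 0.5]

# As the temperature shrinks, probability mass concentrates on the argmax.
for t in (1.0, 0.5, 0.1, 0.01):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])

print("greedy pick:", greedy_token(logits))
```

On paper, then, greedy decoding is a pure function of the logits. Any run-to-run variation has to come from the logits themselves differing between runs.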
One common explanation for this nondeterminism is the "concurrency + floating point" hypothesis. This theory suggests that the combination of non-associative floating-point arithmetic and concurrent execution on GPUs leads to different results based on which core finishes its computation first. For instance, a recent arXiv preprint explains:
Floating-point arithmetic in GPUs exhibits non-associativity, meaning (a + b) + c ≠ a + (b + c) due to finite precision and rounding errors. This property directly impacts the computation of attention scores and logits in the transformer architecture, where parallel operations across multiple threads can yield different results based on execution order.
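The non-associativity described in the quote is easy to reproduce on a CPU with ordinary Python floats (IEEE 754 doubles). The sketch below is a contrived illustration, not a model of any real GPU kernel: the values, the `sum_in_order` helper, and the two-token "logits" are all hypothetical, chosen so that the rounding effect is large enough to flip a greedy choice.

```python
def sum_in_order(values, order):
    """Accumulate values in the given index order, as a sequential reduction would."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

# 1) Non-associativity: the same three numbers, grouped two ways.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c   # the large terms cancel first, so the 1.0 survives
right = a + (b + c)  # the 1.0 is rounded away when added to -1e16
print(left, right, left == right)

# 2) The same effect as a reduction whose order varies between runs.
values = [1e16, 1.0, -1e16, 1.0]
sum_a = sum_in_order(values, [0, 1, 2, 3])
sum_b = sum_in_order(values, [2, 1, 3, 0])
print(sum_a, sum_b)

# 3) Hypothetical greedy choice between two tokens, where token 0's logit
#    comes from the reduction above and token 1's logit is fixed at 0.5.
def greedy_token(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

print(greedy_token([sum_a, 0.5]), greedy_token([sum_b, 0.5]))
```

Real logits rarely differ by this much between accumulation orders. The example only shows that, under the hypothesis, a greedy pick can change whenever two candidate tokens are close enough for the rounding difference to matter.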
This hypothesis is often repeated by others in the community.

While the concurrency + floating point hypothesis is not entirely wrong, it doesn't capture the full picture. Other factors contribute to nondeterminism in LLM inference:
To achieve more consistent results, consider the following strategies:
Nondeterminism in LLM inference is a multifaceted issue that extends beyond concurrency and floating-point arithmetic. By understanding the factors involved and applying practical mitigation strategies, practitioners can make their models' outputs more reproducible. As research continues, more robust solutions are likely to emerge.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
The Engineer, This Week's Edition, 11 September 2025