Understanding and Mitigating Nondeterminism in LLM Inference

The Engineer

11 Sept 2025

Why even strictly deterministic methods like greedy sampling fail to guarantee consistent LLM responses, and what that reveals about these complex systems.

Reproducibility is a cornerstone of scientific progress, but achieving it with large language models (LLMs) can be surprisingly challenging. Even when you ask an LLM like ChatGPT the same question multiple times, the results can vary. This isn't just due to the probabilistic nature of sampling: even greedy sampling, where the model always picks the highest-probability token and which is deterministic in principle, doesn't guarantee consistent outputs.

Why Greedy Sampling Isn’t Always Deterministic

At first glance, setting the temperature to 0 should make the LLM deterministic because it forces the model to choose the most probable token at each step. However, in practice, this isn't the case. Whether you're using an LLM API or running inference on your own hardware with open-source libraries like vLLM or SGLang, nondeterminism remains a significant issue.
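To make the intuition concrete, here is a minimal sketch of how temperature reshapes a token distribution. The logit values and the function names softmax_with_temperature and greedy_pick are made up for illustration and are not taken from any particular library. Dividing the logits by a smaller temperature concentrates the probability mass on the top token, and greedy selection is the limiting case of simply taking the argmax.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy sampling: index of the largest logit (the temperature-0 limit)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical scores for a three-token vocabulary.
logits = [1.0, 3.0, 2.5]

for temperature in (1.0, 0.5, 0.05):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 4) for p in probs])

print("greedy token index:", greedy_pick(logits))
```

On paper, the greedy choice depends only on the logits, so any run-to-run variation has to come from the logits themselves differing between runs.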

The Concurrency + Floating Point Hypothesis

One common explanation for this nondeterminism is the "concurrency + floating point" hypothesis. It holds that non-associative floating-point arithmetic, combined with concurrent execution on GPUs, produces different results depending on which core finishes its computation first. For instance, a recent arXiv preprint explains:

Floating-point arithmetic in GPUs exhibits non-associativity, meaning (a + b) + c ≠ a + (b + c) due to finite precision and rounding errors. This property directly impacts the computation of attention scores and logits in the transformer architecture, where parallel operations across multiple threads can yield different results based on execution order.
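The non-associativity described in this excerpt is easy to check on a CPU with ordinary Python floats (IEEE 754 double precision). The sketch below first compares the two groupings directly, then sums the same four terms in two different orders and feeds the result into a greedy choice. The magnitudes and the competing logit of 1.5 are deliberately exaggerated, made-up values chosen so that the effect is visible; differences in real models are far smaller and only matter when two logits are nearly tied.

```python
def sum_in_order(values):
    """Accumulate left to right, like a sequential reduction."""
    total = 0.0
    for value in values:
        total += value
    return total


def greedy_pick(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# 1) Grouping changes the result.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False
print((a + b) + c, a + (b + c))    # 0.6000000000000001 0.6

# 2) Order changes the result: same terms, two accumulation orders.
order_one = (1e16, 1.0, -1e16, 1.0)  # the first 1.0 is lost to rounding
order_two = (1e16, -1e16, 1.0, 1.0)  # the large terms cancel first
logit_one = sum_in_order(order_one)
logit_two = sum_in_order(order_two)
print(logit_one, logit_two)          # 1.0 2.0

# 3) A different sum can flip a greedy choice against a competing logit.
competitor = 1.5  # hypothetical logit of another token
print(greedy_pick([logit_one, competitor]))  # 1
print(greedy_pick([logit_two, competitor]))  # 0
```

Once a single token differs, every later step is conditioned on a different prefix, so a tiny numerical difference can grow into visibly different text.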

This hypothesis is often repeated by others in the community:

"There are speed tradeoffs, and in order to make the endpoints fast, GPUs are used, which do parallel [nondeterministic] calculations. Any modern GPU neural net calculations will be subject to these." (OpenAI Community)
"Because GPUs are highly parallelized, the ordering of additions or multiplications might be different on each execution, which can cascade into small differences in output." (Twitter)

Beyond Concurrency and Floating Point

While the concurrency + floating point hypothesis is not entirely wrong, it doesn't capture the full picture. Other factors contribute to nondeterminism in LLM inference:

Initial State Variability: The random seed or other initial state can differ between runs, which changes the output whenever any sampling or randomized step remains in the pipeline.
Library Implementation Differences: Different libraries and frameworks may order or combine floating-point operations differently, so the same model can produce slightly different values from one library or version to another.
Hardware-Specific Behavior: Even within the same type of hardware, differences in configuration, such as driver or firmware versions, can lead to non-reproducible results.
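Since library and hardware differences are on this list, it can help to record the software environment next to every result, so that two runs that disagree can be compared. The following is a minimal sketch using only the Python standard library. The function name environment_fingerprint and the default package list are assumptions for illustration. It does not capture GPU model or driver version, which would need vendor-specific tooling.

```python
import json
import platform
import sys
from importlib import metadata


def environment_fingerprint(packages=("torch", "vllm", "sglang", "numpy")):
    """Collect interpreter, OS, and package versions for later comparison.

    The package names are assumed distribution names; any package that is
    not installed is recorded as None instead of raising an error.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


# Store this alongside each model output, for example as a JSON sidecar.
print(json.dumps(environment_fingerprint(), indent=2))
```

Comparing two such records is a cheap first check before looking for numerical causes.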

Practical Steps for Mitigation

To achieve more consistent results, consider the following strategies:

Set a Fixed Random Seed: Ensure that all random number generators are initialized with the same seed.
Use Deterministic Libraries: Some libraries offer deterministic modes or settings. For example, vLLM and SGLang have options to enforce determinism.
Reduce Parallelism: Running inference in a single-threaded mode can help reduce variability, though it may come at the cost of performance.
Check for Library Updates: Update your libraries regularly to pick up bug fixes and improvements that enhance reproducibility, and record the versions in use so that runs remain comparable.
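Two of these strategies can be sketched with the standard library alone: fixing a seed, and checking whether repeated runs actually agree. The toy sampler below stands in for a model, and sample_tokens, count_distinct_outputs, and toy_generate are hypothetical names rather than any inference library's API. In a real setup, generate would wrap whatever call produces text from your model or endpoint.

```python
import hashlib
import random


def sample_tokens(seed, vocab, weights, length):
    """Draw tokens from a private generator initialized with a fixed seed."""
    rng = random.Random(seed)  # avoids sharing global random state
    return rng.choices(vocab, weights=weights, k=length)


def count_distinct_outputs(generate, prompt, runs=5):
    """Call generate(prompt) several times and count distinct outputs."""
    digests = set()
    for _ in range(runs):
        text = generate(prompt)
        digests.add(hashlib.sha256(text.encode("utf-8")).hexdigest())
    return len(digests)


vocab = ["a", "b", "c"]
weights = [0.5, 0.3, 0.2]

# Same seed, same draws.
first = sample_tokens(1234, vocab, weights, 10)
second = sample_tokens(1234, vocab, weights, 10)
print(first == second)  # True


def toy_generate(prompt):
    """Stand-in for a model call; it ignores the prompt and is seeded."""
    return " ".join(sample_tokens(1234, vocab, weights, 10))


# A result of 1 means every run produced identical text.
print(count_distinct_outputs(toy_generate, "hello"))  # 1
```

A count above 1 with a fixed seed and greedy settings suggests that the remaining variation comes from numerical or environment differences rather than from sampling.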

Conclusion

Nondeterminism in LLM inference is a multifaceted issue that extends beyond concurrency and floating-point arithmetic. By understanding the contributing factors and applying the mitigation strategies above, practitioners can make their results more reproducible. As research continues, more robust solutions should emerge.