
Researchers challenge the idea that larger language models yield diminishing returns, showing that small gains in single-step accuracy can compound into large improvements on long tasks, an effect that short-task benchmarks fail to capture.
Large language models (LLMs) have been a focal point in AI research, with continuous scaling leading to impressive gains in various benchmarks. However, a recent paper from Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, and Jonas Geiping challenges the notion that these gains are diminishing as models grow larger. The paper, "The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs," published at ICLR 2026, argues that short-task benchmarks can give a misleading impression of slowing progress. Instead, even marginal improvements in single-step accuracy can compound into significant gains when tasks are extended over multiple steps.
Short-Term vs. Long-Term Performance: The authors demonstrate that while larger models may show only slight improvements on short tasks, these incremental gains in single-step accuracy compound into exponential growth in the length of task a model can complete successfully. This is crucial because many real-world applications require models to handle complex, multi-step tasks.
Execution Failures and Self-Conditioning: One of the key insights is that failures in longer tasks are often due to execution errors rather than reasoning limitations. The paper introduces the concept of "self-conditioning," where models become more likely to make mistakes when their context includes errors from previous steps. This effect persists even with larger model sizes.
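The compounding argument can be made concrete with a little arithmetic. The sketch below is a simplification, not the paper's evaluation code: it assumes every step succeeds independently with the same accuracy and that the model never corrects itself, so a task of H steps succeeds with probability p^H. The function name `horizon_length` and the 50% success threshold are illustrative choices.

```python
import math


def horizon_length(step_accuracy: float, success_rate: float = 0.5) -> int:
    """Longest task (in steps) completed with probability >= success_rate.

    Assumes independent steps with constant accuracy p, so the chance of
    finishing H steps without error is p**H. Solving p**H >= s for H gives
    H <= ln(s) / ln(p).
    """
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must be strictly between 0 and 1")
    return math.floor(math.log(success_rate) / math.log(step_accuracy))


if __name__ == "__main__":
    for p in (0.9, 0.99, 0.999):
        print(f"step accuracy {p}: about {horizon_length(p)} steps")
```

Under these assumptions, step accuracies of 90%, 99%, and 99.9% correspond to horizons of 6, 68, and 692 steps. The last jump is less than one percentage point on a single-step benchmark, yet it lengthens the achievable task roughly tenfold.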
The researchers conducted a series of experiments to isolate and measure the execution capabilities of LLMs:
Single-Turn Accuracy: They found that while smaller models can achieve near-perfect accuracy in single-turn tasks, larger models significantly outperform them in multi-turn scenarios. This indicates that scaling model size is crucial for handling longer tasks.
Per-Step Accuracy Degradation: As the number of steps increases, the per-step accuracy of models tends to degrade. This degradation is not solely due to long-context limitations but is exacerbated by the self-conditioning effect.
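To see how self-conditioning could produce this kind of degradation, here is a toy simulation. It is an illustration only and does not reproduce the paper's experiments: the linear `penalty` per earlier mistake, the function name `simulate_step_accuracy`, and all parameter values are assumptions made for the example.

```python
import random


def simulate_step_accuracy(base_accuracy: float, penalty: float,
                           n_steps: int, n_runs: int = 2000,
                           seed: int = 0) -> list[float]:
    """Average accuracy at each step position across many simulated runs.

    Toy assumption: every earlier mistake left in the context lowers the
    chance of getting the next step right by `penalty` (floored at 0).
    With penalty == 0 the accuracy stays flat at base_accuracy.
    """
    rng = random.Random(seed)
    correct_at_step = [0] * n_steps
    for _ in range(n_runs):
        errors_so_far = 0
        for step in range(n_steps):
            accuracy = max(0.0, base_accuracy - penalty * errors_so_far)
            if rng.random() < accuracy:
                correct_at_step[step] += 1
            else:
                errors_so_far += 1  # the mistake stays in the context
    return [count / n_runs for count in correct_at_step]


if __name__ == "__main__":
    flat = simulate_step_accuracy(0.95, penalty=0.0, n_steps=100)
    conditioned = simulate_step_accuracy(0.95, penalty=0.02, n_steps=100)
    for step in (0, 24, 49, 99):
        print(f"step {step + 1}: flat={flat[step]:.3f} "
              f"self-conditioned={conditioned[step]:.3f}")
```

Both runs start from the same first-step accuracy, but in the self-conditioned run each error makes the next one more likely, so accuracy at later steps drifts downward even though nothing about the task itself has changed.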

The paper explores methods to mitigate self-conditioning and improve long-horizon execution:
To benchmark the effectiveness of these thinking mechanisms, the researchers tested frontier thinking models on various long-horizon tasks:
By focusing on execution capabilities, the paper offers a different perspective on the benefits of scaling LLMs. It challenges the common belief that larger models are subject to diminishing returns, arguing that gains which look marginal on short benchmarks can be large on long-horizon tasks. That matters for any application that depends on a model carrying out many steps reliably.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
15 September 2025