
Apple researchers challenge the capabilities of Large Reasoning Models by exploring their performance across varied problem complexities, revealing both strengths and unexpected limitations in advanced cognitive tasks.
In a recent paper presented at NeurIPS, a team of researchers from Apple delves into the strengths and limitations of Large Reasoning Models (LRMs) by examining their performance across different levels of problem complexity. The study, titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," sheds light on how these models handle reasoning tasks beyond simple benchmarks.
New Evaluation Paradigm: Traditional evaluations of LRMs often rely on established mathematical and coding benchmarks, which can suffer from data contamination and fail to capture the quality of internal reasoning traces. This paper introduces a novel approach using controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures.
Comprehensive Analysis: The researchers use these environments to analyze both final answers and intermediate reasoning traces, providing deeper insights into how LRMs "think." This is a significant step forward in understanding the true capabilities and limitations of these models.
Accuracy Collapse at High Complexity: Beyond a certain level of complexity, LRM accuracy collapses completely. This challenges the assumption that more sophisticated reasoning models will keep solving progressively harder problems.
Counter-Intuitive Scaling Limit: The reasoning effort of LRMs increases with problem complexity up to a point, then declines even though the token budget is adequate. The models spend less effort exactly where problems get harder, so the bottleneck is not the budget itself.
Three Performance Regimes: Comparing LRMs with standard LLMs, the study identifies three regimes: low-complexity tasks, where standard models outperform LRMs; medium-complexity tasks, where the additional thinking of LRMs gives them an advantage; and high-complexity tasks, where both kinds of model collapse.
Limitations in Exact Computation: LRMs struggle with explicit algorithms and show inconsistent reasoning across puzzles, indicating that they may not be as capable of exact computation as previously thought.
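To make "explicit algorithm" concrete, here is a minimal sketch using Tower of Hanoi, one of the puzzles used in the paper. The function name `hanoi_moves` and the peg numbering are illustrative assumptions, not the authors' code. The sketch shows that the procedure itself is a few lines of recursion, while the length of a correct solution grows as 2^n - 1 with the number of disks n, so a single dial controls how long an error-free move sequence must be.

```python
from typing import List, Tuple

# A move is (source peg, target peg); pegs are numbered 0, 1, 2 (an assumption).
Move = Tuple[int, int]


def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Move]:
    """Return the optimal move list for n disks using the classic recursion."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # park the n-1 smaller disks on aux
        + [(src, dst)]                       # move the largest disk to dst
        + hanoi_moves(n - 1, aux, src, dst)  # bring the smaller disks onto it
    )


# The optimal solution length is 2**n - 1, so it doubles with each added disk.
print([len(hanoi_moves(n)) for n in (3, 5, 10)])  # [7, 31, 1023]
```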

Controllable Puzzle Environments: The researchers designed puzzle environments where the complexity can be precisely controlled. This allows for a systematic investigation of how LRMs handle problems of varying difficulty.
Reasoning Trace Analysis: By analyzing both final answers and intermediate reasoning traces, the study provides a more nuanced understanding of model behavior. This includes examining the patterns of explored solutions and the computational effort required.
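The two methodological ideas above can be sketched as a small simulator, again assuming Tower of Hanoi as the puzzle. The `replay` function and its return convention are hypothetical and only illustrate the idea: because the rules are fixed and the number of disks sets the difficulty, any proposed move sequence (a final answer or a candidate found inside a reasoning trace) can be checked mechanically, and the position of the first illegal move can be recorded.

```python
from typing import List, Optional, Tuple

Move = Tuple[int, int]  # (source peg, target peg), pegs numbered 0..2


def replay(n_disks: int, moves: List[Move]) -> Tuple[Optional[int], bool]:
    """Replay moves on a fresh puzzle.

    Returns (index of the first illegal move or None, whether the puzzle is solved).
    """
    # Each peg is a stack; larger numbers are larger disks, top of stack is last.
    pegs = [list(range(n_disks, 0, -1)), [], []]
    for i, (src, dst) in enumerate(moves):
        legal = (
            src in (0, 1, 2) and dst in (0, 1, 2) and src != dst
            and bool(pegs[src])                                   # something to move
            and (not pegs[dst] or pegs[dst][-1] > pegs[src][-1])  # no larger-on-smaller
        )
        if not legal:
            return i, False
        pegs[dst].append(pegs[src].pop())
    solved = pegs[2] == list(range(n_disks, 0, -1))
    return None, solved


print(replay(2, [(0, 1), (0, 2), (1, 2)]))  # (None, True): legal and solved
print(replay(2, [(0, 1), (0, 1)]))          # (1, False): second move is illegal
print(replay(2, [(0, 1)]))                  # (None, False): legal but unfinished
```

Running the same check across increasing `n_disks` is one way to see where accuracy starts to fall; the thresholds reported in the study come from the authors' own experiments, not from this sketch.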
Model Selection: The findings suggest that practitioners should carefully consider the complexity of tasks when selecting models. For low-complexity tasks, standard LLMs might be sufficient, while LRMs could offer advantages in medium-complexity scenarios.
Resource Allocation: Given the counter-intuitive scaling limit, it's important to allocate resources efficiently. Beyond a certain point, additional tokens or computational power may not yield better results.
Algorithmic Reasoning: The limitations of LRMs in exact computation highlight the need for further research into how these models can be improved to handle explicit algorithms and consistent reasoning.
This study by Apple researchers clarifies where Large Reasoning Models help and where they fail. Its controllable-puzzle evaluation surfaces an accuracy collapse and a counter-intuitive scaling limit, findings that should guide both future research and the practical use of reasoning models.
Original Sources
↗ https://machinelearning.apple.com/research/illusion-of-thinking?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
6 June 2025