Unveiling the Limits of Reasoning Models: Insights from Apple's Latest Research

Models & Research

The Engineer

6 Jun 2025 · 3 min read

Apple researchers test Large Reasoning Models on problems of varying complexity and find a clear advantage on medium-complexity tasks, followed by a complete accuracy collapse on harder ones.

In a recent paper presented at NeurIPS, a team of researchers from Apple delves into the strengths and limitations of Large Reasoning Models (LRMs) by examining their performance across different levels of problem complexity. The study, titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," sheds light on how these models handle reasoning tasks beyond simple benchmarks.

What Changed Technically?

New Evaluation Paradigm: Traditional evaluations of LRMs often rely on established mathematical and coding benchmarks, which can suffer from data contamination and fail to capture the quality of internal reasoning traces. This paper introduces a novel approach using controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures.
Comprehensive Analysis: The researchers use these environments to analyze both final answers and intermediate reasoning traces, which shows how LRMs "think" rather than only whether they reach the right answer.
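
To make the idea of a controllable puzzle environment concrete, here is a minimal Python sketch. It assumes Tower of Hanoi as the puzzle, chosen only because its difficulty scales with a single parameter (the number of disks) while the rules stay fixed. The names HanoiEnv, first_failure and optimal_moves are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass, field


@dataclass
class HanoiEnv:
    """Tower of Hanoi where n_disks is the single complexity knob (illustrative)."""
    n_disks: int
    pegs: list = field(init=False)

    def __post_init__(self):
        # Peg 0 starts with every disk: largest at the bottom, disk 1 on top.
        self.pegs = [list(range(self.n_disks, 0, -1)), [], []]

    def step(self, src: int, dst: int) -> bool:
        """Apply one move; return False and change nothing if it is illegal."""
        if not (0 <= src < 3 and 0 <= dst < 3) or src == dst:
            return False
        if not self.pegs[src]:
            return False
        disk = self.pegs[src][-1]
        if self.pegs[dst] and self.pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def solved(self) -> bool:
        return len(self.pegs[2]) == self.n_disks


def first_failure(n_disks, moves):
    """Return None if the moves solve the puzzle, the index of the first
    illegal move otherwise, or len(moves) if all moves are legal but the
    puzzle is left unsolved."""
    env = HanoiEnv(n_disks)
    for i, (src, dst) in enumerate(moves):
        if not env.step(src, dst):
            return i
    return None if env.solved() else len(moves)


def optimal_moves(n, src=0, aux=1, dst=2):
    """Classic recursive solution with 2**n - 1 moves, used as a reference."""
    if n == 0:
        return []
    return (optimal_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + optimal_moves(n - 1, aux, src, dst))
```

Because the simulator checks every move, a model's answer can be scored step by step instead of only by its final state, and raising n_disks increases difficulty without changing the rules.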

Key Findings

Accuracy Collapse at High Complexity: Beyond a certain level of complexity, the accuracy of LRMs collapses completely. This finding challenges the notion that harder problems can always be solved with larger or more sophisticated models.
Counter-Intuitive Scaling Limit: The reasoning effort of LRMs increases with problem complexity up to a point, then declines even though the token budget is still adequate. This suggests a scaling limit in the models themselves: past that point, making more resources available does not improve performance.
Three Performance Regimes:
- Low-Complexity Tasks: Surprisingly, standard LLMs outperform LRMs in low-complexity tasks.
- Medium-Complexity Tasks: Additional thinking in LRMs provides a clear advantage over standard models.
- High-Complexity Tasks: Both LRMs and standard LLMs experience complete accuracy collapse.
Limitations in Exact Computation: LRMs struggle with explicit algorithms and show inconsistent reasoning across puzzles, indicating that they may not be as capable of exact computation as previously thought.
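
The three regimes and the effort peak described above can be read off per-complexity measurements. The sketch below is one hedged way to do that in Python. The function names, the regime labels, and the thresholds collapse_eps and margin are assumptions made for illustration; they are not values or definitions from the paper.

```python
def classify_regimes(acc_lrm, acc_llm, collapse_eps=0.05, margin=0.05):
    """Label each complexity level from two accuracy tables.

    acc_lrm / acc_llm map complexity level -> accuracy in [0, 1] for a
    reasoning model and a standard model. Thresholds are hypothetical.
    """
    regimes = {}
    for level in sorted(acc_lrm.keys() & acc_llm.keys()):
        lrm, llm = acc_lrm[level], acc_llm[level]
        if max(lrm, llm) <= collapse_eps:
            regimes[level] = "collapse"        # both models fail
        elif lrm - llm > margin:
            regimes[level] = "lrm_advantage"   # extra thinking pays off
        else:
            regimes[level] = "llm_sufficient"  # standard model matches or wins
    return regimes


def effort_peak(tokens_by_level):
    """Complexity level at which mean reasoning-token usage is highest."""
    return max(tokens_by_level, key=tokens_by_level.get)
```

Called on your own measurements, for example classify_regimes({1: 1.0, 5: 0.8, 10: 0.0}, {1: 1.0, 5: 0.5, 10: 0.0}) with made-up placeholder accuracies, it labels level 1 as llm_sufficient, level 5 as lrm_advantage and level 10 as collapse. If effort_peak falls below the highest complexity tested, that matches the pattern of reasoning effort declining before the token budget is exhausted.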

Methodology

Controllable Puzzle Environments: The researchers designed puzzle environments where the complexity can be precisely controlled. This allows for a systematic investigation of how LRMs handle problems of varying difficulty.
Reasoning Trace Analysis: By analyzing both final answers and intermediate reasoning traces, the study provides a more nuanced understanding of model behavior. This includes examining the patterns of explored solutions and the computational effort required.
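
As a rough illustration of trace analysis, the Python sketch below summarizes where candidate solutions appear inside a single reasoning trace. It assumes the candidates have already been extracted from the trace and checked by a puzzle simulator. The Candidate type, its fields, and summarize_trace are hypothetical names, and the paper's own pipeline may differ.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    token_offset: int  # token position in the trace where the candidate ends
    correct: bool      # verdict from a puzzle simulator (assumed to exist)


def summarize_trace(candidates, trace_len):
    """Summarize the candidate solutions explored in one reasoning trace."""
    if trace_len <= 0:
        raise ValueError("trace_len must be positive")
    ordered = sorted(candidates, key=lambda c: c.token_offset)
    first = next((c for c in ordered if c.correct), None)
    if first is None:
        return {"n_candidates": len(ordered),
                "first_correct_pos": None,
                "wrong_after_first_correct": 0}
    wrong_after = sum(1 for c in ordered
                      if c.token_offset > first.token_offset and not c.correct)
    return {
        "n_candidates": len(ordered),
        # Relative position (0 to 1) of the first correct candidate.
        "first_correct_pos": first.token_offset / trace_len,
        # Incorrect candidates explored after a correct one was already found.
        "wrong_after_first_correct": wrong_after,
    }
```

Aggregating these summaries per complexity level gives a simple view of both the pattern of explored solutions and the effort spent after a correct answer is already in hand.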

Implications for Practitioners

Model Selection: The findings suggest that practitioners should carefully consider the complexity of tasks when selecting models. For low-complexity tasks, standard LLMs might be sufficient, while LRMs could offer advantages in medium-complexity scenarios.
Resource Allocation: Given the counter-intuitive scaling limit, a larger token budget is not a reliable fix for hard problems. Beyond a certain point, additional tokens or computational power may not yield better results, so it is worth measuring accuracy against task complexity before raising budgets.
Algorithmic Reasoning: The limitations of LRMs in exact computation highlight the need for further research into how these models can be improved to handle explicit algorithms and consistent reasoning.

Conclusion

This study by Apple researchers maps where Large Reasoning Models help and where they fail. Its evaluation paradigm, built on controllable puzzle environments, surfaces findings that should inform both future research and the practical use of reasoning models in natural language processing.