
OpenAI's new o-series models still trail human performance on the ARC-AGI benchmarks, revealing both strengths and limitations as researchers push AI towards general intelligence.
By The Engineer, The Beat
The ARC Prize Foundation is a nonprofit organization dedicated to advancing the field of Artificial General Intelligence (AGI) by creating benchmarks that highlight the gap between human and AI capabilities. One of their primary tools is the ARC-AGI benchmark family, which helps us understand where the frontier of AI stands and how quickly it's advancing.
In this article, we look at how OpenAI’s latest o-series models, o3 and o4-mini, perform on the ARC-AGI benchmarks. These models follow earlier o-series releases such as o1 (OpenAI did not release a model named o2), each pushing the boundaries of what AI can do.
o3 Performance:
o4-mini Performance:
Both o3 and o4-mini often failed to return outputs when run at "high" reasoning settings. Partial high-reasoning results were recorded but excluded from the leaderboard due to insufficient coverage. This highlights a significant challenge in pushing these models to their limits.
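The exclusion rule described above can be illustrated with a minimal sketch. It assumes a run is scored only when the model returned an output for a large enough share of tasks. The `TaskResult` type, the function names, and the 0.9 threshold are all hypothetical, chosen for illustration; the article does not state the actual cutoff or how the ARC Prize Foundation implements this check.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskResult:
    task_id: str
    output: Optional[list]  # None when the model returned nothing for the task
    correct: bool = False


def coverage(results: List[TaskResult]) -> float:
    """Fraction of tasks for which the model returned any output."""
    if not results:
        return 0.0
    answered = sum(1 for r in results if r.output is not None)
    return answered / len(results)


def leaderboard_score(results: List[TaskResult],
                      min_coverage: float = 0.9) -> Optional[float]:
    """Return accuracy over all tasks, or None if coverage is too low.

    min_coverage is an assumed threshold, not a published value.
    """
    if coverage(results) < min_coverage:
        return None  # partial run: recorded, but not ranked
    solved = sum(1 for r in results if r.output is not None and r.correct)
    return solved / len(results)
```

Under these assumptions, a run that answers only half of the tasks is recorded but receives no leaderboard score, however accurate its answers were.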
The ARC-AGI suite of benchmarks is a valuable tool for measuring the performance of leading Large Language Models (LLMs) and Large Reasoning Models (LRMs). Specifically:

These scores indicate that o3 is capable of handling a significant portion of the tasks in ARC-AGI-1, but it still struggles with more complex reasoning required by ARC-AGI-2.
o4-mini shows promise, particularly in the medium configuration, which achieves state-of-the-art efficiency. However, it also falls short on ARC-AGI-2.
Despite recent advancements, ARC-AGI-2 remains unsolved by even the best versions of o3 and o4-mini, with scores below 3%. This benchmark continues to serve as a robust challenge for AI models, pushing them to extend their reasoning capabilities. By using both ARC-AGI-1 and ARC-AGI-2, researchers can gain a more comprehensive understanding of model performance and efficiency.
The evaluation of o3 and o4-mini on the ARC-AGI benchmarks provides valuable insights into the current state of AI reasoning. While these models have made significant strides, they still face challenges in handling more complex tasks. The ARC-AGI suite remains a crucial tool for guiding future research and development in AGI.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
24 April 2025