OpenAI's o3 and o4-mini Evaluated on ARC-AGI Benchmarks


The Engineer

24 Apr 2025 · 3 min read

OpenAI's new o-series models make solid progress on ARC-AGI-1 but score under 3% on ARC-AGI-2, revealing both strengths and limitations as researchers push AI toward general intelligence.

By The Engineer, The Beat

The ARC Prize Foundation is a nonprofit organization dedicated to advancing the field of Artificial General Intelligence (AGI) by creating benchmarks that highlight the gap between human and AI capabilities. One of their primary tools is the ARC-AGI benchmark family, which helps us understand where the frontier of AI stands and how quickly it's advancing.

In this article, we examine how OpenAI's latest o-series models, o3 and o4-mini, perform on the ARC-AGI benchmarks. These models build on earlier o-series releases such as o1 (OpenAI never released a model named o2), each pushing the boundaries of what AI can do.

Key Findings

o3 Performance:
- ARC-AGI-1:
  - o3-low scored 41% on the ARC-AGI-1 Semi Private Eval set.
  - o3-medium reached 53%.
- ARC-AGI-2:
  - Both o3-low and o3-medium scored under 3%.

o4-mini Performance:
- ARC-AGI-1:
  - o4-mini-low scored 21%.
  - o4-mini-medium reached 41%, achieving state-of-the-art levels of efficiency.
- ARC-AGI-2:
  - Both o4-mini-low and o4-mini-medium scored under 3%.
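
The ARC-AGI-1 figures above can be tabulated to make the effect of the reasoning setting explicit. The following is a minimal sketch, not an official scoring script: the dictionary layout and the helper names (`medium_gain`, `summarize`) are assumptions made for illustration, and the only data used are the percentages reported in this article.

```python
# ARC-AGI-1 scores (percent) exactly as reported in this article.
# The nested-dict layout is an illustrative choice, not an official format.
ARC_AGI_1 = {
    "o3": {"low": 41, "medium": 53},
    "o4-mini": {"low": 21, "medium": 41},
}


def medium_gain(scores: dict) -> int:
    """Percentage-point gain when moving from the low to the medium setting."""
    return scores["medium"] - scores["low"]


def summarize(table: dict) -> dict:
    """Map each model name to its low-to-medium gain in percentage points."""
    return {model: medium_gain(scores) for model, scores in table.items()}


if __name__ == "__main__":
    for model, gain in summarize(ARC_AGI_1).items():
        print(f"{model}: +{gain} points from low to medium")
```

On these numbers, o3 gains 12 points and o4-mini gains 20 points when moving from low to medium. ARC-AGI-2 is left out of the table because the article gives only an upper bound (under 3%) rather than exact scores.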

Incomplete Coverage with High Reasoning

Both o3 and o4-mini often failed to return outputs when run at "high" reasoning settings. Partial high-reasoning results were recorded but excluded from the leaderboard due to insufficient coverage, so this article reports no high-reasoning scores for either model.
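
To illustrate what a coverage-based exclusion could look like, here is a rough sketch. Everything in it is an assumption: the article does not say how coverage is computed or what cutoff applies, so `MIN_COVERAGE`, the function names, and the toy run are hypothetical and do not represent the ARC Prize Foundation's actual procedure.

```python
from typing import Optional, Sequence

# Hypothetical cutoff: the article does not state the real threshold.
MIN_COVERAGE = 0.9


def coverage(outputs: Sequence[Optional[str]]) -> float:
    """Fraction of tasks for which the model returned any output at all."""
    if not outputs:
        return 0.0
    returned = sum(output is not None for output in outputs)
    return returned / len(outputs)


def leaderboard_eligible(
    outputs: Sequence[Optional[str]], min_coverage: float = MIN_COVERAGE
) -> bool:
    """A run qualifies only if enough tasks produced an output."""
    return coverage(outputs) >= min_coverage


# Made-up run for illustration: None marks a task with no returned output.
toy_run = ["answer", None, "answer", None, None]

if __name__ == "__main__":
    print(f"coverage: {coverage(toy_run):.0%}")
    print(f"eligible: {leaderboard_eligible(toy_run)}")
```

The point of separating coverage from accuracy is that a score computed only over the tasks that returned an output would not be comparable with scores from runs that attempted every task.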

ARC-AGI as a Tool for Evaluation

The ARC-AGI suite of benchmarks is a valuable tool for measuring the performance of leading Large Language Models (LLMs) and Large Reasoning Models (LRMs). Specifically:

ARC-AGI-1:
- Provides a broader range of task difficulty.
- Enables direct comparison with previous model versions, such as o3-preview from December 2024.

ARC-AGI-2:
- Introduced in March 2025, it builds on ARC-AGI-1 with more complex tasks that require deeper abstraction and symbolic interpretation.
- Currently serves as a more sensitive tool for evaluating model performance.

Performance Analysis

o3 on ARC-AGI-1

- o3-low: 41% accuracy
- o3-medium: 53% accuracy

These scores show that o3 solves a substantial share of ARC-AGI-1 tasks, with the medium setting adding 12 percentage points over low. On ARC-AGI-2, which demands more complex reasoning, both settings stay under 3%.

o4-mini on ARC-AGI-1

- o4-mini-low: 21% accuracy
- o4-mini-medium: 41% accuracy

o4-mini shows promise, particularly at the medium setting, where it reaches 41% with state-of-the-art efficiency. It also falls short on ARC-AGI-2, scoring under 3% at both settings.

High Reasoning Settings

Both models frequently failed to return outputs when run at high reasoning settings. Partial results were recorded but excluded from the leaderboard due to insufficient coverage.

Implications

Despite recent advances, ARC-AGI-2 remains unsolved: every tested configuration of o3 and o4-mini scored below 3%. The benchmark continues to serve as a robust challenge for AI models, pushing them to extend their reasoning capabilities. Using ARC-AGI-1 and ARC-AGI-2 together gives researchers a more comprehensive picture of model performance and efficiency.

Conclusion

The evaluation of o3 and o4-mini on the ARC-AGI benchmarks provides valuable insights into the current state of AI reasoning. While these models have made significant strides, they still face challenges in handling more complex tasks. The ARC-AGI suite remains a crucial tool for guiding future research and development in AGI.
