GPT-5.5 Surprises with Top Score in Agents' Last Exam Benchmark

The Engineer

15 Jun 2026 · 2 min read

In a surprising turn of events, OpenAI's GPT-5.5 outperforms Anthropic's Claude Fable 5 on the rigorous new ALE benchmark, designed to measure real-world professional workflows.

Researchers from the University of California, Berkeley's Center for Responsible, Decentralized Intelligence (RDI) have launched Agents' Last Exam (ALE), a demanding new benchmark aimed at assessing whether AI can execute economically valuable, long-horizon professional tasks. In an unexpected twist, OpenAI’s GPT-5.5, operating through the Codex harness, secured the top spot on the ALE Leaderboard with a 24.0% pass rate. This beat Anthropic's highly anticipated Claude Fable 5 model, which came in third with a score of 22.0%.

The New Benchmark: ALE

Unlike traditional benchmarks that test models on isolated coding puzzles or narrow text-based environments, ALE is designed to bridge the gap between academic hype and real-world labor impact. The benchmark evaluates AI models across five functional layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate).

Brain (Reasoning): The model's ability to reason through complex problems and make logical decisions.
Eyes (Visual Perception): Visual recognition capabilities, essential for navigating graphical user interfaces (GUIs).
Body (Orchestration): Coordinating tasks across multiple tools and environments.
Hands (Tool Invocation): Executing specific commands or actions within software applications.
Feet (Runtime Substrate): The underlying infrastructure that supports the model’s operations.
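
As a rough illustration of how the five layers could be used to tag the steps of a task, here is a minimal Python sketch. Only the layer names come from the benchmark description above; the `Step` type, the helper functions, and the example task are hypothetical and are not part of ALE.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """The five functional layers named in the article."""
    BRAIN = "reasoning"
    EYES = "visual perception"
    BODY = "orchestration"
    HANDS = "tool invocation"
    FEET = "runtime substrate"


@dataclass(frozen=True)
class Step:
    """One step of a hypothetical task, tagged with the layer it exercises."""
    description: str
    layer: Layer


def layers_exercised(steps):
    """Return the set of layers touched by a list of steps."""
    return {step.layer for step in steps}


def missing_layers(steps):
    """Return the layers a task never exercises, in declaration order."""
    covered = layers_exercised(steps)
    return [layer for layer in Layer if layer not in covered]


# Hypothetical example task (an assumption for illustration, not taken from ALE).
example_task = [
    Step("Plan how to reconcile two spreadsheets", Layer.BRAIN),
    Step("Locate the Export button in the GUI", Layer.EYES),
    Step("Run a conversion script in the shell", Layer.HANDS),
]

if __name__ == "__main__":
    for layer in missing_layers(example_task):
        print(f"Not exercised: {layer.name} ({layer.value})")
```

A tagging scheme like this makes it easy to see which layers a given task leaves untested, which is the kind of gap a multi-layer evaluation is meant to expose.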

To pass ALE, an agent must navigate Linux or Windows virtual machines, interleaving shell scripting with point-and-click operations inside heavy desktop software. This comprehensive evaluation ensures that models are not just executing terminal commands but can handle a wide range of tasks in a realistic professional setting.
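
To make "interleaving" concrete, the sketch below models an agent's action trace as a mix of shell commands and GUI clicks and counts how often the agent switches between the two. The action types, field names, and sample trace are all assumptions made for illustration; the article does not describe ALE's actual trace format.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class ShellAction:
    """A command run in a terminal (hypothetical representation)."""
    command: str


@dataclass(frozen=True)
class ClickAction:
    """A point-and-click GUI operation (hypothetical representation)."""
    x: int
    y: int
    target: str


Action = Union[ShellAction, ClickAction]


def uses_both_modes(trace: List[Action]) -> bool:
    """True if the trace contains at least one shell and one GUI action."""
    has_shell = any(isinstance(a, ShellAction) for a in trace)
    has_click = any(isinstance(a, ClickAction) for a in trace)
    return has_shell and has_click


def count_mode_switches(trace: List[Action]) -> int:
    """Count how often consecutive actions change between shell and GUI."""
    switches = 0
    for previous, current in zip(trace, trace[1:]):
        if type(previous) is not type(current):
            switches += 1
    return switches


# Hypothetical trace: shell, GUI, GUI, shell.
sample_trace: List[Action] = [
    ShellAction("ls ~/projects"),
    ClickAction(120, 48, "File menu"),
    ClickAction(140, 96, "Export..."),
    ShellAction("file ~/projects/export.pdf"),
]

if __name__ == "__main__":
    print("Uses both modes:", uses_both_modes(sample_trace))
    print("Mode switches:", count_mode_switches(sample_trace))
```

A terminal-only agent would produce a trace with zero mode switches, which is the pattern the benchmark description says is not enough to pass.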

Key Takeaways

GPT-5.5's Success: OpenAI’s GPT-5.5, released in April, demonstrated strong performance on ALE, achieving a 24.0% pass rate.
Claude Fable 5's Performance: Anthropic’s Claude Fable 5, launched just the day before, scored 22.0%, placing third.
Benchmark Rigor: ALE is designed to be more rigorous than previous benchmarks, focusing on real-world professional workflows rather than isolated tasks.
Evaluation Architecture: ALE's GCUA framework evaluates models across the five functional layers listed above, making it harder for them to "cheat" or rely on shortcuts.
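
For readers who want to put the two reported scores side by side, the short Python sketch below works through the arithmetic. The 24.0% and 22.0% figures come from the article; the task counts passed to `pass_rate` are hypothetical, since the article does not say how many tasks ALE contains.

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage of tasks completed successfully."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def margin_points(score_a: float, score_b: float) -> float:
    """Difference between two scores, in percentage points."""
    return round(score_a - score_b, 1)


# Scores reported in the article.
GPT_5_5_SCORE = 24.0
CLAUDE_FABLE_5_SCORE = 22.0

# Absolute lead in percentage points.
margin = margin_points(GPT_5_5_SCORE, CLAUDE_FABLE_5_SCORE)

# Lead relative to the lower score, as a percentage.
relative_lead = 100.0 * margin / CLAUDE_FABLE_5_SCORE

# Share of tasks the top-ranked agent did not pass.
top_fail_rate = 100.0 - GPT_5_5_SCORE

if __name__ == "__main__":
    print(f"Margin: {margin} percentage points")
    print(f"Relative lead: {relative_lead:.1f}%")
    print(f"Top agent failure rate: {top_fail_rate}%")
    # Hypothetical counts, chosen only to show how the formula is applied.
    print(f"Example: 6 of 25 tasks passed = {pass_rate(6, 25)}%")
```

The gap is 2.0 percentage points, and even the leading score implies that about three in four tasks were not passed.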

GPT-5.5's top score on ALE points to relative strength in complex, multi-step tasks, though a 24.0% pass rate means the leading agent still failed roughly three of every four tasks. Anthropic’s Claude Fable 5 is a powerful model, but its 22.0% score likewise suggests that significant challenges remain before AI systems achieve truly agentic capabilities.

As the field continues to evolve, benchmarks like ALE will play a crucial role in driving progress and ensuring that AI models can deliver real-world value.

Tags

gpt-5.5, claude-fable-5, benchmark-test, ai-model-comparison, agents-challenge

Original Sources

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

venturebeat.com · @venturebeat · 10 June 2026

https://venturebeat.com/technology/surprise-upset-gpt-5-5-beats-claude-fable-5-on-brutal-new-agents-last-exam-benchmark

About the author

The Engineer

Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.