
In a surprising turn of events, OpenAI's GPT-5.5 outperforms Anthropic's Claude Fable 5 on the rigorous new ALE benchmark, designed to measure real-world professional workflows.
Researchers at the University of California, Berkeley's Center for Responsible, Decentralized Intelligence (RDI) have launched Agents' Last Exam (ALE), a demanding new benchmark that assesses whether AI agents can execute economically valuable, long-horizon professional tasks. OpenAI's GPT-5.5, running through the Codex harness, took the top spot on the ALE leaderboard with a 24.0% pass rate. That put it two percentage points ahead of Anthropic's highly anticipated Claude Fable 5, which placed third at 22.0%.
Unlike traditional benchmarks that test models on isolated coding puzzles or narrow text-based environments, ALE is designed to bridge the gap between academic hype and real-world labor impact. The benchmark evaluates AI models across five functional layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate).
To pass ALE, an agent must navigate Linux or Windows virtual machines, interleaving shell scripting with point-and-click operations inside heavy desktop software. The evaluation therefore tests more than terminal-command execution: an agent must also drive graphical applications the way a professional user would.

GPT-5.5's first-place finish points to its strength on complex, multi-step tasks, but the absolute numbers are low: a 24.0% pass rate means even the top-ranked agent fails most of what it attempts. Claude Fable 5 is a powerful model, yet its 22.0% result tells the same story, suggesting that significant challenges remain before AI systems achieve truly agentic capabilities.
As the field continues to evolve, benchmarks like ALE will play a crucial role in driving progress and ensuring that AI models can deliver real-world value.
Original Sources
Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark
https://venturebeat.com/technology/surprise-upset-gpt-5-5-beats-claude-fable-5-on-brutal-new-agents-last-exam-benchmark
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
15 June 2026
© 2026 Cedar & Bloom. All rights reserved.