
AC-Small's jump on the APEX-Agents leaderboard was no fluke: the model also performed well on three additional held-out industry benchmarks, which suggests that post-training produced a genuine, broad improvement in capability.
In a previous post, we detailed the creation of AC-Small by post-training GLM-4.7 on an agentic dev set. This move catapulted AC-Small from 17th to 4th place on the APEX-Agents leaderboard. The natural follow-up question was whether these improvements were specific to that held-out benchmark or whether they represented a genuine enhancement in the model's capabilities.
To answer this, we tested AC-Small on three held-out industry benchmarks: Toolathlon, APEX, and GDPVal. Here’s what we found:
GDPVal is a comprehensive benchmark that measures professional work across 44 occupations and 9 sectors, evaluated by subject-matter experts. It serves as the closest held-out analog to APEX-Agents, providing an end-to-end test of whether the gains from training on APEX-Agents data generalize to out-of-distribution agentic tasks.
If placed on OpenAI’s official GDPVal leaderboard, AC-Small would rank 5th.

Toolathlon evaluates language agents on multi-step workflows across 32 software applications and 604 tools, requiring coordination over an average of 20 tool-calling turns. This benchmark isolates tool-use fluency as a key capability.
APEX evaluates non-agentic professional reasoning across four domains, providing a broader measure of the model’s general capabilities.
The improvement on GDPVal can largely be attributed to better tool use across the board.
The results from testing AC-Small on these held-out benchmarks provide strong evidence that the improvements seen on the APEX-Agents leaderboard are not just specific to that dataset. Instead, they represent a genuine enhancement in the model's capabilities, particularly in tool use and professional reasoning. This bodes well for the future of agentic training methods and their potential applications in real-world scenarios.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
2 April 2026