AC-Small Shows Strong Generalization After Training on APEX-Agents Dev Set

Models & Research

The Engineer

2 Apr 2026

AC-Small's jump on the APEX-Agents leaderboard was no fluke: the model also improved on three additional industry benchmarks, evidence that post-training produced a genuine gain in capability.

In a previous post, we detailed the creation of AC-Small by post-training GLM-4.7 on an agentic dev set. That training moved AC-Small from 17th to 4th place on the APEX-Agents leaderboard. The natural follow-up question was whether these improvements were specific to APEX-Agents or represented a genuine enhancement in the model's capabilities.

To answer this, we tested AC-Small on three held-out industry benchmarks: Toolathalon, APEX, and GDPVal. Here’s what we found:

Toolathalon: +8.0 points
APEX: +5.7 points
GDPVal: +7.7 percentage points

Results Across Benchmarks

GDPVal: Improved Performance on Economically Valuable Work

GDPVal is a comprehensive benchmark that measures professional work across 44 occupations and 9 sectors, evaluated by subject-matter experts. It serves as the closest held-out analog to APEX-Agents, providing an end-to-end test of whether the gains from training on APEX-Agents data generalize to out-of-distribution agentic tasks.

Before: AC-Small had a win+tie rate of 55.0%.
After: The rate rose to 62.7%, a gain of 7.7 percentage points.
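
The 7.7 figure is a difference in percentage points, not a relative change. A minimal Python sketch of the arithmetic, using only the two rates quoted above; the function names are illustrative and not part of any GDPVal tooling:

```python
def percentage_point_gain(before: float, after: float) -> float:
    """Absolute difference between two rates expressed in percent."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Change relative to the starting rate, in percent."""
    return round((after - before) / before * 100, 1)


# Win+tie rates quoted in this post.
BEFORE, AFTER = 55.0, 62.7

print(percentage_point_gain(BEFORE, AFTER))  # 7.7
print(relative_gain(BEFORE, AFTER))          # 14.0
```

Read this way, the gain is roughly 14% relative to the starting rate.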

If placed on OpenAI’s official GDPVal leaderboard, AC-Small would rank 5th, behind:

GPT-5.4 (83.0%)
GPT-5.4 Pro (82.0%)
GPT-5.2 Pro (74.1%)
GPT-5.2 (70.9%)

And ahead of:

Claude Opus 4.5 (59.6%)
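
The placement above follows from counting how many listed scores exceed 62.7%. The sketch below assumes only the five entries quoted in this post, not the full leaderboard, and `rank_among` is a hypothetical helper introduced for illustration:

```python
# Scores quoted in this post (win+tie rate, percent). Not the full leaderboard.
QUOTED_SCORES = {
    "GPT-5.4": 83.0,
    "GPT-5.4 Pro": 82.0,
    "GPT-5.2 Pro": 74.1,
    "GPT-5.2": 70.9,
    "Claude Opus 4.5": 59.6,
}


def rank_among(score: float, board: dict[str, float]) -> int:
    """1-based rank a new score would take; ties share the better rank."""
    return 1 + sum(1 for existing in board.values() if existing > score)


print(rank_among(62.7, QUOTED_SCORES))  # 5
```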

Toolathalon: Enhanced Tool Use Fluency

Toolathalon evaluates language agents on multi-step workflows across 32 software applications and 604 tools, requiring coordination over an average of 20 tool-calling turns. This benchmark isolates tool-use fluency as a key capability.

Before: The score prior to training on the APEX-Agents dev set serves as the baseline.
After: The score rose by 8.0 points, suggesting a better ability to handle complex multi-step tasks that span many tools.

APEX: Non-Agentic Professional Reasoning

APEX evaluates non-agentic professional reasoning across four domains, providing a broader measure of the model’s general capabilities.

Before: The score prior to training on the APEX-Agents dev set serves as the baseline.
After: The score rose by 5.7 points, suggesting that training on agentic tasks also strengthened non-agentic professional reasoning.

What Drives the GDPVal Gain?

The improvement in GDPVal can largely be attributed to better tool use across the board. Here’s a breakdown:

Enhanced Tool Coordination: AC-Small demonstrated improved coordination across platforms, which is crucial for handling multi-step workflows.
Fluency in Tool Usage: The model showed greater fluency in using a wide range of tools, reducing errors and increasing efficiency.
Adaptability to New Tasks: The training on agentic tasks likely enhanced the model’s ability to adapt to new, unseen professional scenarios.

Conclusion

The results on these held-out benchmarks provide strong evidence that the improvements seen on the APEX-Agents leaderboard are not specific to that dataset. They reflect a genuine enhancement in the model's capabilities, particularly in tool use and professional reasoning. This bodes well for agentic training methods and their application to real-world work.