Scale Labs Unveils Comprehensive AI Leaderboard for Frontier, Agentic, and Safety Capabilities

The Engineer

30 May 2024

Scale Labs has introduced a leaderboard that evaluates AI models from leading labs and open-source contributors on agentic coding, frontier reasoning, and safety alignment.

Testing the Limits of AI with Robust Benchmarks

Scale Labs has launched a leaderboard that evaluates the latest AI models from OpenAI, Anthropic, Google, and Meta, as well as from open-source contributors. The leaderboard focuses on three areas: frontier capabilities, agentic capabilities, and safety alignment.

What Changed Technically

The new leaderboard introduces more than 20 benchmarks across three categories:

Agentic Coding: Evaluates deep code comprehension and reasoning.
Frontier Reasoning: Assesses models' ability to handle complex, long-horizon tasks.
Safety Alignment: Assesses whether models can operate safely in real-world scenarios.

These benchmarks give practitioners a standardized way to compare models across several dimensions. The field is evolving rapidly, so knowing where each model is strong or weak is essential when choosing one for a practical application.

Key Benchmarks and Results

SWE Atlas - Codebase QnA

Description: Evaluates deep code comprehension and reasoning.
Top Performers:
- gpt-5.4 (xHigh) (Codex CLI): 35.48±8.70
- claude-opus-4.6 Thinking (Claude Code Harness): 31.50±8.62
- gpt-5.2-2025-12-11 (High) (SWE-Agent): 29.03±8.53
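
The ± values affect how these rankings should be read. The article does not say how they are computed, so the following is only a minimal sketch that assumes each ± is a symmetric margin around the reported score. It checks whether the ranges of two entries overlap. The helper names `intervals_overlap` and `overlapping_pairs` are hypothetical, chosen for illustration.

```python
def intervals_overlap(score_a, margin_a, score_b, margin_b):
    """Return True if the ranges score ± margin share at least one point."""
    low_a, high_a = score_a - margin_a, score_a + margin_a
    low_b, high_b = score_b - margin_b, score_b + margin_b
    return low_a <= high_b and low_b <= high_a


def overlapping_pairs(entries):
    """List every pair of (name, score, margin) entries whose ranges overlap."""
    pairs = []
    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            name_a, score_a, margin_a = entries[i]
            name_b, score_b, margin_b = entries[j]
            if intervals_overlap(score_a, margin_a, score_b, margin_b):
                pairs.append((name_a, name_b))
    return pairs


# (model, score, margin) copied from the SWE Atlas - Codebase QnA list above.
# Assumption: the margin is symmetric around the score.
swe_atlas = [
    ("gpt-5.4 (xHigh) (Codex CLI)", 35.48, 8.70),
    ("claude-opus-4.6 Thinking (Claude Code Harness)", 31.50, 8.62),
    ("gpt-5.2-2025-12-11 (High) (SWE-Agent)", 29.03, 8.53),
]

if __name__ == "__main__":
    for name_a, name_b in overlapping_pairs(swe_atlas):
        print(f"{name_a} and {name_b}: ranges overlap")
```

Under this assumption, all three pairs in the list above overlap, so the ordering of the top three is best read with caution. Overlapping ranges are only a rough heuristic, not a formal significance test.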

MCP Atlas

Description: Evaluates real-world tool use through the Model Context Protocol (MCP).
Top Performers:
- claude-opus-4-5-20251101: 62.30±1.76
- gpt-5.2-2025-12-11 (NEW): 60.57±1.62
- gemini-3-flash-preview (NEW): 57.40±1.48

SWE-Bench Pro (Public Dataset)

Description: Evaluates long-horizon software engineering tasks in public open source repositories.
Top Performers:
- claude-opus-4-5-20251101: 45.89±3.60
- claude-4-5-Sonnet: 43.60±3.60
- gemini-3-pro-preview: 43.30±3.60

SWE-Bench Pro (Private Dataset)

Description: Evaluates long-horizon software engineering tasks in commercial-grade private repositories.
Top Performers:
- gpt-5.2-2025-12-11 (NEW): 23.81±5.09
- claude-opus-4-5-20251101 (NEW): 23.44±5.07
- gemini-3-pro-preview (NEW): 17.95±4.78
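
Two models appear in both the public and private SWE-Bench Pro lists above, which allows a simple side-by-side comparison. The sketch below computes the absolute and relative gap between the two scores for each shared model. It uses only the top-three entries quoted in this article and ignores the ± margins. The function name `score_gaps` is hypothetical.

```python
# Scores copied from the two SWE-Bench Pro lists above (± margins omitted).
public_scores = {
    "claude-opus-4-5-20251101": 45.89,
    "claude-4-5-Sonnet": 43.60,
    "gemini-3-pro-preview": 43.30,
}
private_scores = {
    "gpt-5.2-2025-12-11": 23.81,
    "claude-opus-4-5-20251101": 23.44,
    "gemini-3-pro-preview": 17.95,
}


def score_gaps(public, private):
    """For models present in both dicts, return (absolute gap, relative gap).

    The relative gap is the absolute gap divided by the public score.
    """
    gaps = {}
    for model in public.keys() & private.keys():
        absolute = public[model] - private[model]
        gaps[model] = (absolute, absolute / public[model])
    return gaps


if __name__ == "__main__":
    for model, (absolute, relative) in sorted(score_gaps(public_scores, private_scores).items()):
        print(f"{model}: {absolute:.2f} points lower on private ({relative:.1%} relative)")
```

For the two shared models, the private-dataset scores are roughly 22 and 25 points lower than the public-dataset scores. The two datasets contain different repositories, so this gap is descriptive only and does not isolate a single cause.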

SciPredict

Description: Forecasting scientific experiment outcomes.
Top Performers:
- gemini-3-pro-preview: 25.27±1.92
- claude-opus-4-5-20251101: 23.05±0.51
- claude-opus-4-1-20250805: 22.22±1.48

Humanity's Last Exam

Description: Challenging LLMs at the frontier of human knowledge.
Top Performers:
- gpt-5.4-pro-2026-03-05 (NEW): 44