
SWE-bench is widely read as a gauge of AI coding prowess, but it measures something much narrower, and that gap shows how current benchmarking practices fail to capture the complexity of real-world software development.
While working on StoryMachine, an experiment aimed at breaking down software tasks into agent-executable units, I had the chance to dive deep into popular coding benchmarks. These benchmarks are often marketed as comprehensive measures of a model’s ability to write code, but they actually measure something much narrower. This is why, for instance, Claude scoring 80% on SWE-bench doesn’t mean it can one-shot 80% of the tasks I throw at it.
Let's break down what these benchmarks are really measuring and why that matters.
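SWE-bench measures whether a coding agent, given a real-world GitHub issue and the repository it was filed against, can produce a patch that passes the tests associated with that issue. The issues are drawn from a small set of open-source Python repositories. This is a crucial but specific task, and it’s important to understand its scope and limitations.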
SWE-bench Verified, a human-validated subset of SWE-bench, is a valuable benchmark for evaluating how well an AI can handle this kind of task, but it’s not a comprehensive measure of overall coding ability. It focuses on patching existing code until a fixed set of tests passes, which is important but represents only a small part of what software development involves.
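The pass/fail criterion can be made concrete with a small sketch. SWE-bench task instances list two groups of tests: those that fail before the fix and must pass after it (`FAIL_TO_PASS`), and those that already pass and must keep passing (`PASS_TO_PASS`). The Python below illustrates that rule only; it is not the official evaluation harness. `TaskInstance`, `is_resolved`, and `resolved_rate` are hypothetical names, and the instance and test IDs are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """One benchmark task: an issue plus the tests used to judge a patch."""
    instance_id: str
    fail_to_pass: frozenset[str]  # tests that fail before the fix, must pass after
    pass_to_pass: frozenset[str]  # tests that already pass, must not break


def is_resolved(task, passed_tests):
    """A task counts as resolved only if every required test passes."""
    required = task.fail_to_pass | task.pass_to_pass
    return required <= passed_tests


def resolved_rate(tasks, passed_by_instance):
    """Fraction of tasks resolved; a missing result counts as unresolved."""
    if not tasks:
        return 0.0
    resolved = sum(
        is_resolved(task, passed_by_instance.get(task.instance_id, frozenset()))
        for task in tasks
    )
    return resolved / len(tasks)


# Hypothetical data: two tasks and the tests that passed after each patch.
tasks = [
    TaskInstance("demo-1", frozenset({"test_bug"}), frozenset({"test_existing"})),
    TaskInstance("demo-2", frozenset({"test_crash"}), frozenset({"test_io"})),
]
passed = {
    "demo-1": {"test_bug", "test_existing"},  # fix works, nothing broke
    "demo-2": {"test_crash"},                 # fix works but broke test_io
}
print(resolved_rate(tasks, passed))  # 0.5
```

Note what the rule does not check: whether the patch is readable, whether it resembles the fix the maintainers shipped, or whether the tests fully cover the issue. The headline percentage is the output of `resolved_rate` over the benchmark's own task set, nothing more.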
Aider Polyglot evaluates a model's ability to solve self-contained programming exercises in several languages: C++, Go, Java, JavaScript, Python, and Rust. Each exercise uses a single language, so the benchmark measures breadth across languages rather than the ability to work across languages within one project, which is what many real-world codebases demand.
Aider Polyglot covers more languages than SWE-bench Verified, but it still focuses on narrow, well-specified tasks. It shows how a model's performance holds up from one language to the next, but it doesn’t capture the complexity of real-world software development.
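A single headline pass rate on a multi-language benchmark can hide how uneven a model is from one language to the next. The sketch below, using invented results rather than real benchmark data, breaks an overall pass rate down by language. The `pass_rates` helper and the `(language, passed)` record shape are assumptions made for illustration.

```python
from collections import defaultdict


def pass_rates(results):
    """Return (overall_rate, per_language_rates) from (language, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for language, passed in results:
        totals[language] += 1
        passes[language] += int(passed)
    per_language = {lang: passes[lang] / totals[lang] for lang in totals}
    total = sum(totals.values())
    overall = sum(passes.values()) / total if total else 0.0
    return overall, per_language


# Invented results for illustration only.
results = [
    ("python", True), ("python", True), ("python", True), ("python", False),
    ("rust", True), ("rust", False), ("rust", False), ("rust", False),
]
overall, per_language = pass_rates(results)
print(overall)       # 0.5
print(per_language)  # {'python': 0.75, 'rust': 0.25}
```

In this made-up case, a 50% overall score is of little use to a team that writes only Rust, which is the same reason a headline benchmark number rarely transfers directly to your own task mix.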

LiveCodeBench evaluates models on competitive programming problems that are collected continuously from contest sites such as LeetCode, AtCoder, and Codeforces. Because each problem carries a release date, a model can be scored only on problems published after its training cutoff, which reduces the risk that it has already seen the solutions.
LiveCodeBench is one of the more contamination-resistant benchmarks. However, self-contained contest problems are far removed from the work of maintaining a real codebase, so it still has limitations and doesn’t capture the breadth of challenges faced by software engineers.
Several other coding benchmarks exist, each with its own strengths and weaknesses. Some focus on specific domains such as web development or data science, while others evaluate more general programming skills.
The complexity of software development makes it very hard to create a single benchmark that captures all the nuances of coding. That same complexity also points to where AI coding agents are most useful: on small, well-scoped tasks of the kind these benchmarks do measure. By breaking work down into smaller, manageable units (as in StoryMachine), we can use these agents more effectively.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
26 September 2025