
SWE-bench Verified offers a rigorously human-reviewed set of tasks to better assess AI models' proficiency in handling intricate software engineering problems, enhancing reliability and accuracy in benchmarking.
OpenAI has announced the release of SWE-bench Verified, a human-validated subset of the popular SWE-bench benchmark. This new version is designed to more reliably evaluate AI models' ability to solve real-world software issues, addressing some of the limitations and inaccuracies found in the original benchmark.
Evaluating AI models’ capabilities in software engineering tasks is crucial for ensuring they can operate autonomously in complex, real-world scenarios. The challenges include:
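SWE-bench is a benchmark designed to evaluate the ability of large language models (LLMs) to solve real-world software issues. Each sample in the test set is derived from a resolved GitHub issue in one of 12 open-source Python repositories. A sample pairs the issue text with the pull request that resolved it, which contains both the solution code and the unit tests used to verify that solution.
The unit tests fall into two categories: FAIL_TO_PASS tests, which fail before the fix is applied and pass afterward, and PASS_TO_PASS tests, which pass both before and after the fix and check that unrelated functionality has not been broken.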
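The way these test categories decide whether a sample counts as solved can be sketched roughly as follows. This is a minimal illustration, not the SWE-bench harness itself: it assumes per-test results are already available as a dictionary, and the function name `is_resolved` and the test names are hypothetical.

```python
def is_resolved(fail_to_pass, pass_to_pass, results_after_patch):
    """Return True if a model's patch resolves the sample.

    fail_to_pass: tests expected to go from failing to passing.
    pass_to_pass: tests expected to keep passing.
    results_after_patch: hypothetical mapping of test name -> True if the
        test passed after the model's patch was applied.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results is treated as a failure.
    return all(results_after_patch.get(name, False) for name in required)


# Hypothetical test names, for illustration only.
fail_to_pass = ["test_bug_is_fixed"]
pass_to_pass = ["test_existing_behaviour"]

fixed_and_intact = {"test_bug_is_fixed": True, "test_existing_behaviour": True}
fixed_but_broke_something = {"test_bug_is_fixed": True, "test_existing_behaviour": False}

print(is_resolved(fail_to_pass, pass_to_pass, fixed_and_intact))           # True
print(is_resolved(fail_to_pass, pass_to_pass, fixed_but_broke_something))  # False
```

Requiring both groups to pass means a patch that fixes the reported issue but breaks existing behaviour is not counted as a solution.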
During testing, OpenAI identified several issues with SWE-bench.
To address these issues, OpenAI collaborated with the authors of SWE-bench to create a more reliable subset.

As of August 5, 2024, top-scoring agents on the original SWE-bench achieved:
These scores highlight the progress made in AI for software engineering but also underscore the need for more rigorous evaluation methods.
SWE-bench Verified includes:
For software engineers and researchers, SWE-bench Verified offers:
The release of SWE-bench Verified is a significant step towards more accurate and reliable evaluation of AI models in software engineering. By addressing the limitations of the original benchmark, OpenAI and the SWE-bench team have provided a valuable tool for researchers and practitioners to advance the capabilities of autonomous software development agents.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
14 August 2024