
As AI benchmarks shift towards measuring task length in human hours, the METR plot's reliance on just 14 tasks in the 1-to-4-hour range raises concerns about its reliability and fairness in evaluating long-horizon capabilities.
In 2025, the time-horizon plot published by METR (Model Evaluation and Threat Research) emerged as a key metric for assessing AI's ability to handle long-horizon tasks. Instead of focusing solely on accuracy, METR proposed measuring the length of the tasks a model can complete, expressed as the estimated number of hours a human would need. This shift was welcome, as it ties more closely to real-world automation and economic outcomes, where work (and labor law) is counted in hours.
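To make the idea of a horizon length concrete, here is a minimal sketch. It assumes the horizon is read off as the task length at which a logistic success curve, fitted against the log of human task time, crosses 50%. The task results are invented for illustration and are not METR data, and the function name and fitting procedure (plain gradient ascent) are choices made for this example only.

```python
import math

# Hypothetical (task length in human minutes, model succeeded?) pairs.
# These values are invented for illustration; they are not METR data.
RESULTS = [(1, 1), (2, 1), (4, 1), (8, 1), (15, 1),
           (30, 0), (60, 1), (120, 0), (240, 0), (480, 0)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_horizon_minutes(results, steps: int = 20000, lr: float = 0.1) -> float:
    """Fit P(success) = sigmoid(a + b * (x - mean_x)), x = log2(task minutes),
    by plain gradient ascent on the log-likelihood, then return the task
    length at which the fitted curve crosses 50%."""
    xs = [math.log2(minutes) for minutes, _ in results]
    ys = [ok for _, ok in results]
    mean_x = sum(xs) / len(xs)  # centring x keeps the ascent well behaved
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(a + b * (x - mean_x))
            grad_a += err
            grad_b += err * (x - mean_x)
        a += lr * grad_a / len(xs)
        b += lr * grad_b / len(xs)
    # The curve equals 0.5 exactly where a + b * (x - mean_x) = 0.
    return 2 ** (mean_x - a / b)


horizon = fit_horizon_minutes(RESULTS)
print(f"Estimated 50% horizon: {horizon:.0f} human-minutes")
```

Note how much the estimate leans on the handful of tasks near the crossover: flipping the outcome of a single 30- or 60-minute task moves the fitted horizon noticeably.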
However, there's a significant issue with how we're interpreting and using the METR plot, especially within the AI Safety community. Let’s dive into the technical details to understand why this matters and what we can do about it.
As of 2025, the METR plot showed that frontier AI progress occurred in the regime of horizon lengths between 1 and 4 hours. Here’s the catch: there are only 14 samples with estimated task lengths in this range.
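To get a feel for how little 14 samples can pin down, the sketch below computes a Wilson score interval for a success rate measured on 14 tasks. This is a generic binomial calculation offered as an illustration, not METR's own uncertainty analysis, and the solved counts are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


N_TASKS = 14  # number of tasks in the 1-to-4-hour range, per the article
for solved in (5, 7, 9):  # hypothetical solved counts
    low, high = wilson_interval(solved, N_TASKS)
    print(f"{solved}/{N_TASKS} solved: success rate in [{low:.2f}, {high:.2f}]")
```

With 7 of 14 tasks solved, the interval runs from roughly 27% to 73%, so two models that differ by a couple of tasks in this range are statistically hard to tell apart.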
METR Horizon Length Measurement:
Implementation Details:

Overindexing on METR:
Investment Decisions:
The METR plot is a valuable tool for evaluating AI's ability to handle long-horizon tasks, but we need to be cautious about overinterpreting its results. By addressing the small sample size and potential overfitting issues, we can ensure that the METR plot remains a reliable and meaningful benchmark for AI research.
Original Sources
↗ https://shash42.substack.com/p/how-to-game-the-metr-plot?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
22 December 2025