The METR Plot's 14-Sample Dilemma: Why We Should Rethink AI Benchmarking

Models & Research

The Engineer

22 Dec 2025 · 3 min read

As AI benchmarks shift towards measuring task duration in human hours, the METR plot's reliance on just 14 sample points raises concerns about its reliability and fairness in evaluating long-term capabilities.

In 2025, the time-horizon plot from METR (Model Evaluation & Threat Research) emerged as a key metric for assessing AI's ability to handle long-horizon tasks. Instead of focusing solely on accuracy, METR proposed measuring the length of tasks a model can complete, expressed as the estimated human hours needed. This shift was welcome, as it ties more closely to real-world automation impact and economic outcomes: human labor is itself measured, paid, and regulated in hours worked.

However, there's a significant issue with how we're interpreting and using the METR plot, especially within the AI Safety community. Let’s dive into the technical details to understand why this matters and what we can do about it.

The 1-4 Hour Range: A Small Sample Size Problem

As of 2025, the METR plot places frontier models at horizon lengths between 1 and 4 hours. Here’s the catch: only 14 samples in the task suite have estimated lengths in this range.

Why does this matter?
- Small sample size: With just 14 data points, any observed trends could be highly sensitive to individual tasks and may not generalize well.
- Publicly known tasks: The topics of these 14 tasks are public, making it easy for labs to optimize their models specifically for these benchmarks. This can lead to overfitting and misleading performance metrics.
- Limited information beyond accuracy: The "horizon length" under METR's assumptions might not add much new information beyond traditional benchmark accuracy.
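
To give the small-sample concern a rough scale, here is a minimal sketch that computes a Wilson score interval for a success rate measured on 14 tasks. The function name `wilson_interval` and the 7-of-14 outcome are illustrative assumptions, not METR's data or code.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical outcome: a model solves 7 of the 14 tasks in the 1-4 hour range.
low, high = wilson_interval(7, 14)
print(f"7/14 solved:   {low:.2f} to {high:.2f}")

# Same observed rate with ten times as many tasks, for comparison.
low, high = wilson_interval(70, 140)
print(f"70/140 solved: {low:.2f} to {high:.2f}")
```

With 14 tasks, an observed 50% success rate is compatible with a true rate anywhere from roughly 27% to 73%; with 140 tasks the same observation narrows to roughly 42% to 58%.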

Technical Breakdown of the METR Plot

METR Horizon Length Measurement:
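- Definition: Each task has a length, the estimated time a skilled human needs to complete it. A model's horizon length is the task length at which its success rate falls to about 50%.
- Calculation: Task lengths come from timing human baseliners where possible and from estimates otherwise. The horizon is then read off a curve fitted to the model's successes and failures across task lengths.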
- Public Data: The METR authors transparently provide task metadata, which includes details about each sample.

Implementation Details:
- Task Selection: Tasks come mostly from software engineering, machine learning, and cybersecurity, and range from steps that take a human seconds to projects that take many hours.
- Model Evaluation: Models are scored on whether they complete each task. The estimated human hours determine where the task sits on the length axis.
- Benchmarking Challenges: The small sample size and public nature of tasks can lead to overfitting, where models perform well on specific benchmarks but fail to generalize.
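
The horizon measurement can be sketched in a few lines. The version below assumes the approach METR has described publicly, as I understand it: fit a logistic curve of success against log task length and report the length where predicted success crosses 50%. The task lengths, outcomes, function names, and the plain gradient-ascent fit are all hypothetical stand-ins, not METR's data or implementation. The leave-one-out loop then shows how far the estimate moves when a single task is dropped.

```python
import math

# Hypothetical task lengths (human minutes) and outcomes (1 = model succeeded).
TASK_MINUTES = [2, 4, 8, 15, 30, 60, 90, 120, 150, 180, 240, 240, 480, 960]
OUTCOMES = [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0]


def fit_logistic(xs, ys, steps=3000, lr=0.3):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def horizon_50(task_minutes, outcomes):
    """Task length (minutes) at which the fitted success probability is 50%."""
    logs = [math.log2(m) for m in task_minutes]
    mean = sum(logs) / len(logs)
    # Centering the inputs keeps the simple optimizer well conditioned.
    a, b = fit_logistic([x - mean for x in logs], outcomes)
    if b >= 0:
        raise ValueError("success does not decrease with task length")
    return 2 ** (mean - a / b)


def leave_one_out_range(task_minutes, outcomes):
    """Smallest and largest horizon obtained when one task is dropped."""
    estimates = []
    for i in range(len(task_minutes)):
        estimates.append(horizon_50(task_minutes[:i] + task_minutes[i + 1:],
                                    outcomes[:i] + outcomes[i + 1:]))
    return min(estimates), max(estimates)


full = horizon_50(TASK_MINUTES, OUTCOMES)
low, high = leave_one_out_range(TASK_MINUTES, OUTCOMES)
print(f"50% horizon on all tasks: {full:.0f} min")
print(f"leave-one-out range:      {low:.0f} to {high:.0f} min")
```

The spread between the leave-one-out extremes is a quick, informal indicator of how much a single task outcome can move the headline number when only a handful of tasks sit near the 50% crossing.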

Implications for AI Research and Safety

Overindexing on METR:
- The AI Safety community has been heavily influenced by the METR plot. Researchers are making significant updates to timelines and research priorities based on these results.
- For example, the Claude Opus 4.5 result received over 200 likes within 6 hours, a sign of how closely the community watches the plot.

Investment Decisions:
- Anecdotal evidence suggests that the METR plot has influenced significant investment decisions. However, it's unclear how much weight these decisions give to a metric that rests on just 14 samples in the range that matters.

Moving Forward

- Increase Sample Size: To make the METR plot more robust, we need a larger and more diverse set of tasks.
- Blind Evaluation: Consider using blind evaluation methods where task details are not publicly known until after model submission.
- Complementary Metrics: Continue to use traditional accuracy metrics alongside horizon length to get a more comprehensive picture of model performance.
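
One way a blind set could be used, sketched under assumptions: compare a model's success rate on the public tasks with its rate on a private held-out set of similar length, and ask whether the gap is larger than chance would explain. The one-sided Fisher exact test below uses only the standard library. The held-out set, the counts, and the function name `fisher_one_sided` are hypothetical.

```python
from math import comb


def fisher_one_sided(pub_success: int, pub_total: int,
                     priv_success: int, priv_total: int) -> float:
    """P-value for 'public success rate exceeds private rate' (Fisher exact, one-sided).

    Under the null hypothesis the number of public successes follows a
    hypergeometric distribution given the overall number of successes.
    """
    total = pub_total + priv_total
    successes = pub_success + priv_success
    denom = comb(total, pub_total)
    upper = min(successes, pub_total)
    tail = 0
    for k in range(pub_success, upper + 1):
        # Ways to place k successes among public tasks and the rest among private ones.
        tail += comb(successes, k) * comb(total - successes, pub_total - k)
    return tail / denom


# Hypothetical counts: 12 of 14 public tasks solved vs. 6 of 14 held-out tasks.
p_value = fisher_one_sided(12, 14, 6, 14)
print(f"one-sided p-value: {p_value:.3f}")
```

For these made-up counts the p-value is about 0.023, which would hint that the public tasks are easier for the model than comparable unseen ones. With sets this small, only large gaps are detectable, which is another argument for more tasks.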

Conclusion

The METR plot is a valuable tool for evaluating AI's ability to handle long-horizon tasks, but we need to be cautious about overinterpreting its results. By addressing the small sample size and potential overfitting issues, we can ensure that the METR plot remains a reliable and meaningful benchmark for AI research.