OpenAI's New Approach to Evaluating Frontier AI Models

The Engineer

15 Jun 2026 · 4 min read

As AI models become more sophisticated, traditional evaluation methods are falling short. OpenAI shares insights on designing robust tests for modern systems.

OpenAI develops and deploys some of the most advanced AI models, and as those models grow more complex, effective and independent evaluation becomes more important. In a recent post, the company shared lessons learned from evaluating frontier models and offered recommendations for building valid assessments that can inform emerging standards in the field.

The Evolution of Model Evaluations

Earlier methods often treated AI models like simple chatbots: evaluators would prompt the model with questions, the model would respond, and an evaluator would judge the output. However, today's frontier models are far more capable. They can use tools, maintain context across multiple steps, and integrate into larger workflows. This means that a model’s performance is not just a function of its internal capabilities but also depends heavily on the environment in which it operates.

OpenAI refers to this surrounding setup as the "harness." The harness includes the tools available to the model, the way information is managed, and the mechanisms for error recovery. These elements can significantly influence how a model performs tasks, uses tools, keeps track of information, or recovers from mistakes.
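One way to make the harness explicit is to record it alongside every result, so that two scores can be checked for whether they came from the same setup. The sketch below is illustrative only: `HarnessConfig`, its field names, and `harness_differences` are hypothetical and are not taken from OpenAI's post.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical record of the setup surrounding a model during an evaluation."""
    tools: tuple = ()                       # names of tools the model may call
    context_strategy: str = "full_history"  # how information is managed across steps
    max_retries: int = 0                    # error-recovery budget per step


def harness_differences(a: HarnessConfig, b: HarnessConfig) -> list:
    """Return the names of the fields on which two harnesses differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


if __name__ == "__main__":
    bare = HarnessConfig()
    agentic = HarnessConfig(tools=("code_editor", "docs_search"), max_retries=2)
    # Any non-empty list means the two scores were not produced under the same harness.
    print(harness_differences(bare, agentic))  # ['tools', 'max_retries']
```

If the list is non-empty, a difference in scores may reflect the harness rather than the model.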

Key Components of Effective Evaluations

To conduct meaningful evaluations in this new landscape, OpenAI recommends explicitly describing two crucial aspects beyond the results themselves:

- Claim Specification: What specific claim is the evaluation designed to test? This could be about a model’s capability, the robustness of its safeguards, or how it compares to other models.
- Validity Evidence: What evidence supports the validity of the evaluation result? This includes addressing potential issues that could skew results.
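As a rough illustration, a result can be stored together with the claim it is meant to support and the evidence for its validity, so that a bare number never travels alone. The class and field names below are assumptions made for this sketch, not a format defined by OpenAI.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationReport:
    """Hypothetical container pairing a score with its claim and validity evidence."""
    claim: str                      # the specific claim the evaluation tests
    score: float                    # the headline result
    validity_evidence: list = field(default_factory=list)  # checks supporting the result

    def missing_parts(self) -> list:
        """List which of the two recommended descriptions are absent."""
        missing = []
        if not self.claim.strip():
            missing.append("claim")
        if not self.validity_evidence:
            missing.append("validity_evidence")
        return missing


if __name__ == "__main__":
    report = EvaluationReport(claim="Model resists a fixed set of jailbreak prompts", score=0.93)
    print(report.missing_parts())  # ['validity_evidence']
```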

Claims Tested in Evaluations

Claims typically fall into one of three categories:

- Capability Elicitation: Can the model plausibly produce the capability being evaluated?
  - Example: Does a language model generate coherent and contextually relevant responses to complex queries?
- Safeguard Performance: How robust are the tested safeguards against specific behaviors or attacks?
  - Example: How well does a model resist attempts to elicit harmful content?
- Comparison: How do different models perform under equivalent conditions?
  - Example: Which version of a language model performs better on a particular task?
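For the comparison category, "equivalent conditions" implies at minimum that both models were scored on the same tasks. A minimal sketch of that check follows, assuming each model's results are stored as a mapping from task ID to pass/fail; the function name and data layout are hypothetical.

```python
def compare_pass_rates(results_a: dict, results_b: dict) -> dict:
    """Compare two models' pass rates, refusing to compare mismatched task sets.

    Each argument maps a task ID to True (passed) or False (failed).
    """
    if results_a.keys() != results_b.keys():
        raise ValueError("models were not evaluated on the same task set")
    if not results_a:
        raise ValueError("no tasks to compare")
    n = len(results_a)
    rate_a = sum(results_a.values()) / n
    rate_b = sum(results_b.values()) / n
    return {"rate_a": rate_a, "rate_b": rate_b, "difference": rate_a - rate_b}


if __name__ == "__main__":
    model_a = {"t1": True, "t2": False, "t3": True, "t4": True}
    model_b = {"t1": True, "t2": False, "t3": False, "t4": True}
    print(compare_pass_rates(model_a, model_b))
    # {'rate_a': 0.75, 'rate_b': 0.5, 'difference': 0.25}
```

Matching task sets is only one condition. A fuller comparison would also hold the harness fixed and account for sampling noise, which this sketch does not attempt.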

Ensuring Validity

Evaluators must also address several factors that could impact the validity of results:

- Reward Hacking: Models may exploit shortcuts in tasks or scoring mechanisms to receive credit without demonstrating the intended behavior.
- Refusals: Models might refuse to perform tasks in ways that obscure the behavior being tested.
- Contamination: Overperformance can occur if evaluation tasks, answers, or variants are present in training data or discoverable during the evaluation.
- Broken Problems: Underperformance may result from invalid tasks, such as those with unfair scoring criteria or unsolvable environments.
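Two of these factors, refusals and broken problems, can be kept visible in the score itself by not folding them into a single pass rate. The sketch below assumes each attempt has already been labeled with one of four hypothetical outcome strings; how those labels are assigned (and how reward hacking or contamination are detected) is outside its scope.

```python
from collections import Counter

# Hypothetical outcome labels, assumed to be assigned by a reviewer or grader.
VALID_LABELS = {"pass", "fail", "refusal", "broken_task"}


def summarize_outcomes(outcomes: list) -> dict:
    """Report pass and refusal rates with broken tasks excluded from the denominator."""
    counts = Counter(outcomes)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    valid_total = len(outcomes) - counts["broken_task"]
    if valid_total == 0:
        raise ValueError("no valid tasks to score")
    attempted = valid_total - counts["refusal"]
    return {
        "broken_tasks_excluded": counts["broken_task"],
        "pass_rate": counts["pass"] / valid_total,
        "refusal_rate": counts["refusal"] / valid_total,
        # Pass rate among tasks the model actually attempted; None if it refused everything.
        "pass_rate_when_attempted": counts["pass"] / attempted if attempted else None,
    }


if __name__ == "__main__":
    summary = summarize_outcomes(["pass", "pass", "fail", "refusal", "broken_task"])
    print(summary["pass_rate"], summary["refusal_rate"])  # 0.5 0.25
```

Reporting the refusal rate separately makes it possible to tell a model that cannot do a task from one that declined to try.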

## Under the Hood

To illustrate these concepts, let’s dive into a practical example. Suppose you want to evaluate a new language model's ability to generate secure code snippets for a web application. Here’s how you might structure your evaluation:

- Claim Specification: The model can generate secure and functional code snippets that meet industry standards.
- Harness Setup:
  - Provide the model with access to a code editor and relevant documentation.
  - Ensure the environment supports running generated code to test for functionality and security.
- Validity Evidence:
  - Use multiple evaluators to judge the output, ensuring consistency.
  - Include tasks that require the model to handle edge cases and potential vulnerabilities.
  - Compare results with human-generated code snippets to establish a baseline.

By clearly defining the claim and providing evidence of validity, you can create an evaluation that accurately assesses the model’s capabilities in a real-world context.
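The "multiple evaluators" step in the example above calls for some measure of consistency between judges. One common choice, used here as an assumption rather than something the source prescribes, is Cohen's kappa, which corrects raw agreement between two raters for the agreement expected by chance. The labels and function name below are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    rater_1 = ["secure", "secure", "insecure", "insecure"]
    rater_2 = ["secure", "insecure", "insecure", "insecure"]
    print(cohens_kappa(rater_1, rater_2))  # 0.5
```

A kappa near 1 suggests the judges apply the rubric consistently; a low value suggests the scoring criteria need tightening before the results are trusted.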

## Key Takeaways

- Traditional prompt-and-judge evaluations fall short for modern AI models, which use tools, work across multiple steps, and run inside larger workflows.
- Effective evaluations specify the claim being tested and provide evidence of validity, addressing issues such as reward hacking, refusals, contamination, and broken problems.
- The harness strongly influences model performance, so the evaluation environment needs to be designed and described with care.

As AI continues to advance, the methods for evaluating these models must evolve alongside them. OpenAI’s insights provide a valuable framework for practitioners looking to conduct thorough and reliable assessments of frontier AI systems.