The AI Eval Flywheel: A Systematic Approach to Feature Development and Rapid Iteration


The Engineer

16 Jun 2025 · 3 min read

At the AI Engineer World’s Fair, speakers showed how systematic evaluation frameworks are changing AI feature development, replacing guesswork with data-driven iteration and deployment.

Last week, I attended the 2025 AI Engineer World’s Fair in San Francisco with a group of founders from Seattle Foundations. Of the more than 20 tracks on offer, I focused on Evals, learning how companies like Google, Notion, Zapier, and Vercel build and deploy evaluations for their AI features.

Beyond Vibes: The Need for Systematic Evaluation

Without evals, AI feature development often relies on ad-hoc methods. You might test a few inputs and judge the outputs based on gut feelings. For instance, if you're building a chatbot, you might riff on a few messages, get a feel for the responses, and decide it's ready to ship. However, this approach breaks down when new models launch or edge cases arise. Because AI systems are non-deterministic, the same input can produce different outputs, and it quickly becomes impossible to track every permutation by hand. Improvement needs a more systematic process.

Little e and Big E: Understanding the Framework

The term "evals" is used in two contexts:

Little e (Specific Step): Systematically grading the outputs of your feature. For example, you might score an output on a scale of 0 to 100 based on specific characteristics.
Big E (Broader Process): A structured framework that includes input management, production usage filtering and curation, and rapid iteration.
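
To make the "little e" step concrete, here is a minimal sketch of a grader that scores a chatbot reply from 0 to 100. It is my own illustration, not code from any of the talks: the `Check` type, the three checks, and their point values are all hypothetical, and a real grader would use criteria specific to your feature.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    points: int                      # contribution to the 0-100 score
    passed: Callable[[str], bool]    # returns True if the output passes


# Hypothetical characteristics for a chatbot reply; points sum to 100.
CHECKS = [
    Check("non_empty", 40, lambda out: bool(out.strip())),
    Check("concise", 30, lambda out: len(out.split()) <= 50),
    Check("no_ai_disclaimer", 30, lambda out: "as an ai" not in out.lower()),
]


def grade(output: str) -> int:
    """Return a 0-100 score: the sum of points for every check that passes."""
    return sum(check.points for check in CHECKS if check.passed(output))
```

Calling `grade("Your order ships tomorrow.")` passes all three checks, while a reply that opens with "As an AI" loses the points for the third. Checks like these can be plain code, as here, or calls to an LLM judge.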

The flywheel model captures this broader process.

Scoring & Signals: The Heart of the Flywheel

One standout talk was by Pi Labs, founded by former Googlers David Karam and Achint Srivastava. They discussed Google's approach to assessing search result quality, which involves breaking down a good search result into 300 distinct signals. These signals include:

Page Speed: How quickly the page loads.
Backlinks: The number and quality of links pointing to the page.
Writing Quality: The clarity and coherence of the content.
Page Design: The visual appeal and user experience.
Query Relevance: How directly the result answers the specific query.

Each signal is judged automatically, either by code or, more recently, by large language models (LLMs). The signals are then combined into a weighted sum to produce a final score for each search result. Google uses these scores to sort and rank its search results.
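
The weighted-sum idea can be sketched in a few lines. This is an illustration under my own assumptions, not Google's or Pi Labs' implementation: the weights are invented, each signal is assumed to be normalized to the range 0 to 1, and only the five example signals listed above are included.

```python
from typing import Dict, List, Tuple

# Hypothetical weights (they sum to 1.0); real weights are not public.
WEIGHTS: Dict[str, float] = {
    "page_speed": 0.10,
    "backlinks": 0.20,
    "writing_quality": 0.20,
    "page_design": 0.10,
    "query_relevance": 0.40,
}


def combined_score(signals: Dict[str, float]) -> float:
    """Weighted sum of signals; a missing signal counts as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def rank(results: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (result_id, score) pairs sorted from highest to lowest score."""
    scored = [(result_id, combined_score(signals)) for result_id, signals in results.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The same shape carries over to AI features: replace the search signals with the characteristics you care about in your feature's output, score each one separately, and combine them so that a single number can be tracked over time.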

Pi Labs' premise is that AI features can benefit from a similarly systematic evaluation process, but that most currently lack the structured approach seen in established systems like Google's search ranking.

The Flywheel in Action

The flywheel model emphasizes continuous improvement through rapid iteration:

Grading Outputs: Regularly score your feature’s outputs based on predefined criteria.
Structured Inputs: Use a consistent set of inputs for testing to ensure reproducibility and reliability.
Production Usage Filtering & Curation: Analyze real-world usage data to identify edge cases and areas for improvement.
Rapid Iteration: Use the insights gained from grading and production data to quickly refine and enhance your AI feature.
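
The four steps above can be sketched as one small loop. This is a toy illustration under my own assumptions, not any speaker's implementation: `run_eval`, `curate`, the log format, the threshold of 50, and the stand-in feature and grader are all hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List

Feature = Callable[[str], str]          # the AI feature under test
Grader = Callable[[str, str], float]    # (input, output) -> score in [0, 100]


def run_eval(feature: Feature, grader: Grader, eval_inputs: List[str]) -> float:
    """Grade the feature on a fixed input set and return the mean score."""
    return mean(grader(x, feature(x)) for x in eval_inputs)


def curate(logs: List[Dict[str, str]], grader: Grader,
           eval_inputs: List[str], threshold: float = 50.0) -> List[str]:
    """Extend eval_inputs with new production inputs whose logged output scored low."""
    known = set(eval_inputs)
    additions = []
    for entry in logs:
        if entry["input"] in known:
            continue
        if grader(entry["input"], entry["output"]) < threshold:
            additions.append(entry["input"])
            known.add(entry["input"])
    return eval_inputs + additions


# Toy stand-ins so the sketch runs end to end.
def echo_feature(prompt: str) -> str:
    return prompt.upper()


def non_empty_grader(prompt: str, output: str) -> float:
    # Toy criterion: full marks for a non-empty output, zero otherwise.
    return 100.0 if output.strip() else 0.0


eval_inputs = ["hello", "reset my password"]            # structured inputs
baseline = run_eval(echo_feature, non_empty_grader, eval_inputs)

logs = [                                                # production usage
    {"input": "hello", "output": "HELLO"},
    {"input": "   ", "output": "   "},
    {"input": "refund?", "output": ""},
]
eval_inputs = curate(logs, non_empty_grader, eval_inputs)
after = run_eval(echo_feature, non_empty_grader, eval_inputs)
```

In this toy run, the two low-scoring production inputs join the eval set, and the mean score drops because the whitespace-only input still fails. That drop is the signal for the next iteration: change the feature, rerun `run_eval` on the same inputs, and compare.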

Practical Benefits

Implementing this flywheel model can lead to several benefits:

Consistency: Ensures that all outputs are evaluated against a standardized set of criteria.
Transparency: Provides clear metrics for assessing performance, making it easier to communicate progress to stakeholders.
Efficiency: Facilitates faster development cycles by identifying and addressing issues promptly.

Conclusion

The AI eval flywheel offers a structured approach to developing and improving AI features. By systematically grading outputs, managing inputs, curating production data, and iterating rapidly, you can make your AI systems more robust and reliable, and keep them improving over time.