
At the AI Engineer World’s Fair, speakers described how systematic evaluation frameworks are changing AI feature development, replacing guesswork with data-driven iteration and deployment.
Last week, I attended the 2025 AI Engineer World’s Fair in San Francisco with a group of founders from Seattle Foundations. The conference ran over 20 tracks; I focused on evals, learning how companies like Google, Notion, Zapier, and Vercel build and deploy evaluations for their AI features.
Without evals, AI feature development often relies on ad-hoc methods: you test a few inputs and judge the outputs by gut feeling. For instance, if you're building a chatbot, you might riff on a few messages, get a feel for the responses, and decide it's ready to ship. This approach breaks down when new models launch or edge cases arise. Because AI systems are non-deterministic, it is hard to keep track of all permutations by hand, so a more systematic improvement process is needed.
The term "evals" is used in two contexts:
The flywheel model captures this broader process:
One standout talk was by Pi Labs, founded by former Googlers David Karam and Achint Srivastava. They discussed Google's approach to assessing search result quality, which involves breaking down a good search result into 300 distinct signals. These signals include:

Each signal is judged automatically, either by code or, more recently, by large language models (LLMs). The signal scores are then combined into a weighted sum to produce a final score for each search result, and Google uses these scores to sort and rank its search results.
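To make the weighted-sum idea concrete, here is a minimal sketch in Python. It is not Google's or Pi Labs' implementation: the two signals, their names, and the weights are all hypothetical, chosen only to show how per-signal scores can be combined into one number and used for ranking.

```python
from typing import Callable, Dict, List

# A signal maps (query, result) to a score in [0, 1].
# Both signals below are made-up examples, not real ranking signals.
Signal = Callable[[str, str], float]


def contains_query_terms(query: str, result: str) -> float:
    """Fraction of query terms that appear in the result text."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    text = result.lower()
    return sum(term in text for term in terms) / len(terms)


def is_concise(query: str, result: str) -> float:
    """1.0 if the result is at most 50 words long, else 0.0."""
    return 1.0 if len(result.split()) <= 50 else 0.0


SIGNALS: Dict[str, Signal] = {
    "contains_query_terms": contains_query_terms,
    "is_concise": is_concise,
}

# Hypothetical weights; a real system would tune these.
WEIGHTS: Dict[str, float] = {
    "contains_query_terms": 0.7,
    "is_concise": 0.3,
}


def score(query: str, result: str) -> float:
    """Weighted sum of all signal scores for one result."""
    return sum(WEIGHTS[name] * fn(query, result) for name, fn in SIGNALS.items())


def rank(query: str, results: List[str]) -> List[str]:
    """Sort results from highest to lowest combined score."""
    return sorted(results, key=lambda r: score(query, r), reverse=True)
```

In this sketch a signal judged by an LLM would simply be another function with the same signature, so code-based and model-based graders can sit side by side in `SIGNALS`.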
Pi Labs' premise is that AI features can benefit from a similarly systematic evaluation process, but most lack the structured approach seen in established systems like Google's search ranking.
The flywheel model emphasizes continuous improvement through rapid iteration:
Implementing this flywheel model can lead to several benefits:
The AI eval flywheel offers a structured approach to developing and improving AI features. By systematically grading outputs, managing inputs, curating production data, and iterating rapidly, you can ensure that your AI systems are robust, reliable, and continuously improving.
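One turn of that loop can be sketched in a few lines of Python. Everything here is an assumption for illustration: the eval cases are invented rather than curated from production, and `baseline` and `candidate` are stand-in functions where a real system would call a model.

```python
from typing import Callable, List, Tuple

# Each case pairs an input with a grader that returns True on an acceptable output.
# These cases are hypothetical; in practice they would come from production data.
Case = Tuple[str, Callable[[str], bool]]

EVAL_SET: List[Case] = [
    ("What is 2 + 2?", lambda out: "4" in out),
    ("Say hello", lambda out: "hello" in out.lower()),
    ("Reply in one word: what colour is the sky?", lambda out: len(out.split()) == 1),
]


def pass_rate(system: Callable[[str], str], cases: List[Case]) -> float:
    """Run every case through the system and return the fraction that pass."""
    passed = sum(1 for prompt, grader in cases if grader(system(prompt)))
    return passed / len(cases)


def baseline(prompt: str) -> str:
    """Stand-in for the current version of the feature."""
    return "Hello! The answer is 4."


def candidate(prompt: str) -> str:
    """Stand-in for a new prompt or model that handles the one-word case."""
    if "one word" in prompt:
        return "Blue"
    return "Hello! The answer is 4."


def should_ship(candidate_rate: float, baseline_rate: float) -> bool:
    """Ship only if the candidate does at least as well as the baseline."""
    return candidate_rate >= baseline_rate


if __name__ == "__main__":
    base = pass_rate(baseline, EVAL_SET)
    cand = pass_rate(candidate, EVAL_SET)
    print(f"baseline={base:.2f} candidate={cand:.2f} ship={should_ship(cand, base)}")
```

The point of the sketch is that the same fixed set of graded inputs is rerun on every change, so a new model or prompt is compared against the old one on evidence rather than on a handful of spot checks.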
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
16 June 2025