
Discover how to avoid the pitfalls of using Large Language Models as judges by streamlining evaluation processes and focusing on essential metrics, ensuring your AI systems deliver real value.
Earlier this year, I wrote about the importance of evaluation in AI product development. Many readers asked how to get started with using Large Language Models (LLMs) as judges. This guide distills what I’ve learned from helping over 30 companies set up their evaluation systems.
Ever spent weeks building an AI system, only to realize you have no idea if it’s actually working? You’re not alone. I've seen teams repeatedly fall into these common traps when using LLMs to evaluate AI outputs:
The result? Teams end up buried under mountains of metrics they can't trust or use. Progress grinds to a halt, and everyone gets frustrated. For example, it’s not uncommon for me to see dashboards that look like this:
[Figure: An illustrative example of a bad eval dashboard]
Tracking a bunch of scores on a 1-5 scale is often a sign of a flawed evaluation process (I’ll discuss why later). In this post, I’ll show you how to avoid these pitfalls. The solution is a technique I call “Critique Shadowing”. Here’s how to do it, step by step.
Most organizations have one (or maybe two) key individuals whose judgment is crucial to the success of the AI product. These are people who have deep domain expertise or who represent your target users. Identifying and involving this Principal Domain Expert early in the process is critical.

Once you have your Principal Domain Expert, work with them to define clear, actionable criteria. This involves:
Let’s say you’re building an AI chatbot for customer service. Your Principal Domain Expert might help define criteria like:
Each criterion should have a well-defined scale and rubric, so that two people reading the same response would reach the same judgment.
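To make "a well-defined scale and rubric" concrete, here is a minimal sketch of how a single criterion could be written down in code. The `Criterion` class, the `resolution` criterion, and the rubric wording are all invented for illustration and are not taken from this article. The idea is only that every allowed label is spelled out, and anything outside the rubric is rejected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One evaluation criterion with an explicit rubric (hypothetical structure)."""

    name: str
    question: str
    rubric: dict[str, str]  # allowed label -> what that label means

    def validate(self, label: str) -> str:
        """Return the label if the rubric defines it, otherwise raise ValueError."""
        if label not in self.rubric:
            allowed = ", ".join(self.rubric)
            raise ValueError(f"{self.name}: {label!r} is not one of: {allowed}")
        return label


# Invented example for a customer service chatbot; not the author's criteria.
resolution = Criterion(
    name="resolution",
    question="Did the reply resolve the customer's issue?",
    rubric={
        "pass": "The issue is resolved or the customer has a clear next step.",
        "fail": "The issue is unresolved and no clear next step is given.",
    },
)

print(resolution.validate("pass"))
```

Writing the rubric down this way forces you and the domain expert to agree on what each label means before anyone starts scoring.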
Critique Shadowing involves using an LLM to shadow the judgments of your Principal Domain Expert. This process helps you:
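One way to check whether the LLM is actually shadowing the expert is to compare the two sets of judgments on the same examples. The sketch below assumes both have labeled the same items in the same order. The function name `agreement_report` and the sample labels are hypothetical, and simple percent agreement is used here only as a starting point, not as the author's prescribed metric.

```python
def agreement_report(expert_labels, judge_labels):
    """Compare LLM-judge labels against the expert's labels, item by item.

    Returns the fraction of items where both agree, plus the indices of the
    items where they disagree so those can be reviewed with the expert.
    """
    if len(expert_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not expert_labels:
        raise ValueError("need at least one labeled example")

    disagreements = [
        i
        for i, (expert, judge) in enumerate(zip(expert_labels, judge_labels))
        if expert != judge
    ]
    matches = len(expert_labels) - len(disagreements)
    return matches / len(expert_labels), disagreements


# Made-up labels for five chatbot responses.
expert = ["pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail"]

rate, to_review = agreement_report(expert, judge)
print(f"agreement: {rate:.0%}; items to review: {to_review}")
```

The list of disagreeing items is usually more useful than the rate itself, since those are the examples to walk through with the Principal Domain Expert.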
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
1 November 2024