
As AI agents take on more complex, end-to-end tasks, traditional LLM judges fall short. A new approach with agentic judges is needed to ensure accurate and reliable evaluations.
Judgment Labs, an AI research and development company, has announced $32M in funding led by Lightspeed. The investment underscores the growing importance of robust evaluation methods for long-horizon agents, AI systems that autonomously perform complex tasks from start to finish. Traditional LLM judges, which are often used to evaluate these agents, are increasingly inadequate because they struggle with long trajectories and stateful actions.
Most teams rely on a simple LLM judge to evaluate agent performance. The judge receives the user query, the final agent output, some metadata, and a rubric, and is then asked whether the agent behaved as intended. This method breaks down for long-horizon agents, whose runs can span hundreds of tool calls across various systems.
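A minimal sketch of that single-call setup may help make the limitation concrete. The names here (`JudgeInput`, `build_judge_prompt`, `simple_llm_judge`) and the PASS/FAIL reply format are assumptions made for illustration, not Judgment Labs' implementation; `call_model` stands in for whatever model client a team already uses.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class JudgeInput:
    """Everything a single-call LLM judge gets to see (hypothetical structure)."""
    user_query: str
    final_output: str
    rubric: str
    metadata: dict[str, str] = field(default_factory=dict)


def build_judge_prompt(item: JudgeInput) -> str:
    """Flatten the inputs into one prompt; no tool calls or external state appear."""
    meta = "\n".join(f"{k}: {v}" for k, v in sorted(item.metadata.items()))
    return (
        f"Rubric:\n{item.rubric}\n\n"
        f"User query:\n{item.user_query}\n\n"
        f"Agent final output:\n{item.final_output}\n\n"
        f"Metadata:\n{meta}\n\n"
        "Did the agent behave as intended? Answer PASS or FAIL."
    )


def simple_llm_judge(item: JudgeInput, call_model: Callable[[str], str]) -> bool:
    """Make one model call and parse an assumed PASS/FAIL verdict."""
    reply = call_model(build_judge_prompt(item))
    return reply.strip().upper().startswith("PASS")
```

Note that nothing in the prompt describes the intermediate steps the agent took, so the verdict rests entirely on the final output and the rubric.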
Long Trajectories: Long-horizon agents often generate extensive trajectories involving many steps and interactions with different databases, services, documents, and other systems. For example, a sales agent might research leads, update a CRM, send an email, and book a meeting. A coding agent could edit numerous files, update AWS configurations, and open a GitHub pull request. The problem is that LLM judges have finite context windows, and a trajectory spanning hundreds of tool calls can exceed that budget. The judge then sees only a portion of the agent's trajectory at any given time, which leads to incomplete evaluations.
Stateful Actions: Production agents do more than generate text; they perform actions that change state in external systems. For instance, a sales agent might update lead statuses in a CRM, and an evaluator needs to verify these changes against the actual system. LLM judges are limited to seeing only the trajectory and cannot access or verify stateful changes in external systems such as Google Calendar, CRMs, AWS, or GitHub.
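The context problem is easy to see with rough arithmetic. The sketch below is illustrative only: the 4-characters-per-token estimate, the 8,000-token budget, and the synthetic 300-step trajectory are all assumptions, not figures from Judgment Labs.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption, not a real tokenizer)."""
    return max(1, len(text) // 4)


def steps_that_fit(steps: list[str], budget_tokens: int) -> int:
    """Return how many leading steps fit in the budget when packed in order."""
    used = 0
    for count, step in enumerate(steps):
        cost = estimate_tokens(step)
        if used + cost > budget_tokens:
            return count
        used += cost
    return len(steps)


# 300 synthetic tool calls of 400 characters each, so roughly 100 tokens apiece.
trajectory = [f"step {i:03d}: " + "x" * 390 for i in range(300)]
visible = steps_that_fit(trajectory, budget_tokens=8_000)
print(f"{visible} of {len(trajectory)} steps fit")  # prints "80 of 300 steps fit"
```

Under these assumed numbers, a judge given the trajectory in order would never see the last 220 steps, which is where a failed CRM update or a bad configuration change could sit.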
To address these limitations, Judgment Labs proposes using agentic judges: judges that can search, verify, and adapt. These judges are designed to handle long-horizon agents more effectively by:

Expanding Context: Agentic judges can break down the agent's trajectory into manageable chunks, allowing them to evaluate each part in detail. This reduces the risk that critical information is missed because of context window limitations.
Verifying Stateful Actions: Unlike LLM judges, agentic judges can interact with external systems to verify state changes. For example, they can check if a lead status was updated correctly in a CRM or if an AWS configuration change was applied as intended.
Adapting to New Scenarios: Agentic judges are more flexible and can adapt to new evaluation scenarios without requiring extensive retraining. This makes them more robust and reliable for evaluating complex agent behaviors.
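The first two capabilities above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the design Judgment Labs describes: fixed-size chunking is just one possible way to split a trajectory, and `Step`, the tool name `crm.update_status`, and the in-memory `crm` dictionary are hypothetical stand-ins for a real trajectory format and a real system of record.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class Step:
    """One tool call in an agent trajectory (hypothetical shape)."""
    tool: str
    args: dict
    claimed_result: str


def chunk_trajectory(steps: list[Step], size: int) -> Iterator[list[Step]]:
    """Yield consecutive fixed-size windows so each one fits a single judge call."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(steps), size):
        yield steps[start:start + size]


def verify_crm_updates(steps: list[Step], read_status: Callable[[str], str]) -> list[str]:
    """Compare each claimed CRM update with the status the system actually holds."""
    problems = []
    for step in steps:
        if step.tool != "crm.update_status":
            continue
        lead, expected = step.args["lead_id"], step.args["status"]
        actual = read_status(lead)
        if actual != expected:
            problems.append(f"{lead}: trajectory says {expected!r}, CRM has {actual!r}")
    return problems


# Fake CRM standing in for the external system; lead-2 was never actually updated.
crm = {"lead-1": "contacted", "lead-2": "new"}
steps = [
    Step("crm.update_status", {"lead_id": "lead-1", "status": "contacted"}, "ok"),
    Step("email.send", {"to": "a@example.com"}, "sent"),
    Step("crm.update_status", {"lead_id": "lead-2", "status": "qualified"}, "ok"),
]
chunks = list(chunk_trajectory(steps, size=2))
problems = verify_crm_updates(steps, crm.__getitem__)
print(len(chunks), problems)
```

In this example the trajectory reports "ok" for both updates, so a judge reading only the trajectory would pass it. Reading the state back from the system is what surfaces the mismatch on `lead-2`.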
In practice, this shift toward agentic judges is crucial as the industry moves toward more sophisticated and autonomous AI systems. By ensuring accurate evaluations, teams can identify and address agent failures more effectively, leading to better customer satisfaction and more reliable AI solutions.
Original Sources
Agent Judge: Solving Long-Context Evals for Production Agents
↗ https://www.judgmentlabs.ai/blogs/agent-judge-solving-long-context-evaluations?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
8 June 2026
Related Articles

Global Researchers Compete to Shape AI's Future in Organizations
Models & Research · 3 min

MiniMax M3 Challenges GPT-5.5 and Gemini 3.1 Pro with Superior Performance at a Fraction of the Cost
Models & Research · 4 min

OpenAI’s AI Solves Complex Math Problems by Leveraging Pattern Recognition
Models & Research · 4 min
© 2026 Cedar & Bloom. All rights reserved.