Long-Horizon Agents Demand Agentic Judges for Accurate Evaluations

The Engineer

8 Jun 2026

As AI agents take on more complex, end-to-end tasks, traditional LLM judges fall short. A new approach with agentic judges is needed to ensure accurate and reliable evaluations.

Judgment Labs, an AI research and development company, has announced $32M in funding led by Lightspeed. The investment underscores the growing importance of robust evaluation methods for long-horizon agents, AI systems that autonomously perform complex tasks from start to finish. Traditional LLM judges, which are often used to evaluate these agents, are increasingly inadequate because they struggle with long trajectories and stateful actions.

The Limitations of LLM Judges

Most teams rely on a simple LLM judge to evaluate agent performance. The judge receives the user query, the final agent output, some metadata, and a rubric, and is then asked whether the agent behaved as intended. This method breaks down for long-horizon agents, whose runs can span hundreds of tool calls across multiple systems.
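
To make the setup concrete, here is a minimal sketch of the input a single-pass judge might receive. The names `JudgeInput` and `build_judge_prompt`, the prompt wording, and the sample values are all illustrative assumptions, not Judgment Labs' format or any library's API. No model is called; the sketch only assembles the prompt.

```python
from dataclasses import dataclass, field


@dataclass
class JudgeInput:
    """Everything a single-pass LLM judge gets to see (hypothetical structure)."""
    user_query: str
    final_output: str
    rubric: str
    metadata: dict = field(default_factory=dict)


def build_judge_prompt(item: JudgeInput) -> str:
    """Assemble one prompt asking whether the agent behaved as intended."""
    meta_lines = "\n".join(f"- {k}: {v}" for k, v in sorted(item.metadata.items()))
    return (
        "You are grading an AI agent.\n\n"
        f"User query:\n{item.user_query}\n\n"
        f"Final agent output:\n{item.final_output}\n\n"
        f"Metadata:\n{meta_lines or '- (none)'}\n\n"
        f"Rubric:\n{item.rubric}\n\n"
        "Did the agent behave as intended? Answer PASS or FAIL with a reason."
    )


# Made-up example values for illustration only.
prompt = build_judge_prompt(
    JudgeInput(
        user_query="Book a meeting with the lead from Acme.",
        final_output="Meeting booked for Tuesday at 10:00.",
        rubric="PASS only if a meeting was actually booked with the right lead.",
        metadata={"agent": "sales-agent", "tool_calls": 212},
    )
)
print(prompt)
```

Note what the prompt leaves out: the intermediate tool calls appear only as a count in the metadata, and nothing in it lets the judge confirm that a meeting was really booked.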

Long Trajectories: Long-horizon agents often generate extensive trajectories involving many steps and interactions with databases, services, documents, and other systems. For example, a sales agent might research leads, update a CRM, send an email, and book a meeting. A coding agent could edit numerous files, update AWS configurations, and open a GitHub pull request. LLM judges have finite context windows, and a trajectory spanning hundreds of tool calls may exceed that limit or have to be truncated or summarized to fit. The judge then sees only part of the trajectory, which leads to incomplete evaluations.
Stateful Actions: Production agents do more than generate text; they perform actions that change state in external systems. For instance, a sales agent might update lead statuses in a CRM, and an evaluator needs to verify these changes against the actual system. An LLM judge sees only the trajectory, which records what the agent reported doing, and cannot access or verify the resulting state in external systems such as Google Calendar, CRMs, AWS, or GitHub.
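
The long-trajectory problem can be illustrated with a small sketch of what happens when a trajectory is trimmed to fit a judge's token budget. Everything here is an assumption made for illustration: the helper names `approx_tokens` and `fit_to_budget`, the word-count token estimate, the budget of 600, and the made-up 300-step trajectory. Real tokenizers and context limits differ.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate: one token per whitespace-separated word (assumption)."""
    return len(text.split())


def fit_to_budget(steps: list[str], budget: int) -> list[str]:
    """Keep the most recent steps that fit in the token budget, dropping older ones."""
    kept: list[str] = []
    used = 0
    for step in reversed(steps):
        cost = approx_tokens(step)
        if used + cost > budget:
            break
        kept.append(step)
        used += cost
    return list(reversed(kept))


# A made-up 300-step trajectory; each step counts as 6 tokens by the estimate above.
trajectory = [f"step {i}: called tool crm.update ok" for i in range(300)]
visible = fit_to_budget(trajectory, budget=600)
print(f"judge sees {len(visible)} of {len(trajectory)} steps")
# prints "judge sees 100 of 300 steps"
```

Under these assumptions the first 200 steps never reach the judge, so an error made early in the run cannot affect the verdict.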

The Need for Agentic Judges

To address these limitations, Judgment Labs proposes using agentic judges: judges that can search, verify, and adapt. These judges are designed to handle long-horizon agents more effectively by:

Expanding Context: Agentic judges can break down the agent's trajectory into manageable chunks, allowing them to evaluate each part in detail. This reduces the risk that critical information is missed because of context window limitations.
Verifying Stateful Actions: Unlike LLM judges, agentic judges can interact with external systems to verify state changes. For example, they can check if a lead status was updated correctly in a CRM or if an AWS configuration change was applied as intended.
Adapting to New Scenarios: Agentic judges are more flexible and can adapt to new evaluation scenarios without requiring extensive retraining. This makes them more robust and reliable for evaluating complex agent behaviors.
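
The first two capabilities above can be sketched as follows. This is a simplified illustration under stated assumptions, not Judgment Labs' implementation: the `Step` record, the tool name `crm.update_lead`, the `read_lead_status` callback, and the in-memory `fake_crm` are all hypothetical. A plain loop stands in for the model pass that a real judge would run over each chunk.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class Step:
    tool: str    # e.g. "crm.update_lead" (hypothetical tool name)
    args: dict   # arguments the agent passed to the tool
    result: str  # what the tool reported back to the agent


def chunks(steps: list[Step], size: int) -> Iterator[list[Step]]:
    """Yield the trajectory in fixed-size pieces so each fits one review pass."""
    for start in range(0, len(steps), size):
        yield steps[start:start + size]


def judge_trajectory(
    steps: list[Step],
    read_lead_status: Callable[[str], str],
    chunk_size: int = 50,
) -> list[str]:
    """Return mismatches between claimed and actual CRM state (empty if none)."""
    expected: dict[str, str] = {}
    # Pass 1: walk the whole trajectory chunk by chunk and record the last
    # status the agent claims to have set for each lead.
    for chunk in chunks(steps, chunk_size):
        for step in chunk:
            if step.tool == "crm.update_lead":
                expected[step.args["lead_id"]] = step.args["status"]
    # Pass 2: check each claim against the external system, not the transcript.
    findings: list[str] = []
    for lead, claimed in expected.items():
        actual = read_lead_status(lead)
        if actual != claimed:
            findings.append(f"{lead}: agent set {claimed!r}, CRM shows {actual!r}")
    return findings


# Stand-in for a real CRM: lead-2 was never actually updated.
fake_crm = {"lead-1": "contacted", "lead-2": "new"}
steps = [
    Step("crm.update_lead", {"lead_id": "lead-1", "status": "contacted"}, "ok"),
    Step("email.send", {"to": "a@example.com"}, "ok"),
    Step("crm.update_lead", {"lead_id": "lead-2", "status": "qualified"}, "ok"),
]
findings = judge_trajectory(steps, lambda lead_id: fake_crm[lead_id], chunk_size=2)
print(findings)
# prints ["lead-2: agent set 'qualified', CRM shows 'new'"]
```

Every tool call in the transcript reports "ok", so a judge reading only the trajectory would have no reason to fail this run. The mismatch appears only when the judge queries the system itself.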

Key Takeaways

Traditional LLM Judges Fall Short: They struggle with long trajectories and stateful actions, leading to incomplete and inaccurate evaluations.
Agentic Judges Offer a Solution: By expanding context, verifying stateful actions, and adapting to new scenarios, agentic judges provide more reliable and comprehensive evaluations of long-horizon agents.
Significant Investment Signals Importance: The $32M funding round led by Lightspeed highlights the growing recognition of the need for advanced evaluation methods in AI development.

As the industry moves toward more sophisticated and autonomous AI systems, the shift toward agentic judges matters in practice. More accurate evaluations help teams identify and address agent failures more effectively, which in turn supports more reliable AI solutions and better customer satisfaction.