Multi-Agent Harness Design for Long-Running Autonomous Applications

Tools & Engineering

The Engineer

25 Mar 2026 · 3 min read

This article explores how Anthropic enhanced Claude’s capabilities for long-running autonomous tasks through a novel multi-agent harness design, inspired by GAN principles, to ensure consistent performance over extended periods.

At Anthropic, one of our key focuses has been developing long-running autonomous applications: software that can operate without human intervention over extended periods. This article describes how we improved Claude's performance in frontend design and long-running application development using a multi-agent harness inspired by Generative Adversarial Networks (GANs).

The Challenge

Over the past few months, I've been tackling two interconnected problems: getting Claude to produce high-quality frontend designs and building complete applications autonomously. Our earlier work on frontend design skills and long-running coding agent harnesses showed significant improvements through prompt engineering and harness design. However, both approaches eventually hit performance ceilings.

Breaking Through with Multi-Agent Design

To overcome these limitations, I drew inspiration from GANs, which produce high-quality outputs by pitting a generator against a discriminator that judges its output. In our context, an evaluator plays the discriminator's role, giving a multi-agent system where:

Generator: Produces the frontend designs or code.
Evaluator: Assesses the quality of the output based on predefined criteria.
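
A minimal sketch of how the two roles might be wired together, offered as an illustration rather than the harness described here. The names refine, Review, stub_generate, and stub_evaluate are hypothetical, as are the score threshold and round limit; in a real harness the two callables would wrap model calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Review:
    score: float   # assumed scale: 0.0 (poor) to 1.0 (excellent)
    feedback: str  # critique handed back to the generator


def refine(
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str], Review],
    brief: str,
    threshold: float = 0.8,  # hypothetical passing score
    max_rounds: int = 5,     # hypothetical cap on iterations
) -> Tuple[str, Review, int]:
    """Alternate generation and evaluation until the score clears the threshold."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        draft = generate(brief, feedback)
        review = evaluate(draft)
        if review.score >= threshold:
            break
        feedback = review.feedback  # carry the critique into the next round
    return draft, review, round_no


def stub_generate(brief: str, feedback: Optional[str]) -> str:
    # Stand-in for a model call: marks the draft as revised once feedback arrives.
    return brief if feedback is None else f"{brief} [revised: {feedback}]"


def stub_evaluate(draft: str) -> Review:
    # Stand-in for a grading call: only revised drafts pass.
    revised = "revised" in draft
    return Review(0.9 if revised else 0.4, "" if revised else "increase contrast")
```

Calling refine(stub_generate, stub_evaluate, "landing page") runs two rounds with these stubs: the first draft is rejected with a critique, and the second incorporates it and passes.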

Frontend Design

For frontend design, the challenge was to turn subjective judgments into concrete, gradable terms. We developed a set of criteria that an evaluator agent could apply consistently to aspects like layout, color harmony, and user experience. The evaluator then provided feedback to the generator, helping it refine its designs over multiple iterations.

Criteria Development: We defined metrics for visual appeal, usability, and responsiveness.
Feedback Loop: The evaluator's scores were fed back to the generator, allowing it to correct specific weaknesses in the next iteration.
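
One way to make such criteria gradable is a weighted rubric. The sketch below is an assumption-laden illustration: the three criterion names come from the list above, but the weights, the 1 to 5 scale, the passing mark, and the helper names (Criterion, aggregate, feedback_for) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float     # hypothetical relative importance
    description: str  # what the grader is asked to look for


# Criterion names follow the article; weights and wording are assumptions.
RUBRIC = [
    Criterion("visual_appeal", 0.4, "layout and color harmony feel deliberate"),
    Criterion("usability", 0.4, "primary actions are easy to find and use"),
    Criterion("responsiveness", 0.2, "layout adapts to narrow and wide screens"),
]


def aggregate(scores: Dict[str, int], rubric: List[Criterion] = RUBRIC, scale: int = 5) -> float:
    """Weighted mean of per-criterion scores, normalized to the range 0..1."""
    missing = [c.name for c in rubric if c.name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * scores[c.name] / scale for c in rubric) / total_weight


def feedback_for(scores: Dict[str, int], rubric: List[Criterion] = RUBRIC,
                 passing: int = 4, scale: int = 5) -> List[str]:
    """One feedback line per criterion that scored below the passing mark."""
    return [
        f"{c.name}: scored {scores[c.name]}/{scale}; expected: {c.description}"
        for c in rubric
        if scores[c.name] < passing
    ]
```

With scores of 5, 3, and 4 for the three criteria, aggregate returns roughly 0.8 and feedback_for returns a single line about usability, which is the kind of targeted critique the generator can act on.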

Long-Running Autonomous Coding

Applying this multi-agent approach to long-running autonomous coding involved carrying over lessons from our earlier harness work:

Decomposition: Breaking down the build process into manageable chunks.
Structured Artifacts: Using artifacts to maintain context between sessions.
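
As an illustration of a structured artifact, the sketch below keeps a task list and notes in a JSON file that each new session reads before doing any work. The file layout, field names, and helpers (save_progress, load_progress, next_task) are hypothetical and not the format used in our harness.

```python
import json
from pathlib import Path
from typing import Optional


def save_progress(path: Path, state: dict) -> None:
    """Write to a temporary file, then swap it in, so an interrupted
    session is unlikely to leave a half-written artifact behind."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)


def load_progress(path: Path) -> dict:
    """Return the saved state, or an empty one on the very first session."""
    if not path.exists():
        return {"tasks": [], "notes": []}
    return json.loads(path.read_text())


def next_task(state: dict) -> Optional[dict]:
    """First task still marked pending, or None when the build is finished."""
    return next((t for t in state["tasks"] if t["status"] == "pending"), None)
```

A fresh session would call load_progress, pick up next_task, and call save_progress after each completed task, so context lives in the file and not in any one session's memory.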

The final architecture consisted of three agents:

Planner: Breaks down the product spec into a task list.
Generator: Implements the tasks, one feature at a time.
Evaluator: Assesses the quality and provides feedback.
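
The control flow between the three agents could look roughly like the sketch below. It is a simplified illustration under stated assumptions: run_build, the three stub functions, and the per-task attempt limit are hypothetical, and each stub stands in for what would be a model-backed agent.

```python
from typing import Callable, Dict, List, Tuple


def run_build(
    spec: str,
    plan: Callable[[str], List[str]],
    generate: Callable[[str, str], str],               # (task, feedback) -> work
    evaluate: Callable[[str, str], Tuple[bool, str]],  # (task, work) -> (passed, feedback)
    max_attempts: int = 3,                             # hypothetical retry limit per task
) -> List[Dict[str, object]]:
    """Plan once, then implement and grade one task at a time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    results = []
    for task in plan(spec):
        feedback = ""
        for attempt in range(1, max_attempts + 1):
            work = generate(task, feedback)
            passed, feedback = evaluate(task, work)
            if passed:
                break
        results.append({"task": task, "passed": passed, "attempts": attempt})
    return results


def stub_plan(spec: str) -> List[str]:
    # Stand-in planner: treats each semicolon-separated phrase as one task.
    return [t.strip() for t in spec.split(";") if t.strip()]


def stub_generate(task: str, feedback: str) -> str:
    # Stand-in generator: appends a fix marker when it receives feedback.
    return f"impl({task})" + (" +fix" if feedback else "")


def stub_evaluate(task: str, work: str) -> Tuple[bool, str]:
    # Stand-in evaluator: the "search" task only passes after one revision.
    passed = "search" not in task or "+fix" in work
    return passed, "" if passed else "handle empty query"
```

Running run_build("login; search", stub_plan, stub_generate, stub_evaluate) yields two results: login passes on the first attempt, while search needs a second attempt after the evaluator's feedback.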

Why Naive Implementations Fall Short

Previous approaches to long-running autonomous coding often fell short due to issues like context loss and drift over time. In an earlier experiment, we used an initializer agent to decompose a product spec into a task list, followed by a coding agent that implemented tasks one at a time. While this method improved performance, it still struggled with complex tasks.

Context Loss: Agents often lost track of the overall project context over multiple sessions.
Drift Over Time: The quality and direction of the code could deviate from the initial spec.

The Multi-Agent Solution

By introducing the evaluator agent, we addressed these issues:

Continuous Feedback: The evaluator provided ongoing assessments, ensuring the generator stayed on track.
Context Maintenance: Structured artifacts helped maintain context across sessions, reducing drift.

The result was a robust system capable of producing rich full-stack applications over multi-hour autonomous coding sessions. This approach not only improved the quality of the output but also increased reliability and consistency.

Conclusion

Our multi-agent harness design has significantly enhanced Claude's capabilities in both frontend design and long-running application development. By borrowing the adversarial generator-evaluator structure of GANs and pairing it with structured feedback, we've created a more reliable and consistent system for autonomous software engineering.