Realtime Evaluation: The Key to Robust Voice Systems

Tools & Engineering

The Engineer

27 Jan 2026 · 3 min read

Realtime evaluation helps voice system developers spot issues early, speeding up the path from demo to deployment by focusing on incremental complexity and thorough testing.

When it comes to voice systems, the gap between a demo that "seems fine" and a production-ready solution is often bridged by thorough evaluation. This guide from OpenAI's Cookbook provides a structured approach to evaluating voice systems, emphasizing the importance of building complexity incrementally: start simple (Crawl), add realism (Walk), and finally test multi-turn interactions (Run). By investing in robust evaluations, teams can ship to production 5–10× faster, because they can identify failures, understand their causes, and fix them with confidence.

Realtime Eval Harness Code

For those who want to dive into the code, OpenAI provides a GitHub repository with complete reference harnesses for each stage of evaluation:

GitHub repo path: openai-cookbook/examples/evals/realtime_evals
- Crawl Harness (Single-turn Replay): crawl_harness
- Walk Harness (Saved Audio Replay): walk_harness
- Run Harness (Model-simulated Multi-turn): run_harness

These harnesses are designed to be adaptable, allowing you to point Codex at the relevant harness and ask it to tailor the code to your specific dataset and graders.

Part I: Foundations

1) Why Realtime Eval?

Realtime evaluation is crucial for ensuring that voice systems perform consistently in real-world scenarios. Here’s why:

- Identifying Failures Early: Realtime evals help you catch issues early, before they become critical problems in production.
- Improving Robustness: By testing under realistic conditions, you can build a more robust system that handles edge cases and unexpected inputs.
- Accelerating Development: Teams that invest in evals can move faster because they have clear insights into what’s working and what isn’t. This transparency allows for targeted improvements, reducing the time spent on debugging and rework.

Part II: The Crawl-Walk-Run Approach

1) Crawl: Single-turn Replay

The first step is to start with single-turn interactions. This involves replaying recorded user inputs to see how the system responds in a controlled environment. Key considerations include:

- Dataset: Use a diverse set of inputs to cover various scenarios and edge cases.
- Graders: Define clear criteria for what constitutes a successful response. Graders can be human evaluators or automated scripts.
- Harness: The Crawl harness provides a framework for running these evaluations efficiently.
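
As a rough illustration of how the dataset, grader, and harness fit together, here is a minimal single-turn replay sketch. It is not the Cookbook's crawl_harness: the Case fields, the toy_system stand-in, and the tool names are all hypothetical, and text strings stand in for recorded user turns so the example runs without audio or API access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    case_id: str
    user_input: str     # text stand-in for one recorded user turn (assumption)
    expected_tool: str  # hypothetical label the grader checks against


def grade_tool_choice(case: Case, response: dict) -> bool:
    """Automated grader: pass if the system picked the expected tool."""
    return response.get("tool") == case.expected_tool


def run_crawl(
    cases: List[Case],
    system: Callable[[str], dict],
    grader: Callable[[Case, dict], bool],
) -> Dict[str, bool]:
    """Replay each single-turn case once and record pass/fail per case."""
    results = {}
    for case in cases:
        response = system(case.user_input)
        results[case.case_id] = grader(case, response)
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


def toy_system(user_input: str) -> dict:
    """Hypothetical stand-in for the voice system under test."""
    text = user_input.lower()
    if "refund" in text:
        return {"tool": "issue_refund"}
    if "order" in text:
        return {"tool": "lookup_order"}
    return {"tool": "none"}


CASES = [
    Case("c1", "Where is my order?", "lookup_order"),
    Case("c2", "I want a refund for this", "issue_refund"),
    Case("c3", "Cancel my subscription", "cancel_subscription"),
]

if __name__ == "__main__":
    results = run_crawl(CASES, toy_system, grade_tool_choice)
    print(results, f"pass rate: {pass_rate(results):.2f}")
```

Keeping the grader as a plain function that takes a case and a response makes it easy to swap in a human review step or a model-based grader later without touching the replay loop.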

2) Walk: Saved Audio Replay

Once the system is performing well with single-turn interactions, it’s time to add more realism by using saved audio files. This stage involves:

- Realistic Input: Use actual user audio to simulate real-world conditions.
- Complexity: Introduce more complex scenarios, such as background noise or varying accents.
- Evaluation: Continue using your graders to assess the system's performance.
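
One way to introduce background noise in a controlled manner is to mix a noise recording into clean speech at a chosen signal-to-noise ratio (SNR). The sketch below is an assumption about how that could be done, not something the guide prescribes; real recordings of noisy calls are an equally valid source. It uses plain Python lists of float samples, and the function names are illustrative.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a list of float samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with noise scaled to hit the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    p_speech, p_noise = power(speech), power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR_dB = 10 * log10(p_speech / (scale^2 * p_noise)), solved for scale.
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """Recover the SNR of a mix by subtracting the known clean speech."""
    residual = [m - s for s, m in zip(speech, mixed)]
    return 10 * math.log10(power(speech) / power(residual))


def sine_wave(freq_hz, sample_rate, n_samples):
    """Synthetic tone used here as a stand-in for recorded speech."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n_samples)]


rng = random.Random(0)  # fixed seed so the demo is repeatable
SPEECH = sine_wave(440, 16000, 1600)
NOISE = [rng.gauss(0.0, 1.0) for _ in range(1600)]

if __name__ == "__main__":
    for snr in (20, 10, 0):
        mixed = mix_at_snr(SPEECH, NOISE, snr)
        print(snr, round(measured_snr_db(SPEECH, mixed), 3))
```

Sweeping the SNR from clean to very noisy while keeping the same graders shows where performance starts to degrade, rather than just whether it does.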

3) Run: Model-simulated Multi-turn

The final stage is to test multi-turn interactions, which are essential for voice systems that need to maintain context over multiple exchanges. This involves:

- Multi-turn Scenarios: Simulate extended conversations with the system.
- Context Management: Ensure the system can handle and remember context across turns.
- Advanced Graders: Use more sophisticated graders to evaluate the coherence and consistency of multi-turn interactions.
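
The shape of a multi-turn run can be sketched as a loop that alternates a simulated user with the system under test, then grades the whole transcript. In the Run stage the user side is model-simulated; in this sketch a fixed script stands in for that model so the example runs offline. ToyAssistant, the goal format, and the grader are all hypothetical.

```python
def simulated_user(goal, turn):
    """Stand-in for a model-simulated user: returns scripted turns, then None."""
    script = goal["turns"]
    return script[turn] if turn < len(script) else None


class ToyAssistant:
    """Hypothetical system under test that keeps simple state across turns."""

    def __init__(self):
        self.memory = {}

    def reply(self, user_text):
        text = user_text.lower()
        if text.startswith("my name is "):
            self.memory["name"] = user_text[len("my name is "):].strip(" .")
            return "Nice to meet you."
        if "what is my name" in text:
            name = self.memory.get("name")
            return f"Your name is {name}." if name else "I don't know your name."
        return "Okay."


def run_conversation(goal, assistant, max_turns=10):
    """Alternate user and assistant turns until the user script ends."""
    transcript = []
    for turn in range(max_turns):
        user_text = simulated_user(goal, turn)
        if user_text is None:
            break
        transcript.append(("user", user_text))
        transcript.append(("assistant", assistant.reply(user_text)))
    return transcript


def grade_context(transcript, required_phrase):
    """Pass if the final assistant turn still reflects earlier context."""
    assistant_turns = [text for role, text in transcript if role == "assistant"]
    if not assistant_turns:
        return False
    return required_phrase.lower() in assistant_turns[-1].lower()


GOAL = {"turns": ["My name is Ada.", "I'd like to book a table.", "What is my name?"]}

if __name__ == "__main__":
    transcript = run_conversation(GOAL, ToyAssistant())
    for role, text in transcript:
        print(f"{role}: {text}")
    print("context retained:", grade_context(transcript, "Ada"))
```

A substring check like this is the simplest possible context grader; judging coherence and consistency across a full conversation would typically need a rubric-based or model-based grader in its place.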

Part III: Building a Production Flywheel

To maintain and improve your voice system over time, it’s essential to create a production flywheel: a loop in which failures observed in production feed back into the evaluation suite. This involves:

- Continuous Monitoring: Regularly monitor the system in production to catch new issues.
- Feedback Loop: Use real failures to generate new tests and improve the evaluation process.
- Iterative Improvement: Continuously refine the system based on feedback from both users and evaluations.
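
The feedback loop can be sketched as a small function that turns logged production failures into regression cases and merges them into the existing eval set without duplicates. The log fields used here (transcript, failure_tag) are assumptions for illustration; a real pipeline would use whatever your monitoring system records.

```python
import hashlib


def failure_to_case(log):
    """Turn one production failure record into a regression test case."""
    transcript = log["transcript"].strip()
    # Hash the normalized transcript so repeated failures map to one case.
    digest = hashlib.sha256(transcript.lower().encode("utf-8")).hexdigest()
    return {
        "case_id": digest[:12],
        "user_input": transcript,
        "failure_tag": log["failure_tag"],
    }


def merge_cases(existing, new_logs):
    """Add cases built from new failure logs, skipping ones already present."""
    by_id = {case["case_id"]: case for case in existing}
    for log in new_logs:
        case = failure_to_case(log)
        by_id.setdefault(case["case_id"], case)
    return list(by_id.values())


LOGS = [
    {"transcript": "Cancel my order", "failure_tag": "wrong_tool"},
    {"transcript": "cancel my order ", "failure_tag": "wrong_tool"},
    {"transcript": "Change my address", "failure_tag": "missed_intent"},
]

if __name__ == "__main__":
    suite = merge_cases([], LOGS)
    for case in suite:
        print(case["case_id"], case["failure_tag"], case["user_input"])
```

Because case IDs are derived from content, rerunning the merge on the same logs leaves the suite unchanged, which keeps the eval set from growing with every repeat of a known failure.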