OpenAI Unveils gpt-realtime and Realtime API Enhancements for Robust Voice Agents

Products & Applications

The Engineer

29 Aug 2025 · 4 min read

OpenAI's new Realtime API and `gpt-realtime` model offer unprecedented audio clarity and natural conversation flow, revolutionizing how developers build robust voice agents for enterprises.

OpenAI has announced the general availability of its Realtime API, accompanied by a suite of new features designed to empower developers and enterprises in building reliable, production-ready voice agents. Among these updates is the introduction of gpt-realtime, OpenAI's most advanced speech-to-speech model yet. This release marks significant improvements in audio quality, instruction following, and natural language processing, making it easier to deploy sophisticated voice applications.

What’s New with the Realtime API?

The Realtime API now supports several key features that enhance its functionality and flexibility:

Remote MCP Server Support: The API can now interface with remote Model Control Protocol (MCP) servers. This allows developers to manage and control models more efficiently, especially in distributed environments.
Image Input: In addition to text and audio inputs, the Realtime API now supports image data. This opens up new possibilities for applications that require visual context, such as virtual assistants that can describe images or provide detailed information about visual content.
SIP Phone Calling Support: The API now includes Session Initiation Protocol (SIP) support, enabling voice agents to make and receive phone calls directly. This is particularly useful for customer service and telephony applications.

Introducing gpt-realtime

gpt-realtime represents a significant leap forward in speech-to-speech technology. Here are the key improvements:

Natural-Sounding Speech: The model produces more natural and expressive speech, making interactions feel more human-like.
Improved Instruction Following: gpt-realtime excels at following complex instructions and executing multi-step tasks with precision. It can handle detailed commands, such as narrowing down listings based on specific criteria or guiding users through financial calculations.
Enhanced Tool Integration: The model can seamlessly call external tools and APIs, allowing it to perform a wide range of functions. For example, it can read disclaimer scripts word-for-word, repeat back alphanumeric sequences, and switch between languages mid-sentence without losing context.
New Voices: OpenAI is introducing two new voices, Cedar and Marin, which are available exclusively through the Realtime API.

Real-World Applications

The enhancements in gpt-realtime and the Realtime API have already been put to use by early adopters. For instance, Zillow, a leading real estate platform, has integrated these technologies into their services:

“The new speech-to-speech model in OpenAI's Realtime API shows stronger reasoning and more natural speech-allowing it to handle complex, multi-step requests like narrowing listings by lifestyle needs or guiding affordability discussions with tools like our BuyAbility score. This could make searching for a home on Zillow or exploring financing options feel as natural as a conversation with a friend, helping simplify decisions like buying, selling, and renting a home.”

– Josh Weisberg, Head of AI at Zillow

Technical Details

Model Architecture: gpt-realtime is built on an advanced transformer architecture optimized for real-time processing. This allows it to handle audio streams efficiently, reducing latency and improving response times.
Training Data: The model was trained on a diverse dataset that includes conversational data, domain-specific information, and multilingual content. This ensures that it can perform well across various use cases and languages.
Latency and Reliability: Unlike traditional speech-to-speech pipelines that involve multiple models (speech-to-text and text-to-speech), gpt-realtime processes and generates audio directly through a single model. This streamlined approach not only reduces latency but also preserves the nuance in speech, leading to more natural and expressive responses.

Conclusion

The introduction of gpt-realtime and the updated Realtime API by OpenAI marks a significant step forward in the development of voice agents. These enhancements offer developers and enterprises the tools they need to build sophisticated, reliable, and high-quality voice applications that can handle complex tasks with ease. Whether it’s customer support, personal assistance, or education, gpt-realtime is set to revolutionize how we interact with voice technology.