ChatGPT Images: Scaling to 100 Million New Users in a Week

Products & Applications

The Engineer

16 May 2025 · 4 min read

ChatGPT Images added 100 million new users in its first week. OpenAI's engineering team describes the scaling problems that followed and the strategies that kept the service running.

ChatGPT Images, the latest feature from OpenAI, has been a massive hit, attracting 100 million new users and generating 700 million images in its first week. This unprecedented growth posed significant scalability challenges for the engineering team. I sat down with Sulman Choudhry (Head of Engineering, ChatGPT) and Srinivas Narayanan (VP of Engineering, OpenAI) to understand how they managed this launch.

Launch: Handling Unexpected Load

From day one, the load on ChatGPT Images was far higher than anticipated. The feature went viral in India, with up to 1 million new users signing up per hour at peak times. Despite these challenges, the team avoided major outages by implementing robust load testing and isolation strategies.

Initial Load: The system saw a surge of 100 million new users within the first week.
Viral Growth in India: A significant portion of this growth came from India, where the feature gained rapid traction.
Peak Sign-ups: At its peak, the system was handling up to 1 million new user sign-ups per hour.
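
To put these figures in per-second terms, a quick back-of-the-envelope calculation helps. This is only a sketch: the weekly average assumes image generation was spread evenly over seven days, which a viral launch certainly was not, and the per-user figure assumes every image came from a new user, which the article does not state.

```python
# Back-of-the-envelope rates derived from the launch figures quoted above.
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_WEEK = 7 * 24 * SECONDS_PER_HOUR

peak_signups_per_hour = 1_000_000
images_first_week = 700_000_000
new_users_first_week = 100_000_000

# Peak sign-up rate, expressed per second.
peak_signups_per_second = peak_signups_per_hour / SECONDS_PER_HOUR

# Average image rate, assuming (unrealistically) a flat load all week.
avg_images_per_second = images_first_week / SECONDS_PER_WEEK

# Simplification: attributes every image to a new user.
avg_images_per_new_user = images_first_week / new_users_first_week

print(f"Peak sign-ups: ~{peak_signups_per_second:.0f} per second")
print(f"Average images: ~{avg_images_per_second:.0f} per second")
print(f"Images per new user: ~{avg_images_per_new_user:.0f}")
```

That works out to roughly 278 sign-ups per second at peak and about 1,157 images per second on average, so the real peaks for image generation were likely well above the average.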

How ChatGPT Images Works

At a high level, ChatGPT Images is built from a few key components:

Image Tokens: Similar to text tokens in language models, image tokens are used to represent parts of an image.
Decoder: A decoder model generates the final image from these tokens.
Multiple Passes: The generation process involves multiple passes to refine the image quality.
Tech Stack:
- Python: Used for most of the backend logic and data processing.
- FastAPI: Provides a fast, modern web framework for building APIs.
- C: Used for performance-critical components.
- Temporal: A distributed workflow orchestration system that helps manage complex tasks.
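
The flow from tokens to decoder to refinement passes can be pictured with a toy sketch. Everything here is hypothetical and is not OpenAI's implementation: the function names, the 1,024-entry vocabulary, the random "model", and the neighbour-averaging used as a stand-in for a refinement pass are all assumptions chosen only to show the shape of the pipeline.

```python
import random


def generate_image_tokens(prompt: str, num_tokens: int, vocab_size: int = 1024) -> list[int]:
    """Hypothetical stand-in for the model that emits image tokens."""
    rng = random.Random(prompt)  # seeded by the prompt so the toy is repeatable
    return [rng.randrange(vocab_size) for _ in range(num_tokens)]


def decode(tokens: list[int], vocab_size: int = 1024) -> list[float]:
    """Hypothetical decoder: maps each token to one grayscale value in [0, 1)."""
    return [token / vocab_size for token in tokens]


def refine(pixels: list[float]) -> list[float]:
    """One placeholder refinement pass: average each value with its neighbours."""
    refined = []
    for i in range(len(pixels)):
        window = pixels[max(0, i - 1): i + 2]
        refined.append(sum(window) / len(window))
    return refined


def generate_image(prompt: str, num_tokens: int = 16, passes: int = 3) -> list[float]:
    """Tokens -> decoder -> several refinement passes."""
    tokens = generate_image_tokens(prompt, num_tokens)
    image = decode(tokens)
    for _ in range(passes):
        image = refine(image)
    return image
```

Calling `generate_image("a cat")` returns a list of 16 values. A real system would produce a full image and use learned models at every stage.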

Changing the Engine While Speeding on the Highway

When the system started struggling under the rising load, the team had to make significant changes on the fly. They rewrote the image generation process from synchronous to asynchronous, without users noticing any disruption:

Synchronous to Asynchronous: The initial implementation was synchronous, which became a bottleneck as the load increased. Switching to an asynchronous approach allowed the system to handle more requests efficiently.
User Experience: Despite these changes, users did not experience any noticeable downtime or performance issues.
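
A minimal sketch of the synchronous-to-asynchronous idea, using only Python's standard `asyncio` library. The article does not describe the actual implementation, so the `ImageJobQueue` class, its method names, and the `sleep` that stands in for image generation are all assumptions. The point is that a request gets a job ID back immediately, while background workers do the slow work.

```python
import asyncio
import itertools


class ImageJobQueue:
    """Hypothetical queue: accepts requests at once, renders them in the background."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._ids = itertools.count(1)
        self.status: dict[int, str] = {}
        self.results: dict[int, str] = {}

    def submit(self, prompt: str) -> int:
        """Enqueue a request and return a job ID without waiting for the image."""
        job_id = next(self._ids)
        self.status[job_id] = "queued"
        self._queue.put_nowait((job_id, prompt))
        return job_id

    async def worker(self) -> None:
        """Process jobs until cancelled."""
        while True:
            job_id, prompt = await self._queue.get()
            self.status[job_id] = "running"
            await asyncio.sleep(0.01)  # placeholder for the slow generation call
            self.results[job_id] = f"image for {prompt!r}"
            self.status[job_id] = "done"
            self._queue.task_done()

    async def drain(self) -> None:
        """Wait until every queued job has been processed."""
        await self._queue.join()


async def demo() -> dict[int, str]:
    jobs = ImageJobQueue()
    workers = [asyncio.create_task(jobs.worker()) for _ in range(2)]
    ids = [jobs.submit(p) for p in ("a cat", "a dog", "a fox")]
    # All three submissions returned before any image was generated.
    assert all(jobs.status[i] == "queued" for i in ids)
    await jobs.drain()
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return jobs.results
```

Running `asyncio.run(demo())` returns the finished results keyed by job ID. In a synchronous design, each request would instead hold a connection open for the full generation time, which limits how many requests a server can accept under load.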

Reliability Challenges

The massive load on ChatGPT Images overwhelmed other OpenAI systems, but major outages were avoided through extensive preparation:

System Isolation: Months of work went into isolating different components to prevent one system from affecting others.
Load Testing: Regular load testing helped identify and mitigate potential bottlenecks.
Monitoring and Alerting: Continuous monitoring and alerting systems ensured that any issues were quickly detected and addressed.
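
One common way to isolate systems is the bulkhead pattern, which gives each feature its own concurrency budget so a spike in one cannot starve the others. The sketch below is an illustration of that general pattern, not a description of OpenAI's setup: the `Bulkhead` class, the limits, and the `shared_dependency` function are hypothetical.

```python
import asyncio


class Bulkhead:
    """Hypothetical bulkhead: caps concurrent calls for one feature, rejects overflow."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._max = max_concurrent
        self._in_flight = 0
        self.rejected = 0

    async def run(self, coro_fn):
        # No lock needed: asyncio runs one task at a time and there is
        # no await between the capacity check and the increment.
        if self._in_flight >= self._max:
            self.rejected += 1
            raise RuntimeError(f"{self.name}: at capacity, shedding load")
        self._in_flight += 1
        try:
            return await coro_fn()
        finally:
            self._in_flight -= 1


async def shared_dependency() -> str:
    """Placeholder for a backend that several features rely on."""
    await asyncio.sleep(0.01)
    return "ok"


async def demo() -> tuple[int, int, str]:
    images = Bulkhead("images", max_concurrent=2)
    chat = Bulkhead("chat", max_concurrent=2)
    # A burst of 10 image requests arrives alongside a single chat request.
    image_calls = [images.run(shared_dependency) for _ in range(10)]
    outcomes = await asyncio.gather(
        *image_calls, chat.run(shared_dependency), return_exceptions=True
    )
    image_ok = sum(1 for outcome in outcomes[:-1] if outcome == "ok")
    return image_ok, images.rejected, outcomes[-1]
```

In this toy run the image burst is capped at its own budget and the overflow is rejected, while the chat request still succeeds because it draws on a separate budget.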

Extra Engineering Challenges

The team faced several additional challenges, including:

Third-Party Dependencies: Managing dependencies on external services required careful coordination and contingency planning.
Vertical Growth Spike: The unprecedented growth spike was challenging to predict and manage.
User Behavior: New users added unexpected load by remaining active for longer periods, which the team had to adapt to quickly.

From “GPU Constrained” to “Everything Constrained”

A year ago, ChatGPT's primary bottleneck was GPU availability. However, with this bottleneck addressed, new constraints emerged:

Resource Management: As the system scaled, managing all resources (not just GPUs) became a critical challenge.
Bottleneck Shift: The team had to continuously identify and address new bottlenecks as they arose.
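
The shift in bottlenecks can be made concrete with a small capacity model: the sustainable request rate is set by whichever resource runs out first, so relieving one constraint simply exposes the next. The resource names and all numbers below are made up for illustration and do not come from OpenAI.

```python
def binding_constraint(
    capacity: dict[str, float], cost_per_request: dict[str, float]
) -> tuple[str, float]:
    """Return the resource that runs out first and the request rate it allows."""
    limits = {name: capacity[name] / cost_per_request[name] for name in capacity}
    bottleneck = min(limits, key=limits.get)
    return bottleneck, limits[bottleneck]


# Hypothetical per-request cost of each resource, in arbitrary units.
cost = {"gpu": 2.0, "database": 0.5, "network": 0.25}

# Hypothetical capacities before and after adding GPUs.
before = {"gpu": 400.0, "database": 500.0, "network": 100.0}
after = {"gpu": 4000.0, "database": 500.0, "network": 100.0}

print(binding_constraint(before, cost))  # GPU is the limit
print(binding_constraint(after, cost))   # the limit moves to the network
```

With these illustrative numbers, a tenfold increase in GPU capacity only doubles the sustainable rate, because the network becomes the limiting resource.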

Conclusion

The launch of ChatGPT Images was a significant achievement for OpenAI, demonstrating the team's ability to scale under extreme load. Through load testing, system isolation, and on-the-fly changes such as the move to asynchronous generation, they served 100 million new users without major outages.