
Share
This ambitious test challenges GPT-5.3-Codex to code continuously for 25 hours without human intervention, revealing its potential as a revolutionary tool for complex software development projects.
In September 2025, OpenAI introduced GPT-5-Codex as the first version of GPT-5 optimized for agentic coding. By December 2025, they launched GPT-5.2, which marked a significant milestone in the reliability of autonomous coding agents. The key improvement was the model's ability to reliably follow instructions over extended periods, a capability that has profound implications for developers and software projects.
To push these boundaries, I conducted an experiment where GPT-5.3-Codex was given a blank repository, full access, and a single task: build a design tool from scratch. The model ran uninterrupted for about 25 hours, using approximately 13 million tokens and generating around 30,000 lines of code. This experiment wasn't a production rollout but served to evaluate Codex's performance on critical aspects of long-horizon work:
To better understand the dynamics of this long-run session, I asked Codex to generate a summary page for the session data. Here’s a view of the CLI session stats and token usage:
These screenshots are particularly useful because they highlight a core shift in agentic coding: it's increasingly about time horizon rather than just one-shot intelligence.

The improvements in GPT-5.3-Codex aren't just about getting smarter; the practical change is that agents can stay coherent for longer, complete larger chunks of work end-to-end, and recover from errors without losing the thread. This shift has significant implications for software development:
METR’s work on time-horizon benchmarks provides a helpful framework for understanding this trend. According to their research, the length of software tasks that frontier agents can complete with 50% and 80% reliability has been increasing rapidly, with a rough doubling time of about 7 months. This rapid improvement suggests that autonomous coding agents are becoming more viable for real-world applications.
The experiment with GPT-5.3-Codex demonstrates the growing capabilities of agentic coding models to handle long-horizon tasks. As these models continue to improve, they have the potential to revolutionize software development by reducing human intervention and increasing productivity. For developers, this means more efficient and reliable tools for building complex applications.
Tags
Original Sources
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
24 February 2026
133 articles
Related Articles

Smarter Engagement for Stronger Growth: How Payers Can Leverage AI to Do More with Less
Products & Applications · 3 min

Penn Medicine and K Health Deploy AI Clinical Agents to Enhance Patient Care
Products & Applications · 3 min

Wheel and b.well Partner to Build Turnkey AI-First Virtual Care Infrastructure
Products & Applications · 3 min
Related Articles

Smarter Engagement for Stronger Growth: How Payers Can Leverage AI to Do More with Less
Products & Applications · 3 min

Penn Medicine and K Health Deploy AI Clinical Agents to Enhance Patient Care
Products & Applications · 3 min

Wheel and b.well Partner to Build Turnkey AI-First Virtual Care Infrastructure
Products & Applications · 3 min
More Stories