Exploring Long-Horizon Tasks with GPT-5.3-Codex: A 25-Hour Coding Sprint

The Engineer

24 Feb 2026

This test has GPT-5.3-Codex code continuously for about 25 hours without human intervention, to see how well it holds up on a complex software development project.

In September 2025, OpenAI introduced GPT-5-Codex as the first version of GPT-5 optimized for agentic coding. In December 2025, it launched GPT-5.2, which marked a significant milestone in the reliability of autonomous coding agents. The key improvement was the model's ability to follow instructions reliably over extended periods, which is what makes long, unattended coding sessions practical for developers.

Testing Codex’s Long-Horizon Capabilities

To push these boundaries, I conducted an experiment where GPT-5.3-Codex was given a blank repository, full access, and a single task: build a design tool from scratch. The model ran uninterrupted for about 25 hours, using approximately 13 million tokens and generating around 30,000 lines of code. This experiment wasn't a production rollout but served to evaluate Codex's performance on critical aspects of long-horizon work:

Following the Spec: Codex maintained coherence and adherence to the initial specifications throughout the session.
Staying on Task: The model stayed focused on the task at hand, demonstrating the ability to manage complex projects over extended periods.
Running Verification: It performed ongoing verification checks to ensure the code met quality standards.
Repairing Failures: Codex was capable of identifying and fixing errors without losing track of the overall project.
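
The verification and repair behavior described above can be pictured as a simple loop. The sketch below is a hypothetical illustration, not how Codex is implemented: `verify` and `repair` are placeholder callables (for example, a test runner and a fix step), and `max_rounds` is an assumed safety limit.

```python
from typing import Callable, List


def verify_and_repair(
    verify: Callable[[], List[str]],
    repair: Callable[[List[str]], None],
    max_rounds: int = 5,
) -> bool:
    """Run checks, hand failures to a repair step, and repeat until clean.

    `verify` returns a list of failure messages (empty means all checks pass).
    `repair` receives those messages and attempts a fix.
    Both are hypothetical stand-ins, not part of any Codex API.
    """
    for _ in range(max_rounds):
        failures = verify()
        if not failures:
            return True  # every check passed
        repair(failures)
    # One last check after the final repair attempt.
    return not verify()


if __name__ == "__main__":
    # Toy example: two "bugs", each repair round fixes one.
    state = {"bugs": 2}
    ok = verify_and_repair(
        verify=lambda: ["failing check"] * state["bugs"],
        repair=lambda failures: state.update(bugs=state["bugs"] - 1),
    )
    print("all checks pass:", ok)
```

The cap on rounds reflects one possible design choice: a long-running agent needs some way to stop retrying a fix that is not converging.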

What a Long-Run Codex Session Looks Like

To better understand the dynamics of this long-run session, I asked Codex to generate a summary page for the session data. Here’s a view of the CLI session stats and token usage:

Session Duration: about 25 hours
Tokens Used: approximately 13 million
Lines of Code Generated: around 30,000

These stats are particularly useful because they highlight a core shift in agentic coding: it's increasingly about time horizon rather than just one-shot intelligence.
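
Taking the reported session figures at face value, a quick back-of-the-envelope calculation gives a sense of the average rates involved. This is only a rough sketch: the inputs are the approximate numbers quoted above, the function name is made up for illustration, and "tokens used" presumably covers more than emitted code, so tokens per line is an overall ratio rather than a generation cost.

```python
def session_rates(hours: float, tokens: float, lines_of_code: float) -> dict:
    """Average rates for a session, from its approximate totals."""
    return {
        "tokens_per_hour": tokens / hours,
        "lines_per_hour": lines_of_code / hours,
        "tokens_per_line": tokens / lines_of_code,
    }


if __name__ == "__main__":
    # Approximate figures reported for the 25-hour run.
    rates = session_rates(hours=25, tokens=13_000_000, lines_of_code=30_000)
    for name, value in rates.items():
        print(f"{name}: {value:,.0f}")
```

With these inputs the averages come out to roughly 520,000 tokens and 1,200 lines of code per hour, or about 433 tokens per line.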

The Real Shift is Time Horizon

The improvements in GPT-5.3-Codex aren't just about getting smarter; the practical change is that agents can stay coherent for longer, complete larger chunks of work end-to-end, and recover from errors without losing the thread. This shift has significant implications for software development:

Project Management: Codex can handle more complex and extended tasks, reducing the need for constant human oversight.
Error Handling: The model's ability to self-correct and maintain coherence over long periods means fewer interruptions and higher overall productivity.
Quality Assurance: Continuous verification checks ensure that the code meets high standards throughout the development process.

Time-Horizon Benchmarks

METR’s work on time-horizon benchmarks provides a helpful framework for understanding this trend. According to their research, the length of software tasks that frontier agents can complete with 50% and 80% reliability has been increasing rapidly, with a rough doubling time of about 7 months. This rapid improvement suggests that autonomous coding agents are becoming more viable for real-world applications.
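
To make the doubling claim concrete, the sketch below projects a task-length horizon forward under a constant doubling time of about 7 months. The starting horizon of 1 hour is a hypothetical placeholder, not a METR measurement, and the function name is invented for illustration; a steady exponential trend is itself an assumption.

```python
def projected_horizon(
    current_hours: float,
    months_ahead: float,
    doubling_months: float = 7.0,
) -> float:
    """Horizon after `months_ahead`, assuming it doubles every `doubling_months`."""
    return current_hours * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    start = 1.0  # hypothetical starting horizon in hours
    for months in (0, 7, 14, 21, 28):
        hours = projected_horizon(start, months)
        print(f"after {months:2d} months: {hours:.1f} hours")
```

Under these assumptions the horizon grows 16-fold over 28 months, since 28 months is four doubling periods.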

Conclusion

The experiment with GPT-5.3-Codex demonstrates the growing ability of agentic coding models to handle long-horizon tasks. As these models continue to improve, they could change how software is developed by reducing the need for human intervention and increasing productivity. For developers, this means more efficient and reliable tools for building complex applications.