OpenAI's o1: A Deep Dive into Long Chain Thinking and Test-Time Compute

The Engineer

11 Dec 2024 · 3 min read

o1 departs from conventional AI models by prioritizing lengthy internal thought processes over rapid responses, which improves its reasoning on complex tasks like mathematics and coding.

OpenAI’s latest model, o1, has been making waves in the AI community. While it was initially believed to be a post-trained version of GPT-4o, the real intrigue lies in its unique approach to generating and exploring long internal chains of thought before responding. This shift is particularly significant for tasks that require deeper reasoning, such as math, coding, and research.

The Technical Shift

Why Train for Long Chains of Thought?

Some problems inherently demand more "thinking time" or partial work to reach the correct solution. This is especially true in domains like mathematics, programming, and scientific research. By training a model to use tokens specifically for thinking, o1 aims to achieve higher performance at the expense of longer generation times.

Key Findings from OpenAI

According to OpenAI’s launch post and system card, evaluators preferred o1 over GPT-4o on complex tasks, but not on simpler ones:

Programming and Math Calculation: o1 outperforms GPT-4o.
Simple Text Editing: GPT-4o is preferred.

The Compute Perspective

GPT-4’s Implementation

When GPT-4 was first announced, OpenAI kept its implementation details under wraps. Over time, it was widely reported, though never officially confirmed, that GPT-4 was a mixture-of-experts (MoE) model combining 8 experts of roughly 220B parameters each. Some initially saw this approach as a fallback for when more innovative ideas were exhausted.

George Hotz famously quipped, “mixture[-of-experts] models are what you do when you run out of ideas”. However, the reality is more nuanced. The Switch Transformers paper demonstrated that MoE models have distinct scaling properties and can be more compute-efficient.
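To make the compute-efficiency point concrete, here is a minimal sketch of top-k expert routing in plain Python. It is an illustration only, not GPT-4's or any library's implementation: the toy experts, the router scores, and the function names are all made up for this example. Only the k selected experts run for a given token, so the compute per token scales with k rather than with the total number of experts (Switch Transformers uses k = 1).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_scores, k=2):
    """Run only the k selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))

# Eight toy "experts" (hypothetical): each one just scales its input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
# Made-up router output for a single token.
router_scores = [0.1, 2.0, 0.3, 0.2, 1.5, 0.0, 0.4, 0.1]

selected = route_top_k(router_scores, k=2)
output = moe_forward(3.0, experts, router_scores, k=2)
print(selected, output)
```

With 8 experts and k = 2, only a quarter of the expert parameters are touched per token, even though all of them count toward the model's total size.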

o1’s Test-Time Compute Strategy

o1 takes a different approach by focusing on how it uses compute at test time: specifically, the number of response tokens generated for each user query. Here are the key points:

Small Compute Budgets: With a standard single-response (1-shot) budget, o1 may not perform as well as a typical LLM.
Larger Compute Budgets: It is trained to excel when given more test-time compute, making it better at spending larger budgets effectively.

This strategy shifts the paradigm from simply increasing model size to optimizing how compute is allocated during inference. Instead of building a bigger bowl (i.e., a larger model), o1 uses multiple bowls (i.e., more tokens) to achieve better results.
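A back-of-the-envelope calculation shows how quickly token count drives test-time compute. The sketch below uses the common rule of thumb of roughly 2 FLOPs per active parameter per generated token; it ignores prompt processing and attention costs. The parameter count and token counts are hypothetical placeholders, not published figures for o1 or GPT-4o.

```python
def forward_flops(active_params, generated_tokens):
    """Rule-of-thumb estimate: ~2 FLOPs per active parameter per generated token."""
    return 2 * active_params * generated_tokens

# Hypothetical numbers, chosen only to illustrate the scaling.
ACTIVE_PARAMS = 100e9      # assumed active parameters per token
ANSWER_TOKENS = 200        # assumed length of the visible answer
THINKING_TOKENS = 5_000    # assumed length of the hidden chain of thought

short_answer = forward_flops(ACTIVE_PARAMS, ANSWER_TOKENS)
long_reasoning = forward_flops(ACTIVE_PARAMS, ANSWER_TOKENS + THINKING_TOKENS)
ratio = long_reasoning / short_answer

print(f"direct answer:  {short_answer:.2e} FLOPs")
print(f"with reasoning: {long_reasoning:.2e} FLOPs")
print(f"ratio:          {ratio:.0f}x")
```

Under this estimate the model size is unchanged, yet the cost per query grows in direct proportion to the number of tokens generated, which is the budget a reasoning model is trained to spend well.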

Practical Implications

For practitioners, this means:

Complex Tasks: o1 is a better fit for tasks that require deep reasoning and multiple steps.
Resource Allocation: You need to commit to providing more compute at test-time to see the full benefits of o1.
Performance Trade-offs: While it may not be as efficient for simple tasks, o1 can outperform GPT-4o in scenarios where deeper thinking is necessary.
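One way to act on these points is a simple routing rule that sends reasoning-heavy work to o1 and everything else to GPT-4o. The sketch below is a hypothetical heuristic, not an OpenAI recommendation: the task categories come from the findings above, while the function name and the returned strings are plain labels chosen for this example rather than verified API identifiers.

```python
# Task kinds where the article reports o1 being the better fit (assumed labels).
REASONING_TASKS = {"math", "coding", "research"}

def pick_model(task_kind):
    """Return a model label: the reasoning model for deep, multi-step tasks,
    the faster model for everything else."""
    if task_kind.lower() in REASONING_TASKS:
        return "o1"
    return "gpt-4o"

for task in ["math", "coding", "text editing"]:
    print(f"{task}: {pick_model(task)}")
```

In practice the routing decision would also account for latency and cost limits, since the reasoning model only pays off when it is allowed a larger test-time budget.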

Conclusion

OpenAI’s o1 represents a significant step forward in how we approach long chain thinking and test-time compute. By training models to use tokens more effectively for reasoning, o1 opens up new possibilities for solving complex problems. As the AI landscape continues to evolve, it will be interesting to see how this approach influences future model architectures and applications.