OpenAI Quietly Funded Math Benchmark Before Setting Record with o3

Models & Research

The Engineer

21 Jan 2025 · 3 min read

OpenAI's reveal of its record-breaking o3 result on the FrontierMath benchmark sparked controversy when it emerged that the company had secretly funded the benchmark's development, stirring debate over transparency and fairness in AI evaluations.

On December 20, 2024, OpenAI unveiled its new model, o3, which achieved an unprecedented 25.2% success rate on the FrontierMath benchmark, a massive leap from previous models that struggled to solve more than 2% of the problems. What wasn't immediately clear was that OpenAI had quietly funded the development of this very benchmark, raising questions about transparency and potential conflicts of interest.

The Benchmark: FrontierMath

FrontierMath, introduced in November 2024 by Epoch AI, is a rigorous test designed to evaluate how well AI systems can tackle complex mathematical problems. These problems require advanced reasoning and problem-solving skills, and they typically stump even the most sophisticated AI models. The benchmark was created by a team of over 60 leading mathematicians.

OpenAI's Involvement

The connection between OpenAI and FrontierMath came to light only after o3’s announcement. Epoch AI had signed an agreement with OpenAI that prevented it from disclosing the financial support until after o3's launch. The arrangement was revealed in a footnote added during the fifth update of Epoch AI's research paper on arXiv.

Agreement Details:
- Prevented Epoch AI from revealing OpenAI’s financial support
- Allowed OpenAI access to "much but not all" FrontierMath benchmark data

OpenAI highlighted o3's performance on FrontierMath in its marketing materials, touting the model's success as a significant breakthrough in AI reasoning capabilities.

Mathematicians' Perspective

The mathematicians who contributed to the benchmark were largely unaware of OpenAI’s involvement. According to a post on LessWrong, these experts had signed non-disclosure agreements (NDAs) that covered only the confidentiality of the problems themselves. Most believed their work would remain private and be used exclusively by Epoch AI.

Mathematician NDAs:
- Covered confidentiality of the benchmark problems
- Did not disclose OpenAI’s involvement

Transparency Issues

Epoch AI acknowledged that it should have been more transparent about its relationship with OpenAI. The lack of disclosure has raised concerns in the AI community about potential biases and the integrity of benchmark results.

Concerns:
- Potential bias in benchmark design
- Integrity of performance claims
- Ethical considerations in AI research funding

Implications for the AI Community

The incident highlights the importance of transparency in AI research, especially for benchmarks used to evaluate and compare different models. It also underscores the need for clear communication among all parties involved, including researchers, funders, and contributors.

Key Takeaways:
- Transparency is crucial for maintaining trust in AI benchmark results
- Clear communication can prevent misunderstandings and conflicts of interest
- Ethical considerations should guide funding and collaboration in AI research

While o3’s performance on FrontierMath is undeniably impressive, the circumstances surrounding its development and testing raise important questions that the AI community will need to address moving forward.