
OpenAI's reveal of its o3 model's record-breaking score on the FrontierMath benchmark sparked controversy when it emerged that the company had secretly funded the benchmark's development, stirring debate over transparency and fairness in AI evaluations.
On December 20, 2024, OpenAI unveiled its new model, o3, which achieved an unprecedented 25.2% success rate on the FrontierMath benchmark, a massive leap from previous models, which struggled to solve more than 2% of the problems. What wasn't immediately clear was that OpenAI had quietly funded the development of this very benchmark, raising questions about transparency and potential conflicts of interest.
FrontierMath, introduced in November 2024 by Epoch AI, is a rigorous test designed to evaluate how well AI systems can tackle complex mathematical problems. These problems require advanced reasoning and problem-solving skills, the kind of work that typically stumps even the most sophisticated AI models. The benchmark was created by a team of over 60 leading mathematicians.
The connection between OpenAI and FrontierMath only came to light after o3’s announcement. Epoch AI had signed an agreement with OpenAI that prevented it from disclosing the financial support until after o3's launch. The funding was then disclosed in a footnote added in the fifth update of Epoch AI's research paper on arXiv.
OpenAI highlighted o3's performance on FrontierMath in its marketing materials, touting the model's success as a significant breakthrough in AI reasoning capabilities.
The mathematicians who contributed to the benchmark were largely unaware of OpenAI’s involvement. According to a post on LessWrong, these experts had signed non-disclosure agreements (NDAs) that only covered keeping the problems themselves confidential. Most believed their work would remain private and be used exclusively by Epoch AI.

Epoch AI acknowledged that it should have been more transparent about its relationship with OpenAI. The lack of transparency has raised concerns within the AI community about potential biases and the integrity of benchmark results.
The incident highlights the importance of transparency in AI research, especially when it comes to benchmarks that are used to evaluate and compare different models. It also underscores the need for clear communication between all parties involved, including researchers, funders, and contributors.
While o3’s performance on FrontierMath is undeniably impressive, the circumstances surrounding its development and testing raise important questions that the AI community will need to address moving forward.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
21 January 2025