
Recent tests show GPT-OSS-120B faltering on LiveBench, raising questions about the model's true capabilities and the reliability of public benchmarks in gauging performance accurately.
Aug 10, 2025
When it comes to large language models (LLMs), benchmarks are crucial for understanding their capabilities and limitations. However, not all benchmarks are created equal. Public benchmarks can be gamed by training models directly on the questions and answers, while private benchmarks like LiveBench offer a more accurate measure of true performance.
To evaluate the performance of various LLMs, I followed these steps:
1. Intersection of scores: I identified the LLMs that have scores on both the Artificial Analysis Intelligence Index (AAI) and LiveBench.
2. Ranking models: I ranked those LLMs twice, once by AAI score and once by LiveBench score.
3. Difference calculation: For each model, I calculated the difference between its two ranks to see how its standing changed from the public benchmark to the private one.
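The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the exact script behind the chart: the function names, the tie-breaking rule, and the sign convention (positive means the model ranks worse on LiveBench than on AAI) are assumptions, and the scores are hypothetical placeholders rather than real benchmark results.

```python
def rank_by_score(scores):
    """Map each model to its 1-based rank, highest score first.

    Ties are broken alphabetically by model name (an assumption).
    """
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: i for i, model in enumerate(ordered, start=1)}


def rank_differences(aai_scores, livebench_scores):
    # Step 1: keep only models that have a score on both benchmarks.
    common = sorted(set(aai_scores) & set(livebench_scores))
    # Step 2: rank the common models separately on each benchmark.
    aai_rank = rank_by_score({m: aai_scores[m] for m in common})
    lb_rank = rank_by_score({m: livebench_scores[m] for m in common})
    # Step 3: positive difference = the model places lower on LiveBench.
    return {m: lb_rank[m] - aai_rank[m] for m in common}


# Hypothetical placeholder scores, NOT real benchmark results.
aai = {"model-a": 61.0, "model-b": 58.0, "model-c": 55.0, "model-d": 70.0}
livebench = {"model-a": 48.0, "model-b": 66.0, "model-c": 60.0, "model-e": 72.0}

diffs = rank_differences(aai, livebench)
# Print the biggest drops first.
for model, diff in sorted(diffs.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {diff:+d}")
```

Comparing ranks rather than raw scores sidesteps the fact that the two benchmarks use different scales, at the cost of hiding how large the underlying score gaps are.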
The chart below summarizes the findings:
[Chart: difference between AAI rank and LiveBench rank for each model]
The new GPT-OSS-120B stands out as the model whose rank drops the most between AAI and LiveBench.
This is concerning, especially next to smaller models like the Qwen 3 variants (dense 32B and sparse 30B-A3B), which have roughly a quarter of the parameter count yet outperform GPT-OSS-120B on LiveBench. These smaller models can even run efficiently on a two-year-old laptop.
The significant drop in performance suggests that GPT-OSS-120B may have been overfitted to public benchmark questions. This means the model was trained or fine-tuned in a way that improved its scores on these benchmarks but did not enhance its true generalization capabilities.
I care deeply about American AI labs, particularly OpenAI, releasing open-weights models that are genuinely useful. The recent surge of strong, permissively-licensed Chinese models has raised the bar significantly. Models like DeepSeek R1 and Qwen 3 variants continue to shine in both performance and accessibility.
The performance of GPT-OSS-120B on LiveBench is a red flag. It highlights the importance of using private benchmarks to ensure that models are not just optimized for public scores but are genuinely capable. As the AI landscape continues to evolve, it’s crucial for labs to focus on building models that generalize well and provide real value.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.