GPT-OSS-120B Struggles on LiveBench: What’s Going On?


The Engineer

12 Aug 2025

Recent tests show GPT-OSS-120B faltering on LiveBench, raising questions about the model's true capabilities and the reliability of public benchmarks in gauging performance accurately.


Aug 10, 2025

When it comes to large language models (LLMs), benchmarks are crucial for understanding their capabilities and limitations. However, not all benchmarks are created equal. Public benchmarks can be gamed by training models directly on the questions and answers, while private benchmarks like LiveBench offer a more accurate measure of true performance.

The Benchmarking Process

To evaluate the performance of various LLMs, I followed these steps:

1. Intersection of Scores: I identified the LLMs that have scores on both the Artificial Analysis Intelligence Index (AAI) and LiveBench.
   - AAI is a composite score built from several public benchmarks. Their questions and answers are often publicly available, which makes it possible for labs to train models specifically to perform well on them.
   - LiveBench, on the other hand, is a high-quality private benchmark that releases its questions only after a three-month delay, so the newest questions stay unseen at evaluation time and the risk of overfitting is reduced.
2. Ranking Models: I ranked these LLMs twice, once by AAI score and once by LiveBench score.
3. Difference Calculation: For each model, I subtracted its LiveBench rank from its AAI rank to see how its standing changed from public to private benchmarks. A negative result means the model ranks lower on LiveBench.
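The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the chart: the function names, the model names, and every score in the example are hypothetical placeholders, and the tie handling (tied models share the better rank) is an assumption suggested by the shared 9th place reported later in the post.

```python
from typing import Dict


def competition_ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models by score, highest first. Tied scores share the better rank (1, 2, 2, 4)."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks: Dict[str, int] = {}
    prev_score = None
    prev_rank = 0
    for position, (model, score) in enumerate(ordered, start=1):
        if score != prev_score:
            # New score value: the rank jumps to the current position.
            prev_rank = position
            prev_score = score
        ranks[model] = prev_rank
    return ranks


def rank_deltas(aai: Dict[str, float], livebench: Dict[str, float]) -> Dict[str, int]:
    """AAI rank minus LiveBench rank, computed only for models present in both tables."""
    common = aai.keys() & livebench.keys()  # step 1: intersection of scores
    aai_ranks = competition_ranks({m: aai[m] for m in common})  # step 2: rank by AAI
    lb_ranks = competition_ranks({m: livebench[m] for m in common})  # step 2: rank by LiveBench
    # Step 3: negative means the model sits lower on LiveBench than on AAI.
    return {m: aai_ranks[m] - lb_ranks[m] for m in common}


# Placeholder scores for illustration only; these are NOT real benchmark results.
aai_scores = {"model_a": 60, "model_b": 55, "model_c": 55, "model_d": 40, "model_e": 70}
livebench_scores = {"model_a": 50, "model_b": 65, "model_c": 45, "model_d": 48, "model_f": 80}

for model, delta in sorted(rank_deltas(aai_scores, livebench_scores).items()):
    print(f"{model}: {delta:+d}")
```

Models that appear in only one table (`model_e` and `model_f` here) are dropped before ranking, so both rankings cover the same set of models and the differences are comparable.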

The Results

The chart below summarizes the findings:

- Negative Numbers (Red): Models that rank lower on LiveBench than on AAI. This suggests they may have been overfitted to public benchmark questions.
- Positive Numbers (Green): Models that rank higher on LiveBench than on AAI, which suggests better generalization.

GPT-OSS-120B: The Outlier

The new GPT-OSS-120B stands out as the model with the largest ranking drop from AAI to LiveBench. Here are the details:

AAI Ranking: 9th place (tied with DeepSeek R1 and Claude 4 Sonnet Thinking)
LiveBench Ranking: 24th place
Ranking Drop: 15 positions

This is concerning, especially because smaller models such as the Qwen 3 variants (dense 32B and sparse 30B-A3B) have roughly a quarter of the parameter count yet outperform GPT-OSS-120B on LiveBench. These smaller models can even run efficiently on a two-year-old laptop.

What’s Going On?

The size of the ranking drop suggests that GPT-OSS-120B may have been overfitted to public benchmark questions. If so, the model was trained or fine-tuned in a way that improved its scores on those benchmarks without improving its ability to generalize.

Bias and Context

I care deeply about American AI labs, particularly OpenAI, releasing open-weights models that are genuinely useful. The recent surge of strong, permissively licensed Chinese models has raised the bar significantly. Models like DeepSeek R1 and the Qwen 3 variants continue to shine in both performance and accessibility.

Conclusion

The performance of GPT-OSS-120B on LiveBench is a red flag. It highlights the importance of using private benchmarks to ensure that models are not just optimized for public scores but are genuinely capable. As the AI landscape continues to evolve, it’s crucial for labs to focus on building models that generalize well and provide real value.