
A 3-billion-parameter language model from Sina Weibo is making waves by matching or exceeding the performance of much larger models, sparking debate over the future of AI benchmarks.
On Sunday, a team of nine researchers at Sina Weibo, China’s leading microblogging platform, quietly published a 14-page technical report on arXiv that has sent shockwaves through the AI research community. The claim: their language model, VibeThinker-3B, with just 3 billion parameters, can match or exceed the reasoning performance of flagship systems from Google DeepMind, OpenAI, Anthropic, and DeepSeek that are hundreds of times larger.
VibeThinker-3B scored 94.3 on AIME 2026, the American Invitational Mathematics Examination, a highly demanding standardized math competition. This score places it alongside DeepSeek V3.2, a model with 671 billion parameters, and ahead of Google's Gemini 3 Pro, which scored 91.7. Using a test-time scaling technique called Claim-Level Reliability Assessment, the score climbs to 97.1, outperforming virtually every system in the public record.
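To put those figures side by side, here is a minimal Python sketch of the arithmetic. It uses only the numbers quoted above; the variable names are illustrative and do not come from the technical report.

```python
# Figures as quoted in the article; names here are illustrative only.
VIBETHINKER_PARAMS = 3e9      # VibeThinker-3B
DEEPSEEK_V32_PARAMS = 671e9   # DeepSeek V3.2

AIME_2026 = {
    "VibeThinker-3B": 94.3,
    "VibeThinker-3B (test-time scaling)": 97.1,
    "Gemini 3 Pro": 91.7,
}

# How many times larger DeepSeek V3.2 is than VibeThinker-3B.
size_ratio = DEEPSEEK_V32_PARAMS / VIBETHINKER_PARAMS

# Point gaps on AIME 2026, rounded to avoid floating-point noise.
lead_over_gemini = round(
    AIME_2026["VibeThinker-3B"] - AIME_2026["Gemini 3 Pro"], 1
)
scaling_gain = round(
    AIME_2026["VibeThinker-3B (test-time scaling)"] - AIME_2026["VibeThinker-3B"], 1
)

print(f"DeepSeek V3.2 is about {size_ratio:.0f}x larger")
print(f"Lead over Gemini 3 Pro: {lead_over_gemini} points")
print(f"Gain from test-time scaling: {scaling_gain} points")
```

On these numbers the size gap is roughly 224 to 1, while the score gaps are under three points each, which is why the result reads as either a breakthrough or a warning about the benchmark.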
The reaction was swift and mixed. Within hours, the paper had garnered 62 upvotes on Hugging Face's daily papers feed, the model repository had accumulated 130 likes, and the GitHub repository had reached 685 stars. However, social media responses were not uniformly positive. User @orcus108 on X wrote, "WHAT THE HELL is happening in AI? A 3B parameter model just put up coding benchmark scores in the same league as Claude Opus 4.5… I genuinely don't know if this is a breakthrough or if the benchmarks are broken."
The VibeThinker-3B story highlights the ongoing tension between genuine scientific advancement and the growing suspicion that AI benchmarks have become gameable to the point of meaninglessness. The debate matters not just for academic prestige but for a multibillion-dollar question: whether the industry's relentless push toward ever-larger models is the only path to intelligence.

The technical report includes detailed benchmarks and implementation notes, which give other researchers a basis for checking the model’s performance. Even so, the community remains divided on whether the results indicate a genuine breakthrough or a model tuned closely to existing benchmarks.
The VibeThinker-3B story is a reminder that innovation in AI can come from unexpected sources and that the path to intelligence may not be as simple as increasing model size. As the debate continues, the reliability of AI benchmarks and the limits of model efficiency remain open research questions.
Original Sources
Why Weibo’s tiny VibeThinker-3B has the AI world arguing over benchmarks again
↗ https://venturebeat.com/technology/why-weibos-tiny-vibethinker-3b-has-the-ai-world-arguing-over-benchmarks-again
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
23 June 2026
© 2026 Cedar & Bloom. All rights reserved.