
xAI's Grok-2 beta release shows stronger chat, coding, and reasoning skills than its predecessor, and an early version outranked Claude 3.5 Sonnet and GPT-4-Turbo on the LMSYS Chatbot Arena leaderboard.
On August 13, 2024, xAI announced the beta release of Grok-2 and its smaller variant, Grok-2 mini. The new models are a significant advance over their predecessor, Grok-1.5, with notable improvements in chat, coding, and reasoning. An early version of Grok-2 was tested on the LMSYS Chatbot Arena leaderboard under the name "sus-column-r," where it outperformed both Claude 3.5 Sonnet and GPT-4-Turbo.
The LMSYS Chatbot Arena is a popular head-to-head benchmark for language models, and Grok-2's lead over Claude 3.5 Sonnet and GPT-4-Turbo there is measured by overall Elo score. The model's response quality is also noteworthy, with clear improvements in reasoning over retrieved content and in tool use.
[Figures: overall Elo scores; win rate against competing models]
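Elo scores and head-to-head win rates are two views of the same comparison data. As a rough illustration, the sketch below uses the standard Elo expected-score formula to turn a rating gap into an expected win rate. The ratings are hypothetical, the function names are ours, and the leaderboard's actual rating computation may differ from this textbook form.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    A tie counts as half a win, so the two sides' expected scores sum to 1.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def updated_rating(rating: float, expected: float, actual: float, k: float = 32.0) -> float:
    """One Elo update step; actual is 1.0 for a win, 0.5 for a tie, 0.0 for a loss."""
    return rating + k * (actual - expected)


# Hypothetical ratings, chosen only to illustrate the formula.
model_a, model_b = 1300.0, 1250.0
p_a = expected_score(model_a, model_b)
print(f"Expected score of A against B: {p_a:.3f}")
```

With these illustrative numbers, a 50-point gap corresponds to an expected score of about 0.57 for the higher-rated model, which is why modest Elo differences still imply fairly close head-to-head results.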
Internally, xAI employs a rigorous evaluation process to assess model performance. AI Tutors engage with the models across a range of tasks that mimic real-world interactions. In each interaction, Grok generates two responses, and the better one is selected against specific criteria set out in the guidelines. The evaluation concentrates on a defined set of focus areas.
Grok-2 has shown significant improvements in these areas, particularly in reasoning with retrieved content, identifying missing information, reasoning through sequences of events, and discarding irrelevant posts.
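A side-by-side evaluation of this kind is typically summarized as a preference rate per focus area. The sketch below is a minimal illustration of that bookkeeping, not xAI's tooling: the area names, the model labels, and the `preference_rates` function are all hypothetical.

```python
from collections import Counter, defaultdict


def preference_rates(judgments):
    """Aggregate pairwise selections into per-area preference rates.

    judgments: iterable of (area, preferred) pairs, where `preferred` is the
    label of the response the evaluator selected in one comparison.
    Returns {area: {label: fraction of comparisons in which it was preferred}}.
    """
    counts = defaultdict(Counter)
    for area, preferred in judgments:
        counts[area][preferred] += 1
    return {
        area: {label: n / sum(tally.values()) for label, n in tally.items()}
        for area, tally in counts.items()
    }


# Hypothetical judgments, for illustration only.
sample = [
    ("retrieval_reasoning", "candidate"),
    ("retrieval_reasoning", "candidate"),
    ("retrieval_reasoning", "baseline"),
    ("tool_use", "baseline"),
]
rates = preference_rates(sample)
print(rates)
```

Keeping the tally per area rather than as a single overall number makes it possible to see where a model improved and where it did not.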

The Grok-2 models were evaluated across a series of academic benchmarks, including reasoning, reading comprehension, math, science, and coding. Both Grok-2 and Grok-2 mini demonstrated significant improvements over the previous Grok-1.5 model, achieving performance levels competitive with other frontier models.
Reported benchmark categories: graduate-level science knowledge (GPQA), general knowledge (MMLU, MMLU-Pro), math competition problems (MATH), and vision-based tasks.
Grok-2 and Grok-2 mini are currently available in beta on 𝕏. They will also be made available through xAI's enterprise API later this month, giving businesses access to the same capabilities.
The release of Grok-2 marks a significant step forward in the field of language models, offering enhanced chat, coding, and reasoning capabilities. With its strong performance across various benchmarks and real-world applications, Grok-2 is poised to become a leading choice for developers and businesses alike.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
15 August 2024