Grok-2 Beta Release: Outperforming Competitors with Enhanced Chat, Coding, and Reasoning Capabilities

The Engineer

15 Aug 2024 · 3 min read

xAI's Grok-2 beta release shows stronger chat, coding, and reasoning skills than its predecessor, and an early version outranked Claude 3.5 Sonnet and GPT-4-Turbo on the LMSYS Chatbot Arena leaderboard.

Grok-2 Beta Release: A Significant Leap Forward in Language Models

On August 13, 2024, xAI announced the beta release of Grok-2 and its smaller variant, Grok-2 mini. The new models improve on their predecessor, Grok-1.5, in chat, coding, and reasoning. An early version of Grok-2 was tested on the LMSYS leaderboard under the name "sus-column-r," where it outperformed both Claude 3.5 Sonnet and GPT-4-Turbo.

Key Benchmarks and Performance Highlights

Chatbot Arena Performance

An early version of Grok-2 was entered into the LMSYS Chatbot Arena, a popular competitive benchmark for language models, where it outperformed both Claude 3.5 Sonnet and GPT-4-Turbo in overall Elo score. The model's response quality also improved, particularly in reasoning with retrieved content and in tool use.

Overall Elo Scores:
- Grok-2: [High Score]
- Claude 3.5 Sonnet: [Lower Score]
- GPT-4-Turbo: [Lower Score]

Win Rate Against Competing Models:
- Grok-2 consistently outperformed other models in head-to-head comparisons.
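
For readers unfamiliar with Elo scores, the sketch below shows how a rating gap translates into an expected head-to-head result under the standard Elo formula. It is an illustration only: the ratings used are hypothetical placeholders, not the leaderboard's actual scores, and the leaderboard's own rating procedure may differ from this textbook version.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model.

    The result is a value between 0 and 1, where a draw counts as half a win.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 if A won, 0.5 for a draw, 0.0 if A lost.
    k is the update step size; 32 is a common textbook choice, assumed here.
    """
    expected_a = elo_expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Hypothetical ratings chosen for illustration, not real leaderboard values.
    higher, lower = 1300.0, 1250.0
    print(f"Expected score of the higher-rated model: {elo_expected_score(higher, lower):.3f}")
```

Under this formula, a 50-point gap corresponds to an expected score of roughly 57% for the higher-rated model, so even a clear lead on the leaderboard does not imply winning every comparison.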

Internal Evaluation Process

Internally, xAI uses a structured evaluation process to assess model performance. AI Tutors engage with the models across a range of tasks that mimic real-world interactions. In each interaction, Grok generates two responses, and the tutor selects the better one according to criteria set out in the evaluation guidelines. The focus areas for evaluation include:

Following Instructions: Ensuring the model can accurately execute user commands.
Providing Accurate Information: Ensuring the model delivers factual and reliable information.

Grok-2 has shown significant improvements in these areas, particularly in reasoning with retrieved content, identifying missing information, reasoning through sequences of events, and discarding irrelevant posts.
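
The pairwise selection described above can be summarized as a per-candidate win rate. The sketch below is one minimal way to tally such preferences; the `Comparison` record, the `win_rates` function, and the candidate names are hypothetical and are not taken from xAI's tooling.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Comparison(NamedTuple):
    """One evaluation interaction: two candidate responses and the preferred one."""
    first: str
    second: str
    preferred: str


def win_rates(comparisons: Iterable[Comparison]) -> Dict[str, float]:
    """Fraction of comparisons each candidate won, out of those it appeared in."""
    wins: Counter = Counter()
    appearances: Counter = Counter()
    for c in comparisons:
        if c.first == c.second:
            raise ValueError("a comparison needs two distinct candidates")
        if c.preferred not in (c.first, c.second):
            raise ValueError("preferred must be one of the two candidates")
        appearances[c.first] += 1
        appearances[c.second] += 1
        wins[c.preferred] += 1
    return {name: wins[name] / appearances[name] for name in appearances}


if __name__ == "__main__":
    # Hypothetical labels for illustration only.
    log = [
        Comparison("candidate-a", "candidate-b", "candidate-a"),
        Comparison("candidate-a", "candidate-b", "candidate-b"),
        Comparison("candidate-a", "candidate-b", "candidate-a"),
    ]
    print(win_rates(log))
```

A real evaluation would likely also record which guideline criterion drove each choice, so that gains in instruction following can be separated from gains in factual accuracy.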

Academic Benchmarks

The Grok-2 models were evaluated across a series of academic benchmarks, including reasoning, reading comprehension, math, science, and coding. Both Grok-2 and Grok-2 mini demonstrated significant improvements over the previous Grok-1.5 model, achieving performance levels competitive with other frontier models.

Notable Benchmark Results

Graduate-Level Science Knowledge (GPQA):
- Grok-1.5: 35.9%
- Grok-2 mini: 51.0%
- Grok-2: 56.0%

General Knowledge (MMLU, MMLU-Pro):
- Significant improvements over Grok-1.5.

Math Competition Problems (MATH):
- Improved performance in solving complex math problems.

Vision-Based Tasks:
- State-of-the-art performance in visual math reasoning (MathVista) and document-based question answering (DocVQA).
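
To put the GPQA figures above in perspective, the short sketch below computes each model's gain over Grok-1.5 in percentage points and in relative terms. The scores are the ones listed in this article; the helper name `gains_over_baseline` is made up for this example.

```python
from typing import Dict, Tuple

# GPQA scores (percent) as reported in this article.
GPQA_SCORES: Dict[str, float] = {
    "Grok-1.5": 35.9,
    "Grok-2 mini": 51.0,
    "Grok-2": 56.0,
}


def gains_over_baseline(scores: Dict[str, float], baseline: str) -> Dict[str, Tuple[float, float]]:
    """Map each non-baseline model to (percentage-point gain, relative gain in percent)."""
    base = scores[baseline]
    return {
        name: (score - base, 100.0 * (score - base) / base)
        for name, score in scores.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, (points, relative) in gains_over_baseline(GPQA_SCORES, "Grok-1.5").items():
        print(f"{name}: +{points:.1f} points ({relative:.0f}% relative to Grok-1.5)")
```

On these numbers, Grok-2 mini gains 15.1 percentage points and Grok-2 gains 20.1, which is about 42% and 56% above the Grok-1.5 score respectively.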

Availability

Grok-2 and Grok-2 mini are currently available in beta on 𝕏. Both models will also be offered through xAI's enterprise API later this month, giving businesses programmatic access to them.

Conclusion

The release of Grok-2 marks a significant step forward from Grok-1.5, with stronger chat, coding, and reasoning capabilities. Given its results on the Chatbot Arena leaderboard and on academic benchmarks such as GPQA, Grok-2 is a competitive option for developers and businesses.