Grok-1.5: Enhanced Reasoning and Long Context Understanding for LLMs

The Engineer

1 Apr 2024 · 2 min read

Grok-1.5 brings stronger reasoning and long context understanding, extending its context window to 128,000 tokens (16 times that of its predecessor) and improving performance on coding and math problems.

March 28, 2024

Grok-1.5 is the latest iteration of xAI's large language model (LLM), boasting significant improvements in reasoning capabilities and an extended context length of up to 128,000 tokens. This update will be available to early testers and existing Grok users on the 𝕏 platform soon.

Capabilities and Reasoning

One of the most notable enhancements in Grok-1.5 is its performance in coding and math-related tasks. Here are the key benchmarks:

MATH Benchmark: Grok-1.5 scored 50.6% (4-shot), up from Grok-1's 23.9%.
GSM8K Benchmark: Grok-1.5 scored 90% (8-shot), up from Grok-1's 62.9%.
HumanEval Benchmark: Grok-1.5 scored 74.1% (0-shot) on code generation and problem-solving, up from Grok-1's 63.2%.

| Benchmark | Grok-1 | Grok-1.5 | Mistral Large | Claude 2 | Claude 3 Sonnet | Gemini Pro 1.5 | GPT-4 | Claude 3 Opus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU | 73% 5-shot | 81.3% 5-shot | 81.2% 5-shot | 75% 5-shot | 79% 5-shot | 83.7% 5-shot | 86.4% 5-shot | 86.8% 5-shot |
| MATH | 23.9% 4-shot | 50.6% 4-shot | - | - | 40.5% 4-shot | 58.5% 4-shot | 52.9% 4-shot | 61% 4-shot |
| GSM8K | 62.9% 8-shot | 90% 8-shot | 81% 5-shot | 88% 0-shot CoT | 92.3% 0-shot CoT | 91.7% 11-shot | 92% 5-shot | 95% 0-shot CoT |
| HumanEval | 63.2% 0-shot | 74.1% 0-shot | 45.1% 0-shot | 70% 0-shot | 73% 0-shot | 71.9% 0-shot | 67% 0-shot | 84.9% 0-shot |
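To make the size of the Grok-1 to Grok-1.5 jump easier to compare across benchmarks, the short Python sketch below computes the absolute gain (in percentage points) and the relative gain for each benchmark. The scores are copied from the comparison above; the `gains` helper and the `scores` dictionary are illustrative names, not part of any xAI tooling.

```python
# Scores (in percent) for Grok-1 and Grok-1.5, copied from the comparison above.
# Each pair uses the same shot setting for both models.
scores = {
    "MMLU": (73.0, 81.3),
    "MATH": (23.9, 50.6),
    "GSM8K": (62.9, 90.0),
    "HumanEval": (63.2, 74.1),
}


def gains(grok1: float, grok15: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = grok15 - grok1
    relative = 100.0 * absolute / grok1
    return round(absolute, 1), round(relative, 1)


for name, (old, new) in scores.items():
    abs_gain, rel_gain = gains(old, new)
    print(f"{name}: +{abs_gain} points ({rel_gain}% relative)")
```

On MATH, for example, the absolute gain is 26.7 percentage points, which means the score more than doubled.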

Long Context Understanding

Grok-1.5 can process contexts of up to 128,000 tokens, 16 times the context length of its predecessor. The longer context window lets Grok draw on information from much longer and more complex documents while maintaining its instruction-following capabilities.
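As a rough illustration of what the larger window means in practice, the Python sketch below estimates how many separate passes a long document would need under each context size. The 8,000-token predecessor window is simply 128,000 divided by 16 as stated above (the nominal sizes may differ slightly, for instance if they are powers of two). The four-characters-per-token ratio and the 1,000-token reserve for the prompt and answer are assumptions for illustration, not properties of Grok's tokenizer.

```python
import math

GROK_15_CONTEXT = 128_000   # tokens, from the text above
SCALE_FACTOR = 16           # "16x", from the text above
# Implied predecessor window: 128,000 / 16 = 8,000 tokens.
PREVIOUS_CONTEXT = GROK_15_CONTEXT // SCALE_FACTOR

# ASSUMPTION: a rough heuristic for English text, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Crude token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def chunks_needed(num_tokens: int, context_window: int, reserved: int = 1_000) -> int:
    """Passes needed to cover num_tokens, keeping `reserved` tokens
    free in each pass for the instructions and the model's answer."""
    usable = context_window - reserved
    if usable <= 0:
        raise ValueError("reserved tokens exceed the context window")
    return max(1, math.ceil(num_tokens / usable))


doc_tokens = 100_000  # hypothetical long document
print("8K-style window:  ", chunks_needed(doc_tokens, PREVIOUS_CONTEXT), "passes")
print("128K-style window:", chunks_needed(doc_tokens, GROK_15_CONTEXT), "passes")
```

Under these assumptions, a 100,000-token document that would have to be split into many chunks with the smaller window fits in a single pass with the larger one, so the model can relate passages that chunking would otherwise separate.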

Needle In A Haystack (NIAH) Evaluation: Grok-1.5 demonstrated perfect retrieval results for text embedded within contexts of up to 128,000 tokens. Reliable retrieval at this length matters for tasks that depend on extracting specific information from very long documents.
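The general shape of a needle-in-a-haystack test can be sketched in a few lines of Python. This is a minimal illustration of the idea, not xAI's evaluation harness: the filler text, the needle, the `run_niah` helper, and the `toy_model` stand-in are all hypothetical, and lengths are measured in characters rather than tokens to keep the example self-contained. In a real run, `toy_model` would be replaced by a call to the model under test.

```python
from typing import Callable, Dict, List, Tuple

FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret passphrase is 'violet-kettle-42'. "
QUESTION = "What is the secret passphrase?"
EXPECTED = "violet-kettle-42"


def build_haystack(total_chars: int, depth: float) -> str:
    """Build filler text of roughly total_chars characters and insert
    the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * (total_chars // len(FILLER) + 1)
    position = int(depth * len(sentences))
    sentences.insert(position, NEEDLE)
    return "".join(sentences)


def run_niah(
    model: Callable[[str], str],
    lengths: List[int],
    depths: List[float],
) -> Dict[Tuple[int, float], bool]:
    """Query the model at every (length, depth) pair and record
    whether its answer contains the expected string."""
    results = {}
    for length in lengths:
        for depth in depths:
            prompt = build_haystack(length, depth) + "\n\n" + QUESTION
            results[(length, depth)] = EXPECTED in model(prompt)
    return results


def toy_model(prompt: str) -> str:
    """HYPOTHETICAL stand-in for a real LLM call: it searches the prompt text."""
    marker = "passphrase is '"
    start = prompt.find(marker)
    if start == -1:
        return "I don't know."
    start += len(marker)
    return prompt[start:prompt.find("'", start)]


results = run_niah(toy_model, lengths=[1_000, 10_000], depths=[0.0, 0.5, 1.0])
accuracy = sum(results.values()) / len(results)
print(f"retrieval accuracy: {accuracy:.0%}")
```

Sweeping both the context length and the needle's depth is what makes the test informative: a model may retrieve well near the start or end of a long context and still miss text placed in the middle.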

Grok-1.5 Infrastructure

The development of cutting-edge LLMs like Grok-1.5 requires robust and flexible infrastructure capable of handling massive GPU clusters. Here are some key points about the infrastructure:

Custom Distributed Training Framework: Grok-1.5 is built on a custom distributed training framework designed to optimize performance and scalability.