Four weeks ago, GPT-4 was the undisputed leader in the large language model (LLM) space. It consistently topped benchmarks and was widely regarded as the most capable default model for a wide range of tasks. However, that dominance has been shaken by the recent release of four new models from different vendors, each pushing the boundaries of what LLMs can achieve.
New Entrants to the LLM Race
-
Google Gemini 1.5 (Released: February 15th)
- Key Feature: One million token context length, nearly 8 times that of GPT-4 Turbo.
- Video Processing: Breaks videos into one frame per second, with each frame represented by 258 tokens. This allows for extensive video processing capabilities within its massive context window.
- Impact: While not outperforming GPT-4 in every benchmark, Gemini 1.5's unique features make it a strong contender for specific use cases.
-
Mistral Large (Released: February 26th)
- Openly Licensed Models: Mistral has gained a reputation for its exceptional openly licensed models, such as the 7B model that runs on mobile devices and the Mixtral-8x7B model that performs well on laptops.
- Performance: The hosted Large model is in the same performance class as GPT-4, though it may not outperform it across all metrics.
- Future Potential: Mistral's commitment to open licensing and their track record of innovation make them a vendor to watch.
-
Claude 3 Opus (Released: March 4th)
- Vibes: Strong positive feedback from evaluators, with many rating it as the first clear GPT-4 beater.
- Use Case: Particularly strong in code generation. For instance, complex prompts that produced broken JavaScript in GPT-4 yielded perfect working answers in Claude 3 Opus.
- System Prompt: Anthropic research engineer Amanda Askell provided a detailed breakdown of the system prompt, offering insights into its design and effectiveness.
-
Inflection-2.5 (Released: March 7th)
- Surprise Entry: Inflection, known for their conversation-focused chat interface Pi, has introduced a new model that benchmarks favorably against GPT-4.
- Performance: While initially seen as more gimmicky, Inflection-2.5's performance metrics have caught the attention of the LLM community.
Technical Details and Benchmarks
Each of these models brings unique technical advancements to the table:
- Gemini 1.5:
- Context Length: One million tokens.
- Video Processing: Efficiently handles video by breaking it into frames, each represented by 258 tokens.
- Use Case: Ideal for applications requiring extensive context or multimedia processing.

-
Mistral Large:
- Open Licensing: Models like Mistral 7B and Mixtral-8x7B are available for use on a wide range of devices.
- Performance Parity: Competes with GPT-4 in terms of general capabilities, making it a viable alternative.
-
Claude 3 Opus:
- Code Generation: Superior performance in generating functional code, addressing a common pain point with other models.
- System Prompt Design: Detailed and well-crafted, contributing to its overall effectiveness.
-
Inflection-2.5:
- Conversation Focus: Built on the foundation of Pi, which emphasizes natural language interaction.
- Benchmark Performance: Competitive with GPT-4 in various metrics, indicating strong general capabilities.
Impact on Practitioners
The emergence of these new models has significant implications for practitioners:
- Diverse Options: More choices mean developers can select the model best suited to their specific needs, whether it's context length, video processing, or code generation.
- Innovation Drive: Competition is likely to accelerate innovation, pushing vendors to continually improve their models.
- Open Licensing: Models like those from Mistral encourage community-driven development and accessibility.
Conclusion
The landscape of large