HEADLINE: LLaVA-NeXT: Enhanced Multimodal Reasoning and OCR Capabilities Exceed Gemini Pro on Several Benchmarks

Models & Research

The Engineer

1 Feb 2024 · 3 min read

LLaVA-NeXT outperforms Gemini Pro on several visual reasoning and OCR benchmarks, supports higher image resolutions, and brings stronger world knowledge, making it a useful tool for researchers and developers.

In the fast-moving field of large multimodal models (LMMs), the team behind LLaVA has released LLaVA-NeXT. Building on LLaVA-1.5, which was released in October 2023, the new iteration improves visual reasoning, OCR capabilities, and world knowledge. Notably, LLaVA-NeXT outperforms Gemini Pro on several benchmarks, which makes it a compelling choice for practitioners and researchers alike.

Key Improvements

1. Higher Input Image Resolution

LLaVA-NeXT accepts images with up to 4x as many pixels as its predecessor, which lets the model capture more visual detail. It supports three aspect ratios, at resolutions of up to 672x672, 336x1344, and 1344x336. The higher resolution matters most for tasks that require fine-grained visual analysis.
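To make the resolution options concrete, here is a minimal Python sketch of how an input image might be matched to one of the three supported resolutions. The selection rule (keep the most image pixels after an aspect-preserving fit, then prefer the least padding) and the names `pick_resolution` and `tiles_for` are illustrative assumptions, not the project's confirmed implementation. The 336-pixel tile size is inferred from the "4x more pixels" figure, and each resolution is read as (width, height).

```python
# Illustrative sketch only: the selection rule is an assumption,
# not LLaVA-NeXT's confirmed preprocessing code.

BASE = 336  # assumed tile side in pixels; 672x672 is 4x the pixels of 336x336

# Maximum supported resolutions from the article, read as (width, height).
SUPPORTED = [(672, 672), (336, 1344), (1344, 336)]


def tiles_for(resolution):
    """Number of BASE x BASE tiles that cover a supported resolution."""
    width, height = resolution
    return (width // BASE) * (height // BASE)


def pick_resolution(width, height):
    """Pick the supported resolution that preserves the most image pixels
    after an aspect-preserving fit, breaking ties by least padding."""
    best, best_key = None, None
    for target_w, target_h in SUPPORTED:
        # Largest uniform scale at which the image still fits in the target.
        scale = min(target_w / width, target_h / height)
        fit_w, fit_h = round(width * scale), round(height * scale)
        # Upscaling adds no information, so cap at the original pixel count.
        effective = min(fit_w * fit_h, width * height)
        wasted = target_w * target_h - effective
        key = (effective, -wasted)
        if best_key is None or key > best_key:
            best, best_key = (target_w, target_h), key
    return best


if __name__ == "__main__":
    for size in [(1000, 1000), (2000, 500), (500, 2000)]:
        print(size, "->", pick_resolution(*size))
```

Under this rule a square image maps to 672x672, a wide panorama to 1344x336, and a tall screenshot to 336x1344. All three options cover exactly four 336x336 tiles, which is where the 4x pixel budget comes from.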

2. Enhanced Visual Reasoning and OCR

The model's visual reasoning and OCR capabilities have been significantly improved through an optimized visual instruction tuning data mixture. This helps in scenarios where text inside an image must be recognized and interpreted accurately.

3. Improved Visual Conversation for Diverse Scenarios

With stronger world knowledge and logical reasoning, LLaVA-NeXT can sustain more complex visual conversations, which makes it suitable for a broader range of scenarios and use cases.

4. Efficient Deployment and Inference

LLaVA-NeXT can be deployed and served with SGLang (Structured Generation Language), a framework designed for efficient deployment and inference. This is intended to let LLaVA-NeXT be integrated into production environments with minimal overhead.

Performance and Efficiency

Despite these enhancements, LLaVA-NeXT maintains the minimalist design and data efficiency of its predecessor. It reuses the pretrained connector from LLaVA-1.5 and requires fewer than 1M visual instruction tuning samples. The largest variant, with 34 billion parameters, can be trained in approximately one day using 32 A100 GPUs.
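The training budget quoted above can be turned into GPU-hours with simple arithmetic. The Python sketch below is only a back-of-the-envelope helper: the function names are hypothetical, "approximately one day" is taken as exactly 24 hours, and the rescaling to a smaller cluster assumes perfectly linear scaling, which real training runs rarely achieve.

```python
# Back-of-the-envelope arithmetic for the training budget quoted in the text.
# Assumptions: "approximately one day" is taken as 24 hours, and scaling
# across GPU counts is treated as perfectly linear.

def gpu_hours(num_gpus, days):
    """Total GPU-hours consumed by num_gpus running for the given days."""
    return num_gpus * days * 24


def days_on(num_gpus, total_gpu_hours):
    """Days needed to spend total_gpu_hours on num_gpus (linear scaling)."""
    return total_gpu_hours / (num_gpus * 24)


if __name__ == "__main__":
    budget = gpu_hours(32, 1.0)  # 32 A100s for about one day
    print(f"34B variant: about {budget:.0f} A100 GPU-hours")
    print(f"Same budget on 8 GPUs: about {days_on(8, budget):.0f} days")
```

Under these assumptions, the 34B run comes to roughly 768 A100 GPU-hours, or about four days on a single 8-GPU node.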

Open-Source Release

To foster further development in the community, the team is open-sourcing LLaVA-NeXT: the code, data, and models will be made publicly available. The following resources are listed:

Demo: Try it out
Code: GitHub Repository (Training code coming soon)
Model: Model Zoo
Data: Coming soon

Benchmark Results

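LLaVA-NeXT has been evaluated on a range of multimodal benchmarks, and it compares favorably with Gemini Pro on several of them. Here are some key results. In the table, Data (PT) and Data (IT) give the amount of pretraining and instruction tuning data, and "-" marks a score that was not reported: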

| Data (PT) | Data (IT) | Model | MMMU (val) | Math-Vista | MMB-ENG | MMB-CN | MM-Vet | LLaVA-Wild | SEED-IMG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N/A | N/A | GPT-4V | 56.8 | 49.9 | 75.8 | 73.9 | 67.6 | - | 71.6 |
| N/A | N/A | Gemini Ultra | 59.4 | 53 | - | - | - | - | - |
| N/A | N/A | Gemini Pro | 47.9 | 45.2 | 73.6 | 74.3 | 64.3 | - | 70.7 |
| 1.4B | 50M | Qwen-VL-Plus | 45.2 | 43.3 | - | - | 55.7 | - | 65.7 |