
LLaVA-NeXT outperforms Gemini Pro on several visual reasoning and OCR benchmarks, while adding higher image resolution support and improved world knowledge for researchers and developers.
In the rapidly evolving landscape of large multimodal models (LMMs), the team behind LLaVA has once again pushed the boundaries with the release of LLaVA-NeXT. Building upon the success of LLaVA-1.5, which was released in October 2023, this new iteration brings significant improvements in visual reasoning, OCR capabilities, and world knowledge. Notably, LLaVA-NeXT outperforms Gemini Pro on several benchmarks, making it a compelling choice for practitioners and researchers alike.
LLaVA-NeXT accepts input images with up to 4x more pixels than its predecessor, allowing the model to capture more visual detail. It supports three aspect ratios, with resolutions of up to 672x672, 336x1344, and 1344x336. This higher resolution matters most for tasks that require fine-grained visual analysis, such as reading small text.
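The "4x more pixels" figure can be checked with a little arithmetic. The sketch below assumes the LLaVA-1.5 baseline is a single 336x336 input (the resolution of its CLIP ViT-L/14 336px vision encoder); that baseline, the view of each resolution as a grid of 336px tiles, and the helper names are assumptions made for illustration, not details taken from the article.

```python
# Assumption: LLaVA-1.5 processes one 336x336 image (not stated in the article).
BASE_SIDE = 336

# Maximum resolutions listed for LLaVA-NeXT, as (width, height).
NEXT_RESOLUTIONS = [(672, 672), (336, 1344), (1344, 336)]


def pixel_ratio(width: int, height: int, base_side: int = BASE_SIDE) -> float:
    """Return how many times more pixels a width x height image has than the base."""
    return (width * height) / (base_side * base_side)


def tile_grid(width: int, height: int, tile: int = BASE_SIDE) -> tuple[int, int]:
    """Return (columns, rows) of tile x tile crops that cover the image exactly."""
    if width % tile or height % tile:
        raise ValueError("resolution is not a multiple of the tile size")
    return width // tile, height // tile


if __name__ == "__main__":
    for w, h in NEXT_RESOLUTIONS:
        cols, rows = tile_grid(w, h)
        print(f"{w}x{h}: {pixel_ratio(w, h):.0f}x pixels, {cols}x{rows} tiles of {BASE_SIDE}px")
```

All three resolutions come out at exactly four times the pixel count of the assumed 336x336 baseline, which is why the square and the two elongated shapes can be described with a single "4x" figure.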
The model's visual reasoning and OCR capabilities have been significantly improved through an optimized visual instruction tuning data mixture. This enhancement ensures better performance in scenarios where text within images needs to be accurately recognized and understood.
LLaVA-NeXT excels in various applications, thanks to its enhanced world knowledge and logical reasoning. It can engage in more complex visual conversations, making it suitable for a broader range of use cases.
The model uses SGLang (Structured Generation Language), a framework for serving large language models, for efficient deployment and inference. This is intended to make LLaVA-NeXT easier to integrate into production environments with low overhead.

Despite these enhancements, LLaVA-NeXT maintains the minimalist design and data efficiency of its predecessor. It reuses the pretrained connector from LLaVA-1.5 and requires fewer than 1M visual instruction tuning samples. The largest variant, with 34 billion parameters, can be trained in approximately one day using 32 A100 GPUs.
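To put the training budget in more familiar units, the quoted figures can be converted to GPU-hours. This is a back-of-the-envelope sketch: it treats "approximately one day" as exactly 24 hours, and the function name is hypothetical.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours for a run that keeps num_gpus busy for the given number of days."""
    return num_gpus * days * 24.0


if __name__ == "__main__":
    # Figures from the article: 32 A100 GPUs for roughly one day (34B variant).
    # Assumption: "approximately one day" is taken as exactly 24 hours.
    total = gpu_hours(num_gpus=32, days=1.0)
    print(f"~{total:.0f} A100 GPU-hours")
```

Under that assumption the 34B run costs on the order of 768 A100 GPU-hours; multiplying by whatever hourly rate applies to your own hardware gives a rough cost estimate.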
To foster further development in the community, the team is open-sourcing LLaVA-NeXT: the code, data, and models will be made publicly available.
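LLaVA-NeXT has been evaluated on a range of multimodal benchmarks, where it outperforms Gemini Pro on several tasks. The table below lists the scores of the proprietary baselines used for comparison; "PT" and "IT" are the pretraining and instruction-tuning data sizes, and "-" marks a score that was not reported: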

| Data (PT) | Data (IT) | Model | MMMU (val) | Math-Vista | MMB-ENG | MMB-CN | MM-Vet | LLaVA-Wild | SEED-IMG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N/A | N/A | GPT-4V | 56.8 | 49.9 | 75.8 | 73.9 | 67.6 | - | 71.6 |
| N/A | N/A | Gemini Ultra | 59.4 | 53 | - | - | - | - | - |
| N/A | N/A | Gemini Pro | 47.9 | 45.2 | 73.6 | 74.3 | 64.3 | - | 70.7 |
| 1.4B | 50M | Qwen-VL-Plus | 45.2 | 43.3 | - | - | 55.7 | - | 65.7 |

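
A claim such as "outperforms model X on several benchmarks" is easiest to check benchmark by benchmark, skipping any column where either score is missing. The sketch below does that for the baseline rows in the table above. The LLaVA-NeXT rows are not part of this excerpt, so the example compares two baselines instead; the data layout and function name are illustrative assumptions.

```python
from typing import Optional

BENCHMARKS = ["MMMU (val)", "Math-Vista", "MMB-ENG", "MMB-CN",
              "MM-Vet", "LLaVA-Wild", "SEED-IMG"]

# Scores copied from the table; None stands for an unreported "-" entry.
BASELINES: dict[str, list[Optional[float]]] = {
    "GPT-4V":       [56.8, 49.9, 75.8, 73.9, 67.6, None, 71.6],
    "Gemini Ultra": [59.4, 53.0, None, None, None, None, None],
    "Gemini Pro":   [47.9, 45.2, 73.6, 74.3, 64.3, None, 70.7],
    "Qwen-VL-Plus": [45.2, 43.3, None, None, 55.7, None, 65.7],
}


def score_gaps(model: str, reference: str) -> dict[str, float]:
    """Return model minus reference for every benchmark both models report."""
    gaps = {}
    for name, a, b in zip(BENCHMARKS, BASELINES[model], BASELINES[reference]):
        if a is not None and b is not None:
            gaps[name] = round(a - b, 1)
    return gaps


if __name__ == "__main__":
    for name, gap in score_gaps("GPT-4V", "Gemini Pro").items():
        print(f"{name:12s} {gap:+.1f}")
```

Adding a row for another model to `BASELINES` and calling `score_gaps` with "Gemini Pro" as the reference shows on which benchmarks that model is ahead and on which it trails.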
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
1 February 2024