LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning


The Engineer

25 Mar 2025

This study disentangles the impact of language models and visual components on multimodal AI performance, offering insights into optimizing model architecture beyond just scaling parameters.

Recent advancements in Multimodal Large Language Models (MLLMs) have underscored the importance of both the visual backbone and the underlying language model. While much of the prior research has focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain largely unexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it challenging to derive optimal design choices.

In a new paper titled "LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning," researchers from AImageLab at the University of Modena and Reggio Emilia introduce LLaVA-MORE, a family of MLLMs that integrates recent language models with diverse visual backbones. The paper analyzes how each component contributes to the final results, with the goal of informing the design of more effective MLLMs.

Key Technical Changes and Why They Matter

Unified Training Protocol: A main technical contribution is a unified training protocol applied consistently to every architecture. Holding the training setup fixed makes comparisons fair, something previous studies struggled with because their setups varied.
Diverse Model Integration: The study systematically evaluates small- and medium-scale LLMs, including Phi-4, LLaMA-3.1, and Gemma-2, alongside various visual encoders such as CLIP, DINOv2, SigLIP, and SigLIP2. This diversity allows for a more nuanced understanding of how different components interact.
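
The idea of holding the training protocol fixed while swapping components can be sketched as follows. This is a minimal illustration, not the authors' code: the `TrainConfig` fields, their placeholder values, and the `build_grid` helper are hypothetical, and the study may not train every pairing shown here. Only the model and encoder names come from the article.

```python
from dataclasses import dataclass
from itertools import product

# Component names come from the article; everything else is illustrative.
LLMS = ["Phi-4", "LLaMA-3.1", "Gemma-2"]
VISUAL_ENCODERS = ["CLIP", "DINOv2", "SigLIP", "SigLIP2"]


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical hyperparameters shared by every run (placeholder values)."""
    learning_rate: float = 2e-5
    epochs: int = 1
    seed: int = 0


@dataclass(frozen=True)
class Run:
    llm: str
    visual_encoder: str
    config: TrainConfig


def build_grid(config: TrainConfig) -> list[Run]:
    """Pair every LLM with every visual encoder under one identical config."""
    return [Run(llm, enc, config) for llm, enc in product(LLMS, VISUAL_ENCODERS)]


runs = build_grid(TrainConfig())
print(len(runs), "runs, all sharing", runs[0].config)
```

Because every `Run` carries the same frozen config, any difference in downstream scores can be attributed to the choice of LLM or visual encoder rather than to the training setup.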

Key Findings

Model Size vs. Performance: The researchers found that while larger models generally perform better, the performance gains diminish beyond a certain point. For example, Phi-4 showed significant improvements over smaller models in multimodal reasoning and generation tasks, but the benefits of further scaling were marginal.
Visual Backbone Impact: Different visual backbones had varying impacts on performance. CLIP-based architectures performed well across most tasks, while DINOv2 excelled in image captioning and SigLIP2 showed strong results in object detection.
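
One way to make "diminishing gains" concrete is to compute the score improvement per additional billion parameters between consecutive model sizes. The sketch below assumes results are available as (size, score) pairs; the `marginal_gains` helper is hypothetical and the numbers are invented purely for illustration, not taken from the paper.

```python
def marginal_gains(results):
    """Return the score gain per extra billion parameters between consecutive sizes.

    `results` is a list of (params_in_billions, score) pairs in any order.
    """
    ordered = sorted(results)
    gains = []
    for (p0, s0), (p1, s1) in zip(ordered, ordered[1:]):
        if p1 == p0:
            raise ValueError("duplicate model size")
        gains.append((s1 - s0) / (p1 - p0))
    return gains


# Made-up numbers for illustration only; NOT results from the paper.
example = [(2.0, 60.0), (4.0, 66.0), (8.0, 68.0)]
gains = marginal_gains(example)
print(gains)
```

A sequence of shrinking values, as in this invented example, is what diminishing returns from scaling would look like.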

Implementation Details

Training Data: The models were trained using a unified dataset that combines multiple sources to ensure consistency. This includes the COCO, VQAv2, and Visual Genome datasets.
Evaluation Metrics: Performance was evaluated using standard metrics for multimodal tasks such as BLEU, ROUGE, and CIDEr for text generation, and mAP (mean Average Precision) for object detection.

Additional Experiments

Image Resolution: The study also explored how increased image resolution affects performance. Higher-resolution images generally improved results, but with diminishing returns beyond a certain point.
Pre-training Datasets: Variations in pre-training datasets were examined to understand their impact on final model performance. Pre-training on larger and more diverse datasets consistently led to better results.
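
One reason higher resolution is a trade-off rather than a free win can be sketched with simple arithmetic. The example assumes a plain ViT-style encoder that splits a square image into non-overlapping patches; the article does not say how the study raised resolution, and the image sizes (224, 336) and patch size (14) are illustrative choices, not values from the paper.

```python
def num_visual_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens a plain ViT produces for a square image (CLS token excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Illustrative sizes only; not taken from the paper.
tokens_224 = num_visual_tokens(224, 14)
tokens_336 = num_visual_tokens(336, 14)
print(tokens_224, tokens_336, tokens_336 / tokens_224)
```

Under these assumptions, a 1.5x increase in side length yields 2.25x as many visual tokens for the language model to process, so accuracy gains from resolution have to be weighed against the added compute.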

Reproducibility

To facilitate reproducibility and future research, the authors have made their source code and trained models publicly available on GitHub.

Conclusion

The LLaVA-MORE study provides a comprehensive framework for evaluating MLLMs, highlighting the importance of both language model size and visual backbone architecture. By introducing a unified training protocol and systematically analyzing various components, the researchers offer valuable insights that can guide the development of more effective multimodal models.