
This study disentangles the impact of language models and visual components on multimodal AI performance, offering insights into optimizing model architecture beyond just scaling parameters.
Recent advancements in Multimodal Large Language Models (MLLMs) have underscored the importance of both the visual backbone and the underlying language model. While much of the prior research has focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain largely unexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it challenging to derive optimal design choices.
In a new paper titled "LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning," researchers from AImageLab at the University of Modena and Reggio Emilia introduce LLaVA-MORE, a family of MLLMs that integrates recent language models with diverse visual backbones. The paper analyzes how each component contributes to the final result, with the goal of informing the design of more effective MLLMs.
Unified Training Protocol: A central technical contribution is a unified training protocol applied consistently across all architectures. Holding the training setup fixed keeps comparisons fair, which earlier studies struggled to achieve because their training setups varied.
Diverse Model Integration: The study systematically evaluates small- and medium-scale LLMs, including Phi-4, LLaMA-3.1, and Gemma-2, alongside various visual encoders such as CLIP, DINOv2, SigLIP, and SigLIP2. This diversity allows for a more nuanced understanding of how different components interact.
Model Size vs. Performance: The researchers found that while larger models generally perform better, the performance gains diminish beyond a certain point. For example, Phi-4 showed significant improvements over smaller models in multimodal reasoning and generation tasks, but the benefits of further scaling were marginal.
Visual Backbone Impact: Different visual backbones had varying impacts on performance. CLIP-based architectures performed well across most tasks, while DINOv2 excelled in image captioning and SigLIP2 showed strong results in object detection.
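The comparison setup described above can be pictured as a grid of runs that share one training recipe. The following is a minimal sketch, not the authors' code: the component names come from the article, while `TrainingProtocol`, `Experiment`, `build_grid`, and every hyperparameter value are hypothetical placeholders. It also assumes a full cross-product of LLMs and encoders, which the paper may not actually run.

```python
from dataclasses import dataclass
from itertools import product

# Component names are taken from the article.
LLMS = ["Phi-4", "LLaMA-3.1", "Gemma-2"]
VISUAL_ENCODERS = ["CLIP", "DINOv2", "SigLIP", "SigLIP2"]


@dataclass(frozen=True)
class TrainingProtocol:
    """Hypothetical shared recipe; the values are placeholders, not the paper's."""
    learning_rate: float = 2e-5
    batch_size: int = 128
    epochs: int = 1
    seed: int = 0


@dataclass(frozen=True)
class Experiment:
    """One run: an LLM, a visual encoder, and the shared protocol."""
    llm: str
    visual_encoder: str
    protocol: TrainingProtocol


def build_grid(protocol: TrainingProtocol) -> list[Experiment]:
    """Pair every LLM with every visual encoder under the same protocol."""
    return [
        Experiment(llm, encoder, protocol)
        for llm, encoder in product(LLMS, VISUAL_ENCODERS)
    ]


grid = build_grid(TrainingProtocol())
print(len(grid))  # 3 LLMs x 4 encoders = 12 runs
```

Because every `Experiment` carries the same `TrainingProtocol` instance, any difference in results can be attributed to the choice of LLM or visual encoder rather than to the training setup.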

Training Data: The models were trained using a unified dataset that combines multiple sources to ensure consistency. This includes the COCO, VQAv2, and Visual Genome datasets.
Evaluation Metrics: Performance was evaluated using standard metrics for multimodal tasks such as BLEU, ROUGE, and CIDEr for text generation, and mAP (mean Average Precision) for object detection.
Image Resolution: The study also explored the effect of increased image resolution. Higher-resolution images generally improved performance, but the gains shrank as resolution continued to rise.
Pre-training Datasets: Variations in pre-training datasets were examined to understand their impact on final model performance. Pre-training on larger and more diverse datasets consistently led to better results.
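One way to see why resolution is a trade-off rather than a free win: for a ViT-style encoder, the number of visual tokens passed to the language model grows with the square of the image side. The sketch below is illustrative only. The patch size of 14 and the listed resolutions are assumptions, not values reported in the article, and `num_visual_tokens` is a hypothetical helper that ignores CLS tokens, pooling, and tiling schemes.

```python
def num_visual_tokens(image_size: int, patch_size: int) -> int:
    """Count the patches a ViT-style encoder produces for a square image.

    Assumes non-overlapping square patches and no token pooling; any
    remainder pixels at the border are dropped.
    """
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Illustrative resolutions with an assumed patch size of 14.
for size in (224, 336, 448, 672):
    print(size, num_visual_tokens(size, 14))
# 224 256
# 336 576
# 448 1024
# 672 2304
```

Tripling the side length from 224 to 672 multiplies the token count by nine, so even modest accuracy gains at higher resolution come with a much longer input sequence for the language model.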
To facilitate reproducibility and future research, the authors have made their source code and trained models publicly available in a GitHub repository.
The LLaVA-MORE study provides a comprehensive framework for evaluating MLLMs, highlighting the importance of both language model size and visual backbone architecture. By introducing a unified training protocol and systematically analyzing various components, the researchers offer valuable insights that can guide the development of more effective multimodal models.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
25 March 2025