Vision Language Models: The Year of Smaller, Stronger, and More Capable Architectures

The Engineer

13 May 2025

As VLMs shrink in size and grow in capability, 2025 has brought a wave of new architectures that combine vision and language in a single model.

Published: May 12, 2025

Vision Language Models (VLMs) have been a hot topic in the AI community for quite some time. Our previous blog post, from April 2024, covered VLMs with a focus on LLaVA, one of the first successful and easily reproducible open-source vision language models. It also shared tips on discovering, evaluating, and fine-tuning these models.

Since then, the landscape has evolved dramatically. Models have become smaller yet more powerful, new architectures have emerged, and specialized capabilities like reasoning and multimodal agency are now commonplace. In this post, we’ll break down the key changes and emerging trends in VLMs over the past year.

New Model Trends

Any-to-Any Models

One of the most exciting developments is the rise of any-to-any models. These models accept several input modalities (text, images, video) and can generate output in more than one modality rather than text alone. This versatility makes them useful across a broad range of applications, from content generation to complex reasoning tasks.

Reasoning Models

Reasoning has become a crucial capability for VLMs. New architectures are designed to understand the context and relationships between different elements in multimodal inputs. For example, models can now reason about object interactions in images or videos, making them more adept at tasks like visual question answering and scene understanding.

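Key Architectures: Encoder-style models such as VisualBERT and ViLT established early baselines on these tasks; newer reasoning-oriented VLMs extend the approach by generating intermediate reasoning steps before giving a final answer.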
Benchmarks: These models have shown significant improvements on tasks like VQA (Visual Question Answering) and NLVR² (Natural Language for Visual Reasoning).
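
To make the VQA benchmark concrete, here is a minimal sketch of the commonly quoted simplified VQA accuracy rule: a predicted answer gets full credit when at least three human annotators gave the same answer, and partial credit otherwise. The function name, the lowercase-and-strip normalization, and the sample answers are illustrative assumptions; official evaluation scripts apply more thorough answer normalization and averaging.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(1, number_of_matching_annotators / 3)."""
    normalized = prediction.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == normalized)
    return min(1.0, matches / 3)


# Hypothetical annotations for the question "What color is the car?"
answers = ["red", "red", "dark red", "red", "maroon",
           "red", "red", "crimson", "red", "red"]

score_common = vqa_accuracy("Red", answers)    # matches seven annotators
score_rare = vqa_accuracy("maroon", answers)   # matches one annotator
score_wrong = vqa_accuracy("blue", answers)    # matches no annotator
```

Under this rule an answer shared by most annotators scores 1.0, while an answer given by a single annotator scores one third, so a model is not fully penalized for a plausible minority answer.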

Smol Yet Capable Models

The trend towards smaller, more efficient models continues. Researchers are now achieving state-of-the-art performance with models that require fewer parameters and fewer computational resources. This is particularly important for deploying VLMs in resource-constrained environments.

Notable Examples: SmolVLM, released in sizes from 256M to 2.2B parameters, achieves competitive results on a variety of tasks while being far smaller than earlier VLMs.
Implementation Notes: Techniques like knowledge distillation and pruning help reduce model size while limiting the loss in performance.
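
As an illustration of knowledge distillation, the sketch below computes the classic soft-target loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. This is a generic textbook formulation written in plain Python, not the training recipe of any specific model named in this post; the function names, logits, and temperature value are assumptions chosen for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]                                # hypothetical teacher logits
loss_far = distillation_loss([0.5, 1.0, 4.0], teacher)   # student disagrees with teacher
loss_near = distillation_loss([3.8, 1.1, 0.4], teacher)  # student almost agrees
loss_same = distillation_loss(teacher, teacher)          # identical distributions
```

The loss shrinks as the student's distribution approaches the teacher's, which is the signal a smaller model is trained on. In practice it is usually combined with the ordinary task loss on ground-truth labels.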

Mixture-of-Experts as Decoders

Another significant trend is the use of mixture-of-experts (MoE) architectures as decoders. In an MoE layer, a learned router sends each token to a small subset of expert sub-networks, so only a fraction of the model's parameters is active for any given token.

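Benefits: Experts can specialize, and because only a few of them run per token, the model gets the capacity of a large network at a lower compute cost per token.
Challenges: Routing makes these models harder to train and optimize, and all experts must still be held in memory at inference time.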
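
The routing idea can be sketched in a few lines of plain Python. The example below implements top-k routing for a single token: the router scores every expert, the two highest-scoring experts are run, and their outputs are mixed with renormalized weights. The toy experts, the router logits, and all function names are hypothetical; a real MoE decoder uses learned feed-forward experts and a learned linear router, and usually adds a load-balancing loss that is omitted here.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=2):
    """Return [(expert_index, weight)] for the top_k experts of one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize the router scores over the chosen experts only.
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(token, router_logits, experts, top_k=2):
    """Weighted sum of the outputs of the selected experts only."""
    output = [0.0] * len(token)
    for index, weight in route_token(router_logits, top_k):
        expert_out = experts[index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


# Four toy experts, each a simple scaling (stand-ins for feed-forward blocks).
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
router_logits = [0.1, 2.0, 0.3, 2.0]  # a learned router would produce these

selected = route_token(router_logits)
mixed = moe_layer(token, router_logits, experts)
```

Only two of the four experts are evaluated for this token, which is where the compute saving comes from, while the parameters of all four still have to be stored.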

Vision Language Action Models

Vision Language Action (VLA) models are a new class of VLMs that integrate action understanding into the mix. These models can not only understand visual and textual inputs but also predict and generate actions based on those inputs.

Applications: VLA models have shown promise in areas like robotics, where they can help robots understand and execute complex tasks.
Research Directions: Current research is focused on improving the temporal understanding of these models to handle long video sequences more effectively.
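
One way a language-model decoder can "generate actions" is to treat them as tokens: some VLA designs discretize each dimension of a continuous robot command into a fixed number of bins and let the model predict the bin ids. The sketch below shows that encoding and its inverse. The 256-bin resolution, the [-1, 1] range, the 7-dimensional command, and the function names are all assumptions for illustration, not details taken from a specific model.

```python
def action_to_tokens(action, low=-1.0, high=1.0, num_bins=256):
    """Map each continuous action dimension to a bin id in [0, num_bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        fraction = (clipped - low) / (high - low)
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def tokens_to_action(tokens, low=-1.0, high=1.0, num_bins=256):
    """Decode bin ids back to the center value of each bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


# Hypothetical 7-dimensional command: position delta, rotation delta, gripper.
command = [0.10, -0.25, 0.0, 0.5, -1.0, 1.0, 0.33]
tokens = action_to_tokens(command)
decoded = tokens_to_action(tokens)
```

Decoding recovers each value to within half a bin width, so the bin count trades vocabulary size against control precision.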

Specialized Capabilities

Object Detection, Segmentation, Counting with Vision Language Models

VLMs are now capable of performing advanced computer vision tasks like object detection, segmentation, and counting. This integration of language and vision capabilities opens up new possibilities for applications in fields like autonomous driving and medical imaging.

Architectures: Detection models such as DETR have been adapted to take text queries alongside images (MDETR is one example).
Performance: These models have shown competitive results on benchmarks like COCO (Common Objects in Context).
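
Because a VLM emits text, detection results have to be encoded in the output string and parsed back into boxes. The sketch below assumes a location-token convention modeled on the one PaliGemma uses: four `<locNNNN>` tokens per object on a 1024-step grid, ordered y_min, x_min, y_max, x_max, followed by a label. The exact format differs between models and should be checked against the model card; the sample output string, the `;` separator, and the function names are hypothetical. The `iou` helper computes intersection over union, the overlap measure that COCO-style detection metrics are built on.

```python
import re
from collections import Counter

LOC_PATTERN = re.compile(r"((?:<loc\d{4}>){4})\s*([^<;]+)")


def parse_detections(text, width, height, grid=1024):
    """Turn '<locY0><locX0><locY1><locX1> label' spans into pixel boxes."""
    detections = []
    for locs, label in LOC_PATTERN.findall(text):
        y0, x0, y1, x1 = (int(v) / grid for v in re.findall(r"\d{4}", locs))
        box = (x0 * width, y0 * height, x1 * width, y1 * height)
        detections.append((label.strip(), box))
    return detections


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical model output for a 640x480 image.
output = (
    "<loc0256><loc0128><loc0768><loc0512> cat ; "
    "<loc0000><loc0512><loc0512><loc1024> cat ; "
    "<loc0512><loc0000><loc1024><loc0256> dog"
)
detections = parse_detections(output, width=640, height=480)
counts = Counter(label for label, _ in detections)  # counting falls out of detection
```

Counting is then a matter of tallying labels, and each parsed box can be compared against a ground-truth box with `iou` to decide whether it counts as a correct detection.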

Conclusion

The past year has seen significant advancements in vision language models, with a focus on making them smaller, more powerful, and capable of handling complex tasks. The rise of any-to-any models, reasoning capabilities, and specialized architectures like VLA models is reshaping the landscape of AI research and