
As VLMs shrink in size and surge in capability, 2025 witnesses a proliferation of innovative architectures that seamlessly blend vision and language, pushing the boundaries of AI's multimodal prowess.
Published: May 12, 2025
Vision Language Models (VLMs) have been a hot topic in the AI community for quite some time. In our previous blog post from April 2024, we delved into the world of VLMs, focusing on LLaVA, one of the first successful and easily reproducible open-source vision language models. We also shared tips on discovering, evaluating, and fine-tuning these models.
Since then, the landscape has evolved dramatically. Models have become smaller yet more powerful, new architectures have emerged, and specialized capabilities like reasoning and multimodal agency are now commonplace. In this post, we’ll break down the key changes and emerging trends in VLMs over the past year.
One of the most exciting developments is the rise of any-to-any models. These models can handle a wide range of input types (text, images, videos) and generate outputs in various formats. This versatility makes them incredibly useful for a broad spectrum of applications, from content generation to complex reasoning tasks.
Reasoning has become a crucial capability for VLMs. New architectures are designed to understand the context and relationships between different elements in multimodal inputs. For example, models can now reason about object interactions in images or videos, making them more adept at tasks like visual question answering and scene understanding.
The trend towards smaller, more efficient models continues. Researchers are now achieving state-of-the-art performance with models that have fewer parameters and need less compute. This is particularly important for deploying VLMs in resource-constrained environments.
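To make the resource argument concrete, here is a rough back-of-the-envelope sketch of how parameter count and numeric precision translate into memory for the weights alone. The model sizes are hypothetical round numbers, not figures for any specific VLM, and the estimate ignores activations, KV cache, and the vision encoder's image buffers, all of which add to the real footprint.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Hypothetical model sizes, chosen only to illustrate the scaling.
model_sizes = {"small (2B params)": 2e9, "large (70B params)": 70e9}

# Bytes per parameter for two common storage precisions.
precisions = {"fp16": 2.0, "int4": 0.5}

for model_name, num_params in model_sizes.items():
    for precision_name, nbytes in precisions.items():
        gb = weight_memory_gb(num_params, nbytes)
        print(f"{model_name} @ {precision_name}: ~{gb:.1f} GB of weights")
```

Under these assumptions, weight memory scales linearly with both parameter count and bytes per parameter, which is why a smaller model, possibly quantized, is what fits on a laptop or edge device.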

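Another significant trend is the use of mixture-of-experts (MoE) architectures as decoders. Rather than running every parameter on every input, an MoE layer uses a learned router to send each token to a small subset of expert sub-networks. This lets a model hold a large total parameter count while activating only a fraction of it per token, which helps it handle diverse inputs more effectively.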
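The routing idea can be sketched in a few lines of plain Python. This is a toy illustration of top-k gating, assuming a dot-product router and experts that simply scale their input. The names `route_top_k`, `moe_layer`, `experts`, and `router_weights` are hypothetical and do not correspond to any particular model's implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, experts, router_weights, k=2):
    """Run one token through only its top-k experts and mix their outputs."""
    # Router: one logit per expert, here a plain dot product with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = route_top_k(logits, k)
    out = [0.0] * len(token)
    for idx, weight in chosen:
        expert_out = experts[idx](token)  # experts not chosen are never called
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in chosen]


# Toy setup (assumption): 4 experts, each scaling the token by a fixed factor.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, used = moe_layer([0.5, 2.0], experts, router_weights, k=2)
print("experts used:", used)
print("output:", output)
```

With four experts and `k=2`, only half of the expert computation runs for this token. In a real MoE decoder the router and experts are trained jointly, and routing typically happens per token at each MoE layer.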
Vision Language Action (VLA) models are a new class of VLMs that integrate action understanding into the mix. These models can not only understand visual and textual inputs but also predict and generate actions based on those inputs.
VLMs are now capable of performing advanced computer vision tasks like object detection, segmentation, and counting. This integration of language and vision capabilities opens up new possibilities for applications in fields like autonomous driving and medical imaging.
The past year has seen significant advancements in vision language models, with a focus on making them smaller, more powerful, and capable of handling complex tasks. The rise of any-to-any models, reasoning capabilities, and specialized architectures like VLA models is reshaping the landscape of AI research and
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.