
VSA pairs vision-language models with live web search, so that they can answer questions about images unlike anything in their training data.
Vision Search Assistant (VSA) is a framework that combines large Vision Language Models (VLMs) with web agents to address a known limitation in multimodal search: VLMs handle familiar content well but struggle with visual content that was absent from their training data. Developed by researchers from MMLab at CUHK, Shanghai AI Lab, and Tencent, VSA improves a model's ability to answer questions about unseen images by retrieving web knowledge at query time.
The core innovation in VSA is its ability to combine the visual understanding of a VLM with the real-time information retrieval capabilities of a web agent. Here’s how it works:
Correlated Formulation: The VLM identifies critical objects in an image and generates descriptions that consider their relationships. This step ensures that the model captures not just individual elements but also their contextual interactions.
Planning Agent: Using a Large Language Model (LLM), the system formulates sub-questions based on the initial object descriptions. These sub-questions guide the search process to gather relevant information from the web.
Searching Agent: The same LLM analyzes, selects, and summarizes the web pages returned by the search engine. This step filters the raw results so that only information relevant to each sub-question is carried forward.
Final Answer Generation: The VSA combines the original image, user prompt, correlated formulation, and web knowledge to generate a final answer. This integration allows the model to provide reliable responses even for novel images.
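The four stages above can be sketched as a small orchestration layer. This is a minimal illustration rather than the authors' implementation: the names `VSAPipeline`, `Page`, and the five callables are hypothetical, and each callable stands in for whichever VLM, LLM, or search backend you actually use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    """A web page returned by the search backend (hypothetical shape)."""
    url: str
    text: str


@dataclass
class VSAPipeline:
    # All five callables are assumptions for illustration; plug in real models.
    describe: Callable[[bytes, str], str]        # VLM: image + question -> correlated description
    plan: Callable[[str, str], List[str]]        # LLM: description + question -> sub-questions
    search: Callable[[str], List[Page]]          # search engine: query -> pages
    summarize: Callable[[str, List[Page]], str]  # LLM: sub-question + pages -> summary
    answer: Callable[[bytes, str, str, List[str]], str]  # VLM: final answer

    def run(self, image: bytes, question: str) -> str:
        # 1. Correlated formulation: describe key objects and their relationships.
        description = self.describe(image, question)
        # 2. Planning agent: break the query into searchable sub-questions.
        sub_questions = self.plan(description, question)
        # 3. Searching agent: retrieve pages and summarize them per sub-question.
        knowledge = [self.summarize(q, self.search(q)) for q in sub_questions]
        # 4. Final answer: combine image, prompt, description, and web knowledge.
        return self.answer(image, question, description, knowledge)
```

Passing the models in as plain callables keeps the orchestration independent of any particular VLM, which matches the article's point that the approach can wrap existing models rather than requiring a new one.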
For practitioners, this framework offers several key advantages:
Enhanced Accuracy: By leveraging real-time web information, VSA can provide more accurate answers to questions about unfamiliar objects or events.
Scalability: The approach can be applied to existing VLMs without retraining them whenever new information appears, which would be computationally expensive.

The VSA framework is built on top of LLaVA-1.6-7B, a well-known VLM. Here are some key implementation details:
Model Architecture:
Web Search Integration:
Benchmarks:
Consider a scenario where a user uploads an image of a rare plant they encountered during a hike and asks for its name and properties. Traditional VLMs might struggle if the plant is not in their training data. However, VSA can identify key features of the plant, search the web for relevant information, and provide a detailed answer about the plant's name, habitat, and uses.
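For the plant scenario, the last step amounts to packing the image description and the web summaries into one prompt for the VLM. The sketch below shows one way that could look; `build_final_prompt` and its template wording are assumptions for illustration, not the prompt used by VSA, and the image itself would be passed to the model separately.

```python
from typing import List


def build_final_prompt(question: str, description: str, web_summaries: List[str]) -> str:
    """Assemble a text prompt from the user question, the VLM's description,
    and the per-sub-question web summaries (hypothetical template)."""
    # Number each summary so the model can refer to individual findings.
    evidence = "\n".join(f"[{i}] {s}" for i, s in enumerate(web_summaries, start=1))
    if not evidence:
        evidence = "(no web results)"
    return (
        "Image description:\n" + description + "\n\n"
        "Web findings:\n" + evidence + "\n\n"
        "Question: " + question + "\n"
        "Answer using the image and the findings above."
    )
```

Calling `build_final_prompt("What plant is this?", description, summaries)` with the description from the first stage and the summaries from the searching agent yields a single string that can accompany the uploaded photo.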
Vision Search Assistant represents a significant step forward in multimodal search by addressing the limitations of traditional VLMs when dealing with unseen visual content. By integrating real-time web knowledge, VSA enhances accuracy, scalability, and open-world capabilities, making it a valuable tool for practitioners working with vision-language models.
Original Sources
↗ https://cnzzx.github.io/VSA/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
6 February 2025