
Despite its modest size, Imp-v1-3B outperforms larger models on multimodal tasks. It builds on Phi-2 and SigLIP, which makes it a standout in both efficiency and performance.
In the rapidly evolving landscape of multimodal language models (MLMs), size often correlates with performance. However, the recent release of Imp-v1-3B by MILVLG challenges this notion. This model, packing a mere 3 billion parameters, demonstrates competitive and even superior performance compared to much larger counterparts on various benchmarks. Let's dive into what makes Imp-v1-3B stand out.
Imp-v1-3B is built upon two components: Phi-2, a small language model, and SigLIP, a visual encoder.
These components are combined and fine-tuned on the LLaVA-v1.5 training set, which is known for its high-quality multimodal data. The result is a compact model that delivers impressive results.
Imp-v1-3B can be loaded with transformers, making it easy for developers to integrate into their workflows.
Imp-v1-3B was evaluated on nine commonly used benchmarks: five academic VQA (Visual Question Answering) benchmarks and four other popular datasets. The results are striking:
To get started with Imp-v1-3B, you'll need to install the necessary dependencies. Here’s a step-by-step guide:
pip install transformers # Latest version is fine, but v4.39.2 is recommended
pip install -q pillow accelerate einops
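Before moving on, it can help to confirm that the packages from the commands above are actually visible to the Python interpreter you plan to run. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical, and the package list simply mirrors the pip commands above (torch is assumed to be installed separately).

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the install commands above; torch is assumed to be installed separately.
REQUIRED = ["transformers", "pillow", "accelerate", "einops"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED points to an install that went into a different environment than the one running the script.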

Here's an example of how to use Imp-v1-3B for inference. This code snippet demonstrates how to generate a response to a visual question:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

# Set default device to GPU
torch.set_default_device("cuda")

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "MILVLG/imp-v1-3b",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("MILVLG/imp-v1-3b", trust_remote_code=True)

# Set up the input
text = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat are the colors of the bus in the image? ASSISTANT:"
image = Image.open("images/bus.jpg")

# Tokenize the input text
input_ids = tokenizer(text, return_tensors='pt').input_ids

# Preprocess the image
image_tensor = model.image_preprocess(image)

# Generate the answer
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    images=image_tensor,
    use_cache=True)[0]

# Decode and print the output, skipping the prompt tokens at the start
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
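Two details in the example above are easy to get wrong when adapting it: the exact prompt layout, and the slice that drops the prompt tokens from the generated sequence. The sketch below isolates both in plain Python so they can be checked without loading the model. The names build_prompt and strip_prompt are hypothetical helpers, not part of the model's API, and the prompt format is assumed to be exactly the single-turn, single-image layout shown in the example.

```python
# System prompt copied from the inference example above.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(question: str) -> str:
    """Wrap a question in the single-turn, single-image prompt layout shown above."""
    return f"{SYSTEM_PROMPT} USER: <image>\n{question.strip()} ASSISTANT:"


def strip_prompt(output_ids, prompt_length):
    """Drop the leading prompt tokens, keeping only the newly generated ones."""
    return output_ids[prompt_length:]


if __name__ == "__main__":
    # Print the prompt for the question used in the example.
    print(build_prompt("What are the colors of the bus in the image?"))
```

In the example, prompt_length corresponds to input_ids.shape[1], so strip_prompt(output_ids, input_ids.shape[1]) is the same slice that is passed to tokenizer.decode.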
The team at MILVLG is committed to continuously improving Imp-v1-3B. They plan to release more versions with enhanced performance and additional features. The detailed technical report and training/evaluation code will be available soon on their GitHub repository.
Imp-v1-3B is a compelling addition to the
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
29 January 2024