
Bunny-Llama-3-8B-V combines the efficiency of SigLIP’s vision encoder with Llama-3’s language backbone, offering a compact solution for complex multimodal tasks like image captioning and visual question answering.
Bunny-Llama-3-8B-V is the latest addition to the Bunny family of lightweight multimodal models. It pairs the SigLIP vision encoder with the Llama-3-8B-Instruct language backbone and targets tasks that need both visual and textual understanding, such as image captioning and visual question answering.
To get started with Bunny-Llama-3-8B-V, you'll need to install the following dependencies:
pip install torch transformers accelerate pillow
To run on a specific GPU, set the CUDA_VISIBLE_DEVICES environment variable to that GPU's index. The command below exposes only the first GPU to the process:
export CUDA_VISIBLE_DEVICES=0
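To confirm which GPUs a process will see, the variable can be inspected before any CUDA work starts. The following is a minimal sketch using only the standard library. `visible_gpu_ids` is a hypothetical helper name, not part of PyTorch or Transformers, and it assumes the variable holds a comma-separated list of device identifiers.

```python
import os
from typing import List, Mapping, Optional


def visible_gpu_ids(env: Mapping[str, str] = os.environ) -> Optional[List[str]]:
    """Return the device identifiers listed in CUDA_VISIBLE_DEVICES.

    None means the variable is unset, so no restriction is applied.
    An empty list means the variable is set but names no devices.
    """
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Split on commas and drop surrounding whitespace and empty entries.
    return [part.strip() for part in raw.split(",") if part.strip()]


ids = visible_gpu_ids()
print("no restriction" if ids is None else f"visible devices: {ids}")
```

Returning None for the unset case keeps "no restriction" distinct from "no devices", which matters when the script falls back to the CPU.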
Here's a code snippet to help you use the model with Hugging Face Transformers:

import warnings

import torch
import transformers
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Silence library logging, progress bars and Python warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')
# Set device (GPU or CPU)
device = 'cuda' # or 'cpu'
torch.set_default_device(device)
# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    'BAAI/Bunny-Llama-3-8B-V',
    torch_dtype=torch.float16,  # Use float32 for CPU
    device_map='auto',
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    'BAAI/Bunny-Llama-3-8B-V',
    trust_remote_code=True
)
# Prepare the text prompt
prompt = 'Why is the image funny?'
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][1:], dtype=torch.long).unsqueeze(0).to(device)
# Load and process the image
image = Image.open('example_2.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)
# Generate the response
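output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True  # Reuse the key/value cache between decoding steps for faster generation
)
# Decode only the newly generated tokens, skipping the prompt ids (which include the -200 image placeholder)
generated_text = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True).strip()
print(generated_text)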
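The least obvious step in the snippet above is how the prompt is split on `<image>` and rejoined around the id -200, which marks where the image goes. The sketch below reproduces only that splicing logic with a toy tokenizer, so it runs without downloading the model. `toy_encode`, `splice_image_token` and `BOS_ID` are hypothetical names for illustration. The real tokenizer produces different ids, and the sketch assumes the prompt contains exactly one `<image>` placeholder.

```python
from typing import Callable, List

IMAGE_PLACEHOLDER = "<image>"
IMAGE_TOKEN_INDEX = -200  # sentinel id used in the snippet above
BOS_ID = 1  # hypothetical begin-of-sequence id for the toy tokenizer


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: BOS followed by one id per word."""
    return [BOS_ID] + [100 + len(word) for word in text.split()]


def splice_image_token(text: str, encode: Callable[[str], List[int]]) -> List[int]:
    """Tokenize the text on either side of a single <image> placeholder and
    join the halves with the sentinel id, dropping the second half's BOS."""
    before, after = text.split(IMAGE_PLACEHOLDER)  # raises unless exactly one placeholder
    head, tail = encode(before), encode(after)
    # tail[1:] mirrors text_chunks[1][1:]: the second chunk starts with its own BOS.
    return head + [IMAGE_TOKEN_INDEX] + tail[1:]


ids = splice_image_token("USER: <image>\nWhy is the image funny? ASSISTANT:", toy_encode)
print(ids)
```

Dropping the first id of the second chunk matters because a tokenizer that prepends a begin-of-sequence token does so for every chunk it encodes. Without the slice, a second such token would appear in the middle of the prompt.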
Bunny-Llama-3-8B-V has been benchmarked against other state-of-the-art multimodal models and performs competitively across a range of tasks. Pairing SigLIP with Llama-3-8B-Instruct keeps the model compact without giving up accuracy.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
26 April 2024