
Explore how integrating LLaVA with Ollama brings advanced visual capabilities to private AI setups, transforming text-based interactions into dynamic multimedia experiences.
When GPT-4V was released, it blew everyone away with its ability to interpret visual information like photographs. Now, a team has developed a free-to-use multimodal model called LLaVA (Large Language and Vision Assistant) that you can run on your own infrastructure. I decided to give it a try using my private Ollama instance running on a virtual machine with a 48GB NVIDIA GPU.
First, I updated my Ollama instance to the latest 0.1.16 build, which now supports LLaVA, and then pulled the model:
ollama pull llava:13b
This command pulls down the 13B parameter version of the LLaVA model. With that out of the way, I was ready to test it via the API.
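Before sending any images, it can help to confirm that the model actually shows up on the server. The sketch below is not from the original setup: it assumes Ollama is listening on its default local address and uses the `/api/tags` endpoint, which lists locally available models. The helper names `model_names` and `is_model_available` are made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
TAGS_URL = "http://localhost:11434/api/tags"


def model_names(tags_response: dict) -> list:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def is_model_available(name: str, url: str = TAGS_URL) -> bool:
    """Return True if the Ollama instance at `url` lists the given model."""
    with urllib.request.urlopen(url) as response:
        return name in model_names(json.loads(response.read()))
```

Calling `is_model_available("llava:13b")` should return `True` once the pull has finished, provided the instance is reachable at the assumed address.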
To pass images through the API, you need to convert them to base64 encoding. Here’s a trimmed-down example of how to structure your request, with the image data truncated:
{
"model": "llava:13b",
"prompt": "What is in this picture?",
"stream": false,
"images": ["/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBQYFBAYGBQYHBwYIChAKCgkJChQODwwQFxQYGBcUFhYaHSUfGhsjHBYWICwgIyYnKSopGR8tMC0oMCUoKSj/2wBDAQcHBwoIChMKChMoGhYaKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCj/wgARC..."]
}
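A minimal sketch of how that request body could be built and sent from Python, using only the standard library. It assumes Ollama is reachable at its default local address and that the body is posted to the `/api/generate` endpoint. The helper names `build_payload` and `ask_llava` are hypothetical, not part of Ollama.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes: bytes, prompt: str, model: str = "llava:13b") -> dict:
    """Return a request body shaped like the JSON example above."""
    # The API expects each image as a base64 string, not raw bytes.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"model": model, "prompt": prompt, "stream": False, "images": [encoded]}


def ask_llava(image_path: str, prompt: str, url: str = OLLAMA_URL) -> str:
    """Send one image and one prompt, and return the model's text reply."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # With "stream": false the server returns a single JSON object.
        return json.loads(response.read())["response"]
```

For example, `ask_llava("pizza.jpg", "What is in this picture?")` would send a local file (the file name here is a placeholder) and return the text of the reply. Keeping `build_payload` separate makes it easy to reuse the same encoded image for several prompts.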

Prompt: What is in this picture?
Response: In this picture, there is a person holding up a giant slice of pizza. The scene takes place in front of a truck, making it an interesting and unconventional setting for such a large piece of food.
Prompt: Is there a vehicle in this image?
Response: Yes, there is a vehicle in the image, specifically a truck.
Prompt: What footwear is being worn?
Response: The person is wearing boots.

Prompt: Describe this picture.
Response: The image shows a colorful bar graph representing the number of students in different colors. Each color stands for a specific number of students, with orange indicating 75 students, green representing 100 students, blue showing 125 students, and yellow symbolizing 150 students. The heights of these bars correspond to the respective number of students they represent. The bar graph is organized in such a way that each color has a distinct line, making it easy to identify the different colors and their corresponding values. This visual representation helps demonstrate the distribution of student numbers throughout the various levels or categories.
Prompt: Give me a concise list of the colors in this image.
Response: 1. Red, 2. Green, 3. Blue, 4. Yellow, 5. Orange, 6. Pink, 7. White.
Prompt: Does the blue bar have a bigger value than the green bar?
Response: Yes, the blue bar has a bigger value than the green bar.
Prompt: Does the green bar have a bigger value than the blue bar?
Response: No, the green bar does not have a bigger value than the blue bar.
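One way to read the last two prompts is as a consistency check: the same comparison asked in both directions should get opposite answers. A rough sketch of automating that check follows. The helpers `says_yes` and `mirrored_answers_agree` are hypothetical, and the "starts with yes" test is a crude heuristic rather than a robust way to parse model output.

```python
def says_yes(response: str) -> bool:
    """Rough heuristic: treat a reply that starts with 'yes' as affirmative."""
    return response.strip().lower().startswith("yes")


def mirrored_answers_agree(answer: str, mirrored_answer: str) -> bool:
    """True when a question and its reversed form get opposite yes/no answers."""
    return says_yes(answer) != says_yes(mirrored_answer)
```

Applied to the two replies above, the check passes, since one starts with "Yes" and the other with "No". It would misfire if the two bars were equal, where "No" is the correct answer in both directions.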
LLaVA brings powerful vision capabilities to your private AI setup, allowing you to interpret and analyze visual data with ease. By running it on Ollama, you can maintain control over your data while leveraging state-of-the-art multimodal models. Whether you're working on a personal project or integrating this into a
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
11 January 2024