Adding Vision to Your Private AI with Ollama and LLaVA

Tools & Engineering

The Engineer

11 Jan 2024 · 3 min read

Explore how integrating LLaVA with Ollama brings advanced visual capabilities to private AI setups, transforming text-based interactions into dynamic multimedia experiences.

When GPT-4V was released, it blew everyone away with its ability to interpret visual information such as photographs. Now, a team has developed a free-to-use multimodal model called LLaVA (Large Language and Vision Assistant) that you can run on your own infrastructure. I decided to give it a try using my private Ollama instance running on a virtual machine with a 48 GB NVIDIA GPU.

Setting Up LLaVA

First, I updated my Ollama instance to the latest 0.1.16 build, which now supports LLaVA, and then pulled the model:

ollama pull llava:13b

This command pulls down the 13B-parameter version of the LLaVA model. With that out of the way, I was ready to test it via the API.
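Before sending any images, it can be worth confirming that the pull actually registered the model. The following is a minimal sketch, assuming Ollama is listening on its default local address and that its model-listing endpoint returns a "models" array whose entries carry a "name" field; the helper names are illustrative and not part of Ollama.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on its default local address.
TAGS_URL = "http://localhost:11434/api/tags"


def model_names(tags_body: dict) -> list:
    """Extract model names from the JSON body of the model-listing endpoint."""
    return [entry["name"] for entry in tags_body.get("models", [])]


def is_pulled(model: str, url: str = TAGS_URL) -> bool:
    """Return True if `model` appears among the locally available models."""
    with urllib.request.urlopen(url) as response:
        return model in model_names(json.load(response))
```

Calling is_pulled("llava:13b") should return True once the pull has finished.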

Using the API

To pass an image through the API, you need to base64-encode it and place the resulting string in the images array. Here’s a truncated example of how to structure your request (the base64 string is cut short):

{
  "model": "llava:13b",
  "prompt": "What is in this picture?",
  "stream": false,
  "images": ["/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBQYFBAYGBQYHBwYIChAKCgkJChQODwwQFxQYGBcUFhYaHSUfGhsjHBYWICwgIyYnKSopGR8tMC0oMCUoKSj/2wBDAQcHBwoIChMKChMoGhYaKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCj/wgARC..."]
}
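For illustration, here is one way the request could be assembled and sent from Python using only the standard library. This is a sketch under assumptions: it presumes Ollama is on its default local address and that the request goes to the generate endpoint; build_payload and ask_llava are hypothetical helper names, not part of Ollama.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is reachable on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes: bytes, prompt: str, model: str = "llava:13b") -> dict:
    """Build the request body from raw image bytes and a text prompt."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,      # ask for one complete JSON reply
        "images": [encoded],  # list of base64 strings, one per image
    }


def ask_llava(image_path: str, prompt: str, url: str = OLLAMA_URL) -> str:
    """Send one image and one prompt; return the model's text reply."""
    with open(image_path, "rb") as image_file:
        payload = build_payload(image_file.read(), prompt)
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # With streaming disabled, the reply text is in the "response" field.
    return body["response"]
```

With a local file (the name pizza.jpg is only a placeholder), a call would look like ask_llava("pizza.jpg", "What is in this picture?").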

Sample Interactions

Prompt: What is in this picture?

Response: In this picture, there is a person holding up a giant slice of pizza. The scene takes place in front of a truck, making it an interesting and unconventional setting for such a large piece of food.

Prompt: Is there a vehicle in this image?

Response: Yes, there is a vehicle in the image, specifically a truck.

Prompt: What footwear is being worn?

Response: The person is wearing boots.

Prompt: Describe this picture.

Response: The image shows a colorful bar graph representing the number of students in different colors. Each color stands for a specific number of students, with orange indicating 75 students, green representing 100 students, blue showing 125 students, and yellow symbolizing 150 students. The heights of these bars correspond to the respective number of students they represent. The bar graph is organized in such a way that each color has a distinct line, making it easy to identify the different colors and their corresponding values. This visual representation helps demonstrate the distribution of student numbers throughout the various levels or categories.

Prompt: Give me a concise list of the colors in this image.

Response: 1. Red, 2. Green, 3. Blue, 4. Yellow, 5. Orange, 6. Pink, 7. White.

Prompt: Does the blue bar have a bigger value than the green bar?

Response: Yes, the blue bar has a bigger value than the green bar.

Prompt: Does the green bar have a bigger value than the blue bar?

Response: No, the green bar does not have a bigger value than the blue bar.
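The last two prompts ask the same comparison in both directions, which is a cheap way to see whether the model is just answering "yes" to everything. As a rough sketch, the two replies can be cross-checked against the bar values LLaVA itself reported in its description of the chart. The helper name and the simple yes/no parsing are illustrative assumptions, and the check only tests the model's self-consistency, not whether it read the chart correctly.

```python
# Bar values as stated in LLaVA's own description of the chart.
reported = {"orange": 75, "green": 100, "blue": 125, "yellow": 150}


def answer_is_consistent(answer: str, first: str, second: str, values: dict) -> bool:
    """Check a yes/no reply to: does the `first` bar have a bigger value than the `second` bar?"""
    said_yes = answer.strip().lower().startswith("yes")
    return said_yes == (values[first] > values[second])


checks = [
    answer_is_consistent(
        "Yes, the blue bar has a bigger value than the green bar.",
        "blue", "green", reported,
    ),
    answer_is_consistent(
        "No, the green bar does not have a bigger value than the blue bar.",
        "green", "blue", reported,
    ),
]
```

Both entries in checks come out True here, meaning the two answers agree with each other and with the values from the earlier description.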

Conclusion

LLaVA brings powerful vision capabilities to your private AI setup, allowing you to interpret and analyze visual data with ease. By running it on Ollama, you can maintain control over your data while leveraging state-of-the-art multimodal models. Whether you're working on a personal project or integrating this into a