LLaVA 1.6 Updates: Enhanced Image Resolution and Text Recognition Capabilities

Models & Research

The Engineer

5 Feb 2024 · 3 min read

LLaVA 1.6 delivers sharper visuals with four times the pixel density, boosting precision in tasks like object detection, while also enhancing text recognition to understand complex visual content more accurately.

February 2, 2024

New LLaVA Models

The LLaVA (Large Language-and-Vision Assistant) model collection has been updated to version 1.6, bringing significant improvements that are particularly relevant for practitioners working with vision tasks. Here’s what’s new and why it matters:

Higher Image Resolution: The models now support up to 4x more pixels, allowing them to capture finer details in images. This is crucial for applications like object detection and image classification, where higher resolution can significantly improve accuracy.
Improved Text Recognition and Reasoning Capabilities: LLaVA has been trained on additional datasets containing documents, charts, and diagrams. This enhancement means the model can better understand and reason about text within complex visual contexts, making it more versatile for tasks like document analysis and OCR (Optical Character Recognition).
More Permissive Licenses: The models are now available under both the Apache 2.0 license and the LLaMA 2 Community License, providing more flexibility for developers.

Model Sizes

LLaVA 1.6 is available in three parameter sizes:

ollama run llava:7b
ollama run llava:13b
ollama run llava:34b

Each size offers a trade-off between performance and resource requirements, allowing you to choose the best fit for your specific use case.

Usage

CLI

To use LLaVA via the command line, reference image files using their paths:

% ollama run llava "describe this image: ./art.jpg"
The image shows a colorful poster featuring an illustration of a cartoon character with spiky hair. Above the character's head is a crown, suggesting royalty or high status. On the right side of the poster, there is text in French that reads "Jean-Michel Basquiat,"

Python

For integration into Python applications, you can use the Ollama library:

import ollama

res = ollama.chat( model="llava", messages=[ { 'role': 'user', 'content': 'Describe this image:', 'images': ['./art.jpg'] } ] )

print(res['message']['content'])


#### JavaScript

For JavaScript applications, the Ollama library also provides a straightforward API:

```javascript
import ollama from 'ollama'

const res = await ollama.chat({
    model: 'llava',
    messages: [
        {
            role: 'user',
            content: 'Describe this image:',
            images: ['./art.jpg']
        }
    ]
})

console.log(res.message.content)

Note: In the Ollama Python and JavaScript libraries, as well as the REST API, you can provide base64-encoded files in the images parameter. Refer to the API docs for more details.

Examples

Object Detection

% ollama run llava "tell me what do you see in this picture? ./pic.jpg"
In the image, a man wearing blue and white is holding video game controllers in his hands while smiling. His arms are raised above his head as he plays with the virtual reality gaming devices. The man appears to be enjoying himself and immersed in his gaming experience.

Text Recognition

% ollama run llava "what does the text say? ./wordart.png"
This is an image with a capitalized, bolded text of "ollama." The background is black and the letters are yellow, which create a contrasting appearance. Although the words "ollama" make up the majority of the image, it could also be seen as part of a sign or advertisement due to its brightness and prominent font style.

Thank You LLaVA Team

The LLaVA 1.6 model family is published by an incredible team. For more information on how these models were trained, visit the LLaVA website.