
Share
LLaVA 1.6 delivers sharper visuals with four times the pixel density, boosting precision in tasks like object detection, while also enhancing text recognition to understand complex visual content more accurately.
The LLaVA (Large Language-and-Vision Assistant) model collection has been updated to version 1.6, bringing significant improvements that are particularly relevant for practitioners working with vision tasks. Here’s what’s new and why it matters:
LLaVA 1.6 is available in three parameter sizes:
ollama run llava:7bollama run llava:13bollama run llava:34bEach size offers a trade-off between performance and resource requirements, allowing you to choose the best fit for your specific use case.
To use LLaVA via the command line, reference image files using their paths:
% ollama run llava "describe this image: ./art.jpg"
The image shows a colorful poster featuring an illustration of a cartoon character with spiky hair. Above the character's head is a crown, suggesting royalty or high status. On the right side of the poster, there is text in French that reads "Jean-Michel Basquiat,"
For integration into Python applications, you can use the Ollama library:
import ollama

res = ollama.chat( model="llava", messages=[ { 'role': 'user', 'content': 'Describe this image:', 'images': ['./art.jpg'] } ] )
print(res['message']['content'])
#### JavaScript
For JavaScript applications, the Ollama library also provides a straightforward API:
```javascript
import ollama from 'ollama'
const res = await ollama.chat({
model: 'llava',
messages: [
{
role: 'user',
content: 'Describe this image:',
images: ['./art.jpg']
}
]
})
console.log(res.message.content)
Note: In the Ollama Python and JavaScript libraries, as well as the REST API, you can provide base64-encoded files in the images parameter. Refer to the API docs for more details.
% ollama run llava "tell me what do you see in this picture? ./pic.jpg"
In the image, a man wearing blue and white is holding video game controllers in his hands while smiling. His arms are raised above his head as he plays with the virtual reality gaming devices. The man appears to be enjoying himself and immersed in his gaming experience.
% ollama run llava "what does the text say? ./wordart.png"
This is an image with a capitalized, bolded text of "ollama." The background is black and the letters are yellow, which create a contrasting appearance. Although the words "ollama" make up the majority of the image, it could also be seen as part of a sign or advertisement due to its brightness and prominent font style.
The LLaVA 1.6 model family is published by an incredible team. For more information on how these models were trained, visit the LLaVA website.
Tags
Original Sources
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
5 February 2024
88 articles
Related Articles
Related Articles
More Stories