
X.ai's Grok-1.5V integrates advanced vision capabilities with existing text-processing skills, enabling it to understand complex visual data and bridge the gap between digital and physical worlds.
April 12, 2024
X.ai has announced a preview of Grok-1.5V, its first-generation multimodal model, which extends Grok's capabilities to process and understand visual information alongside text. The company presents the release as a step toward bridging the gap between digital and physical worlds through improved real-world understanding.
Grok-1.5V builds on the strong text processing abilities of its predecessor by adding robust support for various types of visual data, including documents, diagrams, charts, screenshots, and photographs. This multimodal approach allows Grok to handle complex tasks that require both textual and visual reasoning, making it a versatile tool for a wide range of applications.
Grok-1.5V has been benchmarked against leading multimodal models in several domains, with competitive performance reported across them.
One of the standout features of Grok-1.5V is its enhanced real-world understanding, particularly in spatial reasoning. X.ai has introduced a new benchmark called RealWorldQA to evaluate this capability. The benchmark focuses on basic real-world spatial understanding tasks, such as interpreting diagrams and understanding physical layouts.
The model can also reason over diagrams. For example, consider the task of translating a flowchart into Python code:

User: Can you translate this into Python code?
Grok-1.5V: Certainly! The flowchart you’ve provided describes a simple guessing game where the computer generates a random number, and the user has to guess it. Here’s the Python code that represents the logic in the flowchart:
import random

def guess_number():
    target = random.randint(1, 10)
    while True:
        guess = int(input("Guess a number between 1 and 10: "))
        if guess == target:
            print("You won!")
            break
        else:
            print("Wrong guess, try again!")

if __name__ == "__main__":
    guess_number()
This code will run the guessing game as described in the flowchart, prompting the user to enter their guess until they correctly identify the random number.
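The generated loop reads directly from the keyboard and crashes on non-numeric input, so it is hard to check automatically. The following is a minimal sketch, not part of Grok's output or X.ai's announcement, of the same flowchart logic with the guesses passed in as data. The names play, prompted_guesses and report, and the extra "Please enter a whole number." message, are assumptions introduced here for illustration.

```python
from typing import Callable, Iterable, Iterator


def play(target: int, guesses: Iterable[str],
         report: Callable[[str], None] = print) -> int:
    """Run the guessing loop over `guesses`; return how many were consumed."""
    for attempts, raw in enumerate(guesses, start=1):
        try:
            guess = int(raw)
        except ValueError:
            # int() rejects non-numeric text; treat it as a wrong guess
            # instead of letting the exception end the game.
            report("Please enter a whole number.")
            continue
        if guess == target:
            report("You won!")
            return attempts
        report("Wrong guess, try again!")
    # Only reachable when a finite list of guesses never hits the target.
    raise ValueError("guesses ran out before the target was found")


def prompted_guesses(prompt: str) -> Iterator[str]:
    """Yield keyboard input forever, mirroring the original input() loop."""
    while True:
        yield input(prompt)
```

Calling play(random.randint(1, 10), prompted_guesses("Guess a number between 1 and 10: ")) after importing random reproduces the interactive game, while passing a plain list of strings lets the same logic run without a keyboard.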
Grok-1.5V's architecture is designed to efficiently integrate visual and textual data. The model uses a combination of transformer-based text processing and convolutional neural networks (CNNs) for image understanding. This hybrid approach allows Grok-1.5V to handle complex multimodal tasks with high accuracy.
Grok-1.5V will be available soon to early testers and existing Grok users, allowing them to explore its capabilities in real-world applications. The model's performance on
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.