Grok-1.5V: X.ai Unveils Its First Multimodal Model with Enhanced Vision Capabilities


The Engineer

15 Apr 2024

X.ai's Grok-1.5V integrates advanced vision capabilities with existing text-processing skills, enabling it to understand complex visual data and bridge the gap between digital and physical worlds.

April 12, 2024

X.ai has announced a preview of Grok-1.5V, its first-generation multimodal model, which extends Grok's capabilities to process and understand visual information alongside text. The release is aimed at bridging the digital and physical worlds by improving the model's understanding of real-world visual input.

Capabilities

Grok-1.5V builds on the text processing abilities of its predecessor by adding support for several types of visual input, including documents, diagrams, charts, screenshots, and photographs. This allows Grok to handle tasks that require both textual and visual reasoning, such as answering questions about a chart or turning a diagram into code.

Benchmark Performance

Grok-1.5V has been benchmarked against leading multimodal models in several domains. It leads on some benchmarks and trails on others:

MMMU (Multi-discipline): 53.6% accuracy, behind GPT-4V (56.8%), Claude 3 Opus (59.4%), and Gemini Pro 1.5 (58.5%).
Mathvista (Math): 52.8% accuracy, outperforming GPT-4V (49.9%) and Claude 3 Opus (50.5%).
AI2D (Diagrams): 88.3% accuracy, closely matching Claude 3 Sonnet's 88.7%.
TextVQA (Text reading): 78.1% accuracy, slightly ahead of GPT-4V (78.0%) and Gemini Pro 1.5 (73.5%).
ChartQA (Charts): 76.1% accuracy, trailing Claude 3 Sonnet (81.1%) and Gemini Pro 1.5 (81.3%).
DocVQA (Documents): 85.6% accuracy, below GPT-4V (88.4%) and Claude 3 Sonnet (89.5%).
RealWorldQA (Real-world understanding): 68.7% accuracy, well ahead of GPT-4V (61.4%) and Claude 3 Sonnet (51.9%), and narrowly ahead of Gemini Pro 1.5 (67.5%).
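
To make the comparisons above easier to read at a glance, the following sketch computes Grok-1.5V's margin over the best competitor score quoted for each benchmark. The scores are copied from the list above. The dictionary layout and the function name `margin_vs_best_listed` are illustrative choices made here, and competitors not quoted in this article are not included.

```python
# Scores (percent accuracy) as quoted in this article: Grok-1.5V first,
# then the competitor scores listed for the same benchmark.
SCORES = {
    "MMMU": (53.6, [56.8, 59.4, 58.5]),
    "Mathvista": (52.8, [49.9, 50.5]),
    "AI2D": (88.3, [88.7]),
    "TextVQA": (78.1, [78.0, 73.5]),
    "ChartQA": (76.1, [81.1, 81.3]),
    "DocVQA": (85.6, [88.4, 89.5]),
    "RealWorldQA": (68.7, [61.4, 51.9, 67.5]),
}


def margin_vs_best_listed(grok, others):
    """Return Grok-1.5V's lead (positive) or deficit (negative) in points."""
    return round(grok - max(others), 1)


# Hypothetical helper table: benchmark name -> margin in percentage points.
MARGINS = {name: margin_vs_best_listed(g, o) for name, (g, o) in SCORES.items()}

if __name__ == "__main__":
    for name, margin in MARGINS.items():
        status = "ahead of" if margin > 0 else "behind"
        print(f"{name:12s} {abs(margin):4.1f} points {status} the best listed competitor")
```

On these figures, Grok-1.5V leads the quoted competitors on Mathvista, TextVQA, and RealWorldQA, and trails the best quoted score on MMMU, AI2D, ChartQA, and DocVQA.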

Real-World Understanding

One of the standout features of Grok-1.5V is its enhanced real-world understanding, particularly in spatial reasoning. X.ai has introduced a new benchmark called RealWorldQA to evaluate this capability. The benchmark focuses on basic real-world spatial understanding tasks, such as interpreting diagrams and understanding physical layouts.

For example, consider the task of translating a flowchart into Python code:

User: Can you translate this into Python code?

Grok-1.5V: Certainly! The flowchart you’ve provided describes a simple guessing game where the computer generates a random number, and the user has to guess it. Here’s the Python code that represents the logic in the flowchart:

import random

def guess_number():
    target = random.randint(1, 10)
    while True:
        guess = int(input("Guess a number between 1 and 10: "))
        if guess == target:
            print("You won!")
            break
        else:
            print("Wrong guess, try again!")

if __name__ == "__main__":
    guess_number()

This code will run the guessing game as described in the flowchart, prompting the user to enter their guess until they correctly identify the random number.
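
One way to sanity-check generated code like this without typing guesses by hand is to pass in the input and random sources. The sketch below is a testable variant written for this article, not Grok-1.5V's output: the parameters `read_guess` and `rng`, the helper `scripted`, and the returned attempt count are all additions made here for illustration.

```python
import random


def guess_number(read_guess=input, rng=random):
    """Same loop as the game above, with the input function and RNG passed in.

    Returns the number of guesses taken (an addition for testing).
    """
    target = rng.randint(1, 10)
    attempts = 0
    while True:
        attempts += 1
        guess = int(read_guess("Guess a number between 1 and 10: "))
        if guess == target:
            print("You won!")
            return attempts
        print("Wrong guess, try again!")


def scripted(guesses):
    """Build a stand-in for input() that replays a fixed sequence of guesses."""
    remaining = iter(guesses)
    return lambda prompt: next(remaining)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the run is repeatable
    # Guessing every number from 1 to 10 in order must hit the target.
    taken = guess_number(scripted(range(1, 11)), rng)
    print(f"Solved in {taken} guesses")
```

Called with no arguments, `guess_number()` behaves like the interactive version. Like the original, it still raises `ValueError` if the player types something that is not an integer.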

Implementation Details

Grok-1.5V's architecture is described as integrating visual and textual data by combining transformer-based text processing with convolutional neural networks (CNNs) for image understanding. This hybrid approach is intended to support multimodal tasks that require reasoning over both kinds of input.

Availability

Grok-1.5V will be available soon to early testers and existing Grok users, allowing them to explore its capabilities in real-world applications. The model's performance on