ShowUI-2B: A Lightweight Vision-Language-Action Model for GUI Agents

The Engineer

28 Nov 2024

ShowUI-2B merges computer vision and natural language processing into a compact model that lets GUI agents understand and interact with digital interfaces.

ShowUI-2B is a lightweight vision-language-action (VLA) model designed to power graphical user interface (GUI) agents. This 2-billion-parameter model, developed by ShowLab and available on Hugging Face, combines computer vision and natural language processing so an agent can locate and act on GUI elements. It is a good fit whether you're building a chatbot that navigates web pages or an assistant that operates desktop applications.

Key Features

Lightweight: At 2 billion parameters, ShowUI-2B strikes a balance between performance and efficiency.
Vision-Language-Action Integration: It can process images, understand text, and generate actionable outputs like click coordinates.
Versatile Applications: Suitable for web automation, desktop application interaction, and more.
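
To make "actionable outputs like click coordinates" concrete, here is a minimal sketch of turning a model reply into a pixel position. It assumes the reply is a string such as "[0.73, 0.21]" holding relative coordinates between 0 and 1; the helper name to_pixel_point and the sample values are hypothetical, not part of ShowUI-2B.

```python
import ast


def to_pixel_point(output_text, width, height):
    """Parse a '[x, y]' string of relative coordinates and scale it to pixels.

    Assumption: the model reply is a Python-style list literal with two
    numbers in the range 0..1. This helper is illustrative only.
    """
    x_rel, y_rel = ast.literal_eval(output_text.strip())
    if not (0.0 <= x_rel <= 1.0 and 0.0 <= y_rel <= 1.0):
        raise ValueError(f"expected relative coordinates in 0..1, got {output_text!r}")
    # Multiply by the screenshot size to get a clickable pixel position.
    return round(x_rel * width), round(y_rel * height)


# Hypothetical reply for a 1920x1080 screenshot.
point = to_pixel_point("[0.73, 0.21]", width=1920, height=1080)
print(point)
```

The resulting tuple could then be handed to whatever automation layer performs the click; that part is outside the scope of this sketch.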

Technical Overview

Model Architecture

ShowUI-2B is built on Qwen2-VL and is loaded through the Transformers class Qwen2VLForConditionalGeneration, which extends a transformer language model to accept multi-modal inputs. The model processes both visual (image) and textual data, making it well-suited for tasks that require understanding and interacting with GUI elements.

Key Components

Qwen2VLForConditionalGeneration: This is the core model class responsible for generating outputs based on multi-modal inputs.
AutoTokenizer: Used to tokenize text inputs.
AutoProcessor: Handles the preprocessing of both image and text data, ensuring they are in a format suitable for the model.

Quick Start Guide

Step 1: Load the Model

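First, install the necessary dependencies (torch, transformers, qwen-vl-utils, Pillow, and requests; the display helper below also assumes an IPython or Jupyter environment), then load the ShowUI-2B model. Here’s how you can do it: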

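import ast
from io import BytesIO

import requests
import torch
from IPython.display import display  # notebook-only helper used by draw_point
from PIL import Image, ImageDraw
from qwen_vl_utils import process_vision_info
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor

def draw_point(image_input, point=None, radius=5):
    """Show an image, marking `point` (relative [x, y] in 0..1) with a red dot."""
    # Accept a URL, a local file path, or an already-loaded PIL image.
    if isinstance(image_input, str):
        if image_input.startswith('http'):
            image = Image.open(BytesIO(requests.get(image_input).content))
        else:
            image = Image.open(image_input)
    else:
        image = image_input

    if point:
        # Scale the relative coordinates to pixel positions.
        x, y = point[0] * image.width, point[1] * image.height
        ImageDraw.Draw(image).ellipse((x - radius, y - radius, x + radius, y + radius), fill='red')
    display(image)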

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "showlab/ShowUI-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

min_pixels = 256 * 28 * 28
max_pixels = 1344 * 28 * 28

processor = AutoProcessor.from_pretrained("showlab/ShowUI-2B", min_pixels=min_pixels, max_pixels=max_pixels)
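
The min_pixels and max_pixels values above are written as multiples of 28 * 28, which hints at how they bound the image size seen by the model. The sketch below is a rough estimate only: it assumes the Qwen2-VL convention that each 28×28 pixel area maps to about one visual token, and the helper estimate_visual_tokens is hypothetical. The real preprocessing also rounds each side to a multiple of 28 and preserves the aspect ratio, so actual counts may differ slightly.

```python
PATCH_AREA = 28 * 28  # assumed number of pixels covered by one visual token


def estimate_visual_tokens(width, height,
                           min_pixels=256 * PATCH_AREA,
                           max_pixels=1344 * PATCH_AREA):
    """Roughly estimate the visual tokens used for a width x height screenshot.

    The pixel count is clamped to [min_pixels, max_pixels], mimicking the
    resize budget, then divided by the assumed per-token pixel area.
    """
    pixels = width * height
    clamped = max(min_pixels, min(pixels, max_pixels))
    return clamped // PATCH_AREA


# A full-HD screenshot exceeds the budget, so it is capped at the maximum.
print(estimate_visual_tokens(1920, 1080))
# A tiny image is scaled up to the minimum budget.
print(estimate_visual_tokens(200, 200))
```

Under these assumptions, lowering max_pixels trades grounding detail for less memory and faster inference, which is the main reason to tune it.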

Step 2: UI Grounding

Next, you can use the model to perform UI grounding. This involves providing an image of a GUI and a query, and getting the coordinates of the element that matches the query.

img_url = 'examples/web_dbd7514b-9ca3-40cd-b09a-990f7b955da1.png'
query = "Nahant"

_SYSTEM = "Based on the screenshot of the page, I give a text description and you give its corresponding location. The coordinate represents a clickable location [x, y] for an element, which is a relative coordinate on the screenshot, scaled from 0 to 1."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": _SYSTEM},
            {"type": "image", "image": img_url, "min_pixels": min_pixels, "max_pixels": max_pixels},
            {"type": "text", "text": query}
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image