
ShowUI-2B merges computer vision and natural language processing into a compact model that lets GUI agents understand and interact with digital interfaces.
ShowUI-2B is a lightweight vision-language-action (VLA) model designed to power graphical user interface (GUI) agents. Developed by ShowLab and available on Hugging Face, this 2-billion-parameter model combines computer vision and natural language processing so that an agent can locate and act on GUI elements. It is a candidate whether you're building a chatbot that navigates web pages or an assistant that interacts with desktop applications.
ShowUI-2B is built on Qwen2-VL and loads through the Qwen2VLForConditionalGeneration class in Hugging Face Transformers. The model accepts both images and text, which suits it to tasks that involve understanding and interacting with GUI elements.
First, install the dependencies the code uses (torch, transformers, Pillow, requests, and qwen-vl-utils), then load the ShowUI-2B model:

import ast
from io import BytesIO

import requests
import torch
from IPython.display import display  # assumes a Jupyter/IPython session
from PIL import Image, ImageDraw
from qwen_vl_utils import process_vision_info
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor


def draw_point(image_input, point=None, radius=5):
    """Show an image, optionally marking a relative (x, y) point in red."""
    if isinstance(image_input, str):
        # Accept either an http(s) URL or a local file path.
        if image_input.startswith('http'):
            image = Image.open(BytesIO(requests.get(image_input).content))
        else:
            image = Image.open(image_input)
    else:
        image = image_input
    if point:
        # Convert relative coordinates (0 to 1) to pixel positions.
        x, y = point[0] * image.width, point[1] * image.height
        ImageDraw.Draw(image).ellipse((x - radius, y - radius, x + radius, y + radius), fill='red')
    display(image)


# Load the weights in bfloat16 and place them on the available device(s).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "showlab/ShowUI-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Lower and upper bounds on the resized image area, in pixels.
min_pixels = 256 * 28 * 28
max_pixels = 1344 * 28 * 28
processor = AutoProcessor.from_pretrained("showlab/ShowUI-2B", min_pixels=min_pixels, max_pixels=max_pixels)
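The min_pixels and max_pixels values are easier to read as a visual-token budget. Qwen2-VL maps roughly one visual token to each 28 x 28 pixel area, so the bounds above correspond to between 256 and 1344 visual tokens per screenshot. The sketch below estimates the resized dimensions and token count for a given screenshot size. The function name estimate_visual_tokens is hypothetical, and the rounding logic is a simplified approximation of the processor's resizing rule, not the library's own code.

```python
import math

PATCH = 28  # assumed: one visual token per 28x28 pixel area in Qwen2-VL
MIN_PIXELS = 256 * PATCH * PATCH   # 200704 pixels
MAX_PIXELS = 1344 * PATCH * PATCH  # 1053696 pixels


def estimate_visual_tokens(width, height, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS):
    """Return (resized_width, resized_height, token_count) for an image.

    Hypothetical helper: approximates how an image is rescaled so that its
    area falls between min_pixels and max_pixels with sides that are
    multiples of PATCH, keeping the aspect ratio roughly unchanged.
    """
    # Start by snapping each side to the nearest multiple of PATCH.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / scale / PATCH) * PATCH
        w = math.floor(width / scale / PATCH) * PATCH
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    tokens = (w // PATCH) * (h // PATCH)
    return w, h, tokens


if __name__ == "__main__":
    for size in [(1920, 1080), (800, 600), (100, 100)]:
        print(size, "->", estimate_visual_tokens(*size))
```

Raising max_pixels preserves more detail in dense screenshots at the cost of more tokens per image, and therefore more memory and latency.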
Next, you can use the model to perform UI grounding. This involves providing an image of a GUI and a query, and getting the coordinates of the element that matches the query.
img_url = 'examples/web_dbd7514b-9ca3-40cd-b09a-990f7b955da1.png'
query = "Nahant"
_SYSTEM = "Based on the screenshot of the page, I give a text description and you give its corresponding location. The coordinate represents a clickable location [x, y] for an element, which is a relative coordinate on the screenshot, scaled from 0 to 1."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": _SYSTEM},
            {"type": "image", "image": img_url, "min_pixels": min_pixels, "max_pixels": max_pixels},
            {"type": "text", "text": query}
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image
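The system prompt asks for a relative [x, y] coordinate between 0 and 1, so the decoded answer has to be parsed and scaled before an agent can click anything. The sketch below assumes the decoded output is a Python-style list literal such as "[0.5, 0.25]", which the ast import in the setup code suggests. The helper names parse_click_point and to_pixels and the sample output string are hypothetical, chosen for illustration only.

```python
import ast


def parse_click_point(output_text):
    """Parse text such as "[0.5, 0.25]" into an (x, y) tuple of floats.

    Hypothetical helper: assumes the model replies with a two-element
    list literal holding relative coordinates in the range 0 to 1.
    """
    point = ast.literal_eval(output_text.strip())
    if not (isinstance(point, (list, tuple)) and len(point) == 2):
        raise ValueError(f"expected [x, y], got {output_text!r}")
    x, y = float(point[0]), float(point[1])
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError(f"coordinates outside the 0 to 1 range: {point!r}")
    return x, y


def to_pixels(point, width, height):
    """Scale a relative (x, y) point to integer pixel coordinates."""
    x, y = point
    return round(x * width), round(y * height)


if __name__ == "__main__":
    example_output = "[0.5, 0.25]"  # made-up output, not a real model response
    point = parse_click_point(example_output)
    print(to_pixels(point, 1280, 720))
```

The relative point can be passed directly to draw_point for a visual check, while the pixel coordinates are what a mouse-automation layer would typically need.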
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
28 November 2024