ScreenSuite: A Comprehensive Evaluation Suite for GUI Agents

The Engineer

10 Jun 2025

ScreenSuite offers a robust framework for testing GUI agents, covering everything from navigation to data entry tasks. Discover how this new tool reshapes the evaluation landscape for vision language models.

Published June 6, 2025

Over the past few weeks, we've been working to make GUI agents more open, accessible, and easy to integrate. As part of this effort, we've built the most comprehensive benchmarking suite for evaluating these agents. Today, we're excited to introduce ScreenSuite, a tool for assessing Vision Language Models (VLMs) across a range of agentic capabilities.

What is a GUI Agent?

A GUI agent is an AI model that can interact with graphical user interfaces (GUIs). Think of it as a robot that can navigate and perform tasks on your desktop or mobile device. For example, you might ask the agent to "Fill the rest of this Excel column." It would use screen captures to understand the context, then execute actions such as click(x=130, y=540) to open a web browser, type("Value for XYZ in 2025") to enter text, or scroll(down=2) to read further.
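As a rough illustration of how action strings like the ones above could be turned into structured data before being executed, the sketch below parses them with Python's standard ast module. The Action class and the parse_action function are hypothetical names introduced here for illustration; they are not part of ScreenSuite or of any particular agent's API.

```python
import ast
from dataclasses import dataclass


@dataclass
class Action:
    """Hypothetical container for one parsed GUI action."""
    name: str      # e.g. "click", "type", "scroll"
    args: tuple    # positional arguments, e.g. ("some text",)
    kwargs: dict   # keyword arguments, e.g. {"x": 130, "y": 540}


def parse_action(text: str) -> Action:
    """Parse a string such as 'click(x=130, y=540)' without evaluating it."""
    node = ast.parse(text.strip(), mode="eval").body
    # Accept only a plain call like name(...); reject attribute access,
    # arithmetic, and anything else that is not a simple action call.
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple action call: {text!r}")
    if any(kw.arg is None for kw in node.keywords):
        raise ValueError("'**' unpacking is not supported")
    # literal_eval only accepts literals (numbers, strings, ...), so
    # model output is never executed as code.
    args = tuple(ast.literal_eval(a) for a in node.args)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return Action(node.func.id, args, kwargs)


if __name__ == "__main__":
    for raw in ('click(x=130, y=540)',
                'type("Value for XYZ in 2025")',
                'scroll(down=2)'):
        print(parse_action(raw))
```

Parsing rather than calling eval on the model's output is a deliberate choice in this sketch: text produced by a model should be treated as untrusted input.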

To see a GUI agent in action, try our Open Computer Agent, powered by the Qwen2.5-VL-72B model. A well-designed GUI agent can perform a wide range of tasks, from scrolling through Google Maps to editing files and making online purchases.

Introducing ScreenSuite 🥳

ScreenSuite is the most comprehensive evaluation suite for GUI agents. It provides a structured way to benchmark and compare different models across various capabilities. Here are some key features:

Diverse Tasks: ScreenSuite includes a wide range of tasks that test different aspects of an agent's performance, such as navigation, data entry, and interaction with complex interfaces.
Real-World Scenarios: The tasks are designed to mimic real-world use cases, ensuring that the evaluation is practical and relevant.
Automated Evaluation: ScreenSuite automates the evaluation process, making it easy to run tests and generate reports.
Open Source: The suite is open source, allowing researchers and developers to contribute and improve it.

Technical Details

ScreenSuite is built on a modular architecture that allows for flexibility and extensibility. Here are some of the key components:

Task Definitions: Each task is defined in a JSON format, making it easy to add new tasks or modify existing ones.
Environment Simulation: ScreenSuite uses a headless browser and virtual desktop environment to simulate real-world conditions.
Action Execution: The suite supports a variety of actions, including clicks, typing, scrolling, and more.
Performance Metrics: ScreenSuite reports metrics such as task completion time, accuracy, and robustness.
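To make the components above more concrete, here is a minimal sketch of what loading a JSON task definition and aggregating per-episode results might look like. Everything in it is an assumption made for illustration: the field names (id, instruction, max_steps), the EpisodeResult class, and the load_task and summarize functions are hypothetical and do not reflect ScreenSuite's actual schema or code.

```python
import json
from dataclasses import dataclass
from statistics import mean

# Hypothetical task definition; the field names are illustrative only.
TASK_JSON = """
{
  "id": "fill-column-001",
  "instruction": "Fill the rest of this Excel column",
  "max_steps": 15
}
"""

REQUIRED_FIELDS = {"id", "instruction", "max_steps"}


@dataclass
class EpisodeResult:
    """Outcome of one attempt at a task (hypothetical structure)."""
    task_id: str
    success: bool
    seconds: float


def load_task(raw: str) -> dict:
    """Parse a task definition and check that the assumed fields exist."""
    task = json.loads(raw)
    missing = REQUIRED_FIELDS - task.keys()
    if missing:
        raise ValueError(f"task definition is missing fields: {sorted(missing)}")
    return task


def summarize(results: list[EpisodeResult]) -> dict:
    """Aggregate accuracy and mean completion time over episodes."""
    if not results:
        raise ValueError("no results to summarize")
    return {
        "episodes": len(results),
        "accuracy": mean(1.0 if r.success else 0.0 for r in results),
        "mean_seconds": mean(r.seconds for r in results),
    }


if __name__ == "__main__":
    task = load_task(TASK_JSON)
    # Made-up outcomes, used only to exercise the code.
    results = [
        EpisodeResult(task["id"], True, 10.0),
        EpisodeResult(task["id"], True, 20.0),
        EpisodeResult(task["id"], False, 30.0),
        EpisodeResult(task["id"], True, 40.0),
    ]
    print(summarize(results))
```

The episode outcomes in the usage section are invented to demonstrate the aggregation and are not benchmark results. Robustness is left out because the text does not say how it is measured.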

Benchmarking Results

We've used ScreenSuite to evaluate several popular VLMs. Here are some preliminary results:

Qwen2.5-VL-72B: This model performed exceptionally well across all tasks, demonstrating high accuracy and efficiency.
Other Models: We also tested other models, such as CLIP and BLIP, which showed varying levels of performance.

Why It Matters

ScreenSuite is a significant step forward in the evaluation of GUI agents. By providing a standardized and comprehensive benchmarking suite, it helps researchers and developers:

Identify Strengths and Weaknesses: Understand where their models excel and where they need improvement.
Compare Models: Make informed decisions about which models to use for specific tasks.
Drive Innovation: Encourage the development of better GUI agents by providing a clear evaluation framework.

Conclusion

ScreenSuite is now available on GitHub, and we encourage you to try it out and contribute to its development. Whether you're a researcher evaluating your latest model or a developer building a new application, ScreenSuite gives you a common way to measure how well a GUI agent performs.