
ScreenSuite offers a robust framework for testing GUI agents, covering everything from navigation to data entry tasks. Discover how this new tool reshapes the evaluation landscape for vision language models.
Published June 6, 2025
Over the past few weeks, we've been working hard to make GUI agents more open, accessible, and easy to integrate. As part of this effort, we've created the most comprehensive benchmarking suite for evaluating the performance of these agents. Today, we're excited to introduce ScreenSuite, a powerful tool that makes it easier than ever to assess Vision Language Models (VLMs) across various agentic capabilities.
A GUI agent is an AI model that can interact with graphical user interfaces (GUIs). Think of it as a robot that can navigate and perform tasks on your desktop or mobile device. For example, you might ask the agent to "Fill the rest of this Excel column," and it would use screen captures to understand the context and execute actions like click(x=130, y=540) to open a web browser, type("Value for XYZ in 2025"), or scroll(down=2) to read further.
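To make the action format above concrete, here is a minimal sketch of how a harness might turn the text a model emits, such as click(x=130, y=540), into a structured record before executing it. The names Action and parse_action are hypothetical and chosen for illustration; this is not ScreenSuite's API, only one way to handle action strings of the shape quoted above.

```python
import ast
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Action:
    """Hypothetical record for one GUI action: a name plus its arguments."""
    name: str
    args: list[Any] = field(default_factory=list)
    kwargs: dict[str, Any] = field(default_factory=dict)


def parse_action(text: str) -> Action:
    """Parse a string such as 'click(x=130, y=540)' without executing it."""
    tree = ast.parse(text.strip(), mode="eval")
    call = tree.body
    # Accept only a plain call like name(...); reject attribute access etc.
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not a simple action call: {text!r}")
    # literal_eval accepts only literals, so arbitrary expressions are rejected.
    args = [ast.literal_eval(arg) for arg in call.args]
    kwargs: dict[str, Any] = {}
    for kw in call.keywords:
        if kw.arg is None:
            raise ValueError("**kwargs expansion is not supported")
        kwargs[kw.arg] = ast.literal_eval(kw.value)
    return Action(name=call.func.id, args=args, kwargs=kwargs)


if __name__ == "__main__":
    for raw in ('click(x=130, y=540)',
                'type("Value for XYZ in 2025")',
                'scroll(down=2)'):
        print(parse_action(raw))
```

Parsing with the ast module instead of eval means the model's output is never run as code: only a bare function name with literal arguments is accepted, and anything else raises ValueError.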
To see a GUI agent in action, try our Open Computer Agent, powered by the Qwen2.5-VL-72B model. A well-designed GUI agent can perform a wide range of tasks, from scrolling through Google Maps to editing files and making online purchases.
ScreenSuite provides a structured way to benchmark and compare different models across a range of agentic capabilities. Here are some key features:
ScreenSuite is built on a modular architecture that allows for flexibility and extensibility. Here are some of the key components:

We've used ScreenSuite to evaluate several popular VLMs. Here are some preliminary results:
ScreenSuite is a significant step forward in the evaluation of GUI agents. By providing a standardized and comprehensive benchmarking suite, it helps researchers and developers:
ScreenSuite is now available on GitHub, and we encourage you to try it out and contribute to its development. Whether you're a researcher looking to evaluate your latest model or a developer building a new application, ScreenSuite can help you achieve your goals more effectively.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.