
AssistantBench pushes web agents beyond simple queries with complex, real-world challenges across hundreds of websites, testing their true capabilities in practical scenarios.
AssistantBench, a new benchmark introduced by researchers from Tel Aviv University, the University of Pennsylvania, Allen Institute for AI, the University of Washington, and Princeton University, evaluates web agents' ability to solve realistic and time-consuming tasks. The benchmark includes 214 tasks spanning multiple domains across 525 pages from 258 different websites. This article delves into the technical details and performance metrics of AssistantBench.
AssistantBench introduces a more challenging and diverse set of tasks for web agents, requiring them to navigate complex web environments, gather information, and execute multi-step plans. Unlike previous benchmarks that focus on simpler or synthetic tasks, AssistantBench emphasizes real-world scenarios, making it a significant step forward in evaluating the capabilities of AI models.
To tackle the challenges posed by AssistantBench, the researchers introduced a new web agent called SeePlanAct (SPA). SPA builds upon the existing SeeAct framework by adding specialized planning and memory components. These enhancements enable SPA to better handle multi-step tasks and maintain context across different steps.
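To make the idea of planning and memory components concrete, here is a minimal sketch of an agent loop that carries a plan and a memory buffer across steps. It is an illustration only, not the researchers' implementation: `AgentState`, `run_agent`, `toy_policy`, and `toy_env` are hypothetical names invented for this example, and the toy policy stands in for whatever model would drive a real agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AgentState:
    """Hypothetical state carried between steps (not SPA's actual data model)."""
    task: str
    plan: str = ""
    memory: List[str] = field(default_factory=list)


# A policy maps (state, observation) to (updated plan, note to remember, next action).
Policy = Callable[[AgentState, str], Tuple[str, str, str]]
# An environment maps an action to the next observation.
Environment = Callable[[str], str]


def run_agent(task: str, policy: Policy, env: Environment,
              first_observation: str, max_steps: int = 10) -> AgentState:
    state = AgentState(task=task)
    observation = first_observation
    for _ in range(max_steps):
        plan, note, action = policy(state, observation)
        state.plan = plan              # planning: the plan is revised at every step
        if note:
            state.memory.append(note)  # memory: keep findings for later steps
        if action == "stop":
            break
        observation = env(action)
    return state


def toy_policy(state: AgentState, observation: str) -> Tuple[str, str, str]:
    # Assumed stand-in for a model call: stop once a price has been seen.
    if "price" in observation:
        return "report the price", observation, "stop"
    return "open the listing page", "", "goto listing"


def toy_env(action: str) -> str:
    # Assumed stand-in for a browser: one page carries the answer.
    return "price: 42" if action == "goto listing" else "home page"


final = run_agent("find the price", toy_policy, toy_env, "home page")
print(final.plan, final.memory)
```

The point of the sketch is the shape of the loop: because the plan and the memory live outside any single step, information found on one page is still available when the agent acts on the next.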
Despite the improvements, even the best-performing model, SPA (closed-book), achieves only 25.2% accuracy on the AssistantBench test set. This highlights the difficulty of the tasks and the need for further advancements in web agent capabilities.

[Figure: AssistantBench results by system. Systems shown: SPA (Closed-book); SeeAct (Closed-book); Closed-book LM (1-shot); Retrieval-augmented LM (1-shot) → CB; Retrieval-augmented LM (0-shot) → CB; Closed-book LM (0-shot); Retrieval-augmented LM (0-shot); SPA; Retrieval-augmented LM (1-shot).]
Original Sources
↗ https://assistantbench.github.io/?utm_source=tldrai
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
23 July 2024