AssistantBench: Evaluating Web Agents on Realistic and Time-Consuming Tasks

The Engineer

23 Jul 2024 · 3 min read

AssistantBench pushes web agents beyond simple queries with complex, real-world challenges across hundreds of websites, testing their true capabilities in practical scenarios.

AssistantBench, a new benchmark introduced by researchers from Tel Aviv University, the University of Pennsylvania, the Allen Institute for AI, the University of Washington, and Princeton University, evaluates web agents' ability to solve realistic and time-consuming tasks. The benchmark includes 214 tasks spanning multiple domains across 525 pages from 258 different websites. This article covers the benchmark's design, the SeePlanAct agent introduced alongside it, and the reported results.

What Changed Technically

AssistantBench introduces a more challenging and diverse set of tasks for web agents, requiring them to navigate complex web environments, gather information, and execute multi-step plans. Unlike previous benchmarks that focus on simpler or synthetic tasks, AssistantBench emphasizes real-world scenarios, so its scores say more about how agents would cope with the tasks people actually bring to the web.

Task Complexity: Tasks in AssistantBench are designed to be realistic and time-consuming, often requiring multiple steps and decision-making.
Diverse Domains: The benchmark covers a wide range of domains, so web agents are tested on many types of tasks across different websites rather than on a single site or task family.
Real-World Data: The dataset is derived from actual user experiences, making it more relevant and challenging for AI models.

SeePlanAct (SPA) Agent

To tackle the challenges posed by AssistantBench, the researchers introduced a new web agent called SeePlanAct (SPA). SPA builds upon the existing SeeAct framework by adding specialized planning and memory components. These enhancements enable SPA to better handle multi-step tasks and maintain context across different steps.

Planning Component: The planning component helps SPA break down complex tasks into manageable sub-tasks and execute them in a coherent sequence.
Memory Component: The memory component allows SPA to retain information gathered during task execution, so it can refer back to earlier steps as needed.
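
The interplay of the two components can be illustrated with a toy loop. This is a minimal sketch, not SPA's implementation: the names `AgentState`, `run_agent`, `toy_plan`, and `toy_act` are hypothetical, and the real agent would drive a browser with a language model instead of returning canned strings. It only shows a plan being consumed one sub-task at a time while each observation is stored in memory and passed to the next step.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentState:
    """Hypothetical container for the two components described above."""
    plan: List[str] = field(default_factory=list)    # remaining sub-tasks
    memory: List[str] = field(default_factory=list)  # observations kept across steps


def run_agent(
    task: str,
    make_plan: Callable[[str], List[str]],
    act: Callable[[str, List[str]], str],
    max_steps: int = 10,
) -> AgentState:
    """Execute sub-tasks in order, storing each observation in memory."""
    state = AgentState(plan=make_plan(task))
    for _ in range(max_steps):
        if not state.plan:
            break  # every sub-task has been executed
        sub_task = state.plan.pop(0)
        # The actor sees everything remembered so far, so later steps
        # can build on earlier ones.
        observation = act(sub_task, state.memory)
        state.memory.append(observation)
    return state


# Toy stand-ins (assumptions for illustration only).
def toy_plan(task: str) -> List[str]:
    return [f"search: {task}", "open first result", "extract answer"]


def toy_act(sub_task: str, memory: List[str]) -> str:
    return f"done[{len(memory)}] {sub_task}"


final_state = run_agent("gym opening hours", toy_plan, toy_act)
for entry in final_state.memory:
    print(entry)
```

Keeping the plan and the memory as separate pieces of state mirrors the description above: the plan says what remains to be done, while the memory records what has already been learned.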

Performance on AssistantBench

Despite these improvements, even the best-performing system, SPA (Closed-book), reaches only 25.2% accuracy on the AssistantBench test set. This underlines how difficult the tasks are and how far web agents still have to go.
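
One way to read the metrics reported in this section is sketched below. It assumes that answer rate is the share of tasks a model attempts, that precision is the average score on attempted tasks, and that unanswered tasks score zero. Under those assumptions, accuracy should be roughly answer rate times precision. The article does not state this definition, so treat it as a plausibility check rather than the benchmark's scoring code. The figures are the ones reported in this section, and `implied_accuracy` is a hypothetical helper.

```python
# (answer rate %, precision %, reported accuracy %), as listed in this section
reported = {
    "SPA (Closed-book)": (91.7, 27.5, 25.2),
    "SeeAct (Closed-book)": (89.5, 26.1, 23.4),
    "Closed-book LM (1-shot)": (89.5, 24.8, 22.2),
    "Closed-book LM (0-shot)": (53.6, 30.7, 16.5),
    "SPA": (35.9, 30.9, 11.1),
}


def implied_accuracy(answer_rate_pct: float, precision_pct: float) -> float:
    """Accuracy implied by the assumption that abstentions score zero."""
    return answer_rate_pct * precision_pct / 100.0


for name, (answer_rate, precision, accuracy) in reported.items():
    estimate = implied_accuracy(answer_rate, precision)
    print(f"{name:24s} implied {estimate:5.1f}%   reported {accuracy:5.1f}%")
```

On these rows the implied and reported values agree to within rounding. Read this way, SPA on its own has the highest precision of the rows shown but answers only 35.9% of tasks, which is why its accuracy is far lower than that of the closed-book variant with a 91.7% answer rate.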

SPA (Closed-book):
- Accuracy: 25.2%
- Answer rate: 91.7%
- Precision: 27.5%
- Exact match: 9.9%
SeeAct (Closed-book):
- Accuracy: 23.4%
- Answer rate: 89.5%
- Precision: 26.1%
- Exact match: 9.4%
Closed-book LM (1-shot):
- Accuracy: 22.2%
- Answer rate: 89.5%
- Precision: 24.8%
- Exact match: 8.3%
Retrieval-augmented LM (1-shot) → CB:
- Accuracy: 19.5%
- Answer rate: 92.8%
- Precision: 21.0%
- Exact match: 6.1%
Retrieval-augmented LM (0-shot) → CB:
- Accuracy: 18.7%
- Answer rate: 93.9%
- Precision: 19.9%
- Exact match: 6.6%
Closed-book LM (0-shot):
- Accuracy: 16.5%
- Answer rate: 53.6%
- Precision: 30.7%
- Exact match: 6.1%
Retrieval-augmented LM (0-shot):
- Accuracy: 11.8%
- Answer rate: 60.2%
- Precision: 19.5%
- Exact match: 5.5%
SPA:
- Accuracy: 11.1%
- Answer rate: 35.9%
- Precision: 30.9%
- Exact match: 5.5%
Retrieval-augmented LM (1-shot):
- Accuracy: 10.7%
- Answer rate: 4