
VAKRA challenges AI agents with complex tasks requiring intricate reasoning and tool use, offering a realistic testbed that goes beyond simple skill assessments to measure success in multi-step workflows.
IBM Research recently introduced VAKRA, a tool-grounded, executable benchmark designed to evaluate how well AI agents reason and act in enterprise-like environments. Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.
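To make the idea of trace-based scoring concrete, here is a minimal sketch of how a full execution trace might be checked. It is not VAKRA's actual format or scoring code: the `ToolCall` and `Trace` types, the `task_succeeded` function, the tool names, and the pass criteria are all hypothetical, chosen only to illustrate why a trace carries more information than a final answer alone.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One step in a hypothetical execution trace (not VAKRA's real schema)."""
    tool: str
    args: dict[str, Any]
    ok: bool  # whether the call executed without error


@dataclass
class Trace:
    """Hypothetical record of everything the agent did for one task."""
    calls: list[ToolCall] = field(default_factory=list)
    final_answer: str = ""


def task_succeeded(trace: Trace, required_tools: list[str], expected_answer: str) -> bool:
    """Assumed pass criteria: every call ran cleanly, the required tools
    appear in order somewhere in the trace, and the final answer matches."""
    if not all(call.ok for call in trace.calls):
        return False
    used = iter(call.tool for call in trace.calls)
    # Subsequence check: consuming the iterator enforces ordering.
    in_order = all(any(tool == name for tool in used) for name in required_tools)
    answers_match = trace.final_answer.strip().lower() == expected_answer.strip().lower()
    return in_order and answers_match


# Illustrative two-step workflow with made-up tool names.
trace = Trace(
    calls=[
        ToolCall("search_orders", {"customer": "acme"}, ok=True),
        ToolCall("get_invoice", {"order_id": 7}, ok=True),
    ],
    final_answer="Paid",
)
print(task_succeeded(trace, ["search_orders", "get_invoice"], "paid"))
```

Under these assumed criteria, an agent that reaches the right answer after a failed call, or that calls the tools out of order, does not pass. Final-answer-only scoring cannot make that distinction.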
VAKRA stands out by providing an executable environment where agents interact with more than 8,000 locally hosted APIs backed by real databases spanning 62 domains. This setup is crucial because it:
VAKRA tasks are designed to be complex and multi-faceted. Here’s a breakdown:
For example, an agent might be asked to:

Early results show that models perform poorly on VAKRA, highlighting several key areas where improvement is needed:
VAKRA is a critical step forward in benchmarking AI agents for real-world applications. By simulating complex, multi-step workflows, it helps researchers:
If you’re interested in contributing to or participating in the VAKRA benchmark, here are some resources:
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
16 April 2026