Speedrunning RL Environments with AgentDojo and the Verifiers Framework

Models & Research

The Engineer

28 Oct 2025 · 3 min read

Explore how reinforcement learning environments challenge large language models and discover the `verifiers` framework and AgentDojo, tools that streamline the creation and evaluation of these critical test scenarios.

Over the past few weeks, I've been diving deep into contributing to Prime Intellect’s Environment Hub, focusing on reinforcement learning (RL) environments. These setups are crucial for training and evaluating large language models (LLMs), offering a structured way to test their capabilities in various scenarios. In this article, we’ll speedrun through what RL environments are, introduce the verifiers framework, and walk through creating an environment for the AgentDojo benchmark.

What Are RL Environments?

RL environments are essentially complex obstacle courses designed for LLMs. They provide a structured setting where models can interact, receive feedback (rewards), and learn to perform tasks more effectively. Think of these environments as intricate mazes for LLMs: if the model navigates the maze successfully, it receives a reward, which reinforces its learning.

Key components of an RL environment include:

State: The current situation or context in the environment.
Action: What the LLM decides to do based on the state.
Reward: Feedback given to the LLM for its actions, guiding it towards better performance.

A rollout is a sequence of states, actions, and rewards generated as the LLM interacts with the environment. This process helps the model learn how to solve tasks in a general manner, much like conditioning through positive reinforcement.

Introducing the `verifiers` Framework

The verifiers framework is a powerful tool for building and evaluating RL environments. It provides essential primitives and hooks that make it easy to set up and manage your environment. One of its key advantages is the ability to convert any existing benchmark into an RL environment, streamlining the process of training and evaluation.

Key features of the verifiers framework include:

Dataset Format: Standardizes how data is structured for consistency.
Multi-Turn Interactions: Supports multi-step conversations between the LLM and the environment.
Tool Use Functionality: Allows the LLM to use tools or resources within the environment.
Calculating Rewards: Provides methods to determine and assign rewards based on performance.
Resource Management: Handles setup and teardown of resources like sandboxes, VMs, etc.

To create an RL environment using verifiers, you typically override one of two base classes:

vf.SingleTurnEnv: For single Q&A pairs.
vf.MultiTurnEnv: For multi-turn conversations, with hooks for generating responses and handling complex interactions.

Creating an Environment for AgentDojo

AgentDojo is a benchmark designed to test the capabilities of LLMs in various scenarios. To create an RL environment for AgentDojo using verifiers, follow these steps:

Define the Dataset: Structure your data according to the verifiers format. This includes defining states, actions, and expected rewards.
Implement Multi-Turn Interactions: Use the vf.MultiTurnEnv class to handle multi-step conversations. Implement the env_response hook to generate responses from the environment based on the LLM's actions.
Tool Use Functionality: If your benchmark involves using tools or resources, implement the necessary functions to allow the LLM to interact with these tools.
Calculate Rewards: Define a reward function that evaluates the LLM's performance and provides appropriate feedback.
Resource Management: Ensure proper setup and teardown of any required resources, such as sandboxes or virtual machines.

Here’s a simplified example of how you might implement an AgentDojo environment:

from verifiers import vf

class AgentDojoEnv(vf.MultiTurnEnv):
    def __init__(self, dataset):
        super().__init__(dataset)
        self.current_state = None

    def reset(self):
        self.current_state = self.dataset.get_initial_state()
        return self.current_state

    def step(self, action):
        next_state, reward, done = self.dataset.apply_action(self.current_state, action)
        self.current_state = next_state
        return next_state, reward, done

    def env_response(self, state, action):
        # Generate a response based on the current state and action
        response = self.dataset.generate_response(state, action)
        return response

Conclusion

RL environments are essential for training and evaluating LLMs, providing a structured

Speedrunning RL Environments with AgentDojo and the Verifiers Framework

What Are RL Environments?

Introducing the verifiers Framework

Creating an Environment for AgentDojo

Conclusion

Introducing the `verifiers` Framework