
Explore how reinforcement learning environments challenge large language models and discover the `verifiers` framework and AgentDojo, tools that streamline the creation and evaluation of these critical test scenarios.
Over the past few weeks, I've been diving deep into contributing to Prime Intellect’s Environment Hub, focusing on reinforcement learning (RL) environments. These setups are crucial for training and evaluating large language models (LLMs), offering a structured way to test their capabilities in various scenarios. In this article, we’ll speedrun through what RL environments are, introduce the verifiers framework, and walk through creating an environment for the AgentDojo benchmark.
RL environments are essentially complex obstacle courses designed for LLMs. They provide a structured setting where models can interact, receive feedback (rewards), and learn to perform tasks more effectively. Think of these environments as intricate mazes for LLMs: if the model navigates the maze successfully, it receives a reward, which reinforces its learning.
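To make the reward idea concrete, here is a minimal sketch of a single-turn task in plain Python. It does not use verifiers; the `Task` class, the `reward` function, and the `fake_model` stand-in are all hypothetical names chosen for illustration, and the exact-match scoring rule is an assumption rather than how any particular benchmark scores answers.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One prompt with its expected answer (hypothetical structure)."""
    prompt: str
    answer: str


def reward(task: Task, completion: str) -> float:
    """Return 1.0 if the last word of the completion matches the answer, else 0.0."""
    tokens = completion.strip().split()
    if not tokens:
        return 0.0
    # Ignore a trailing period so "4." still counts as "4".
    return 1.0 if tokens[-1].strip(".") == task.answer else 0.0


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call; always returns the same text."""
    return "The answer is 4."


task = Task(prompt="What is 2 + 2?", answer="4")
completion = fake_model(task.prompt)
score = reward(task, completion)
```

The environment's job reduces to two things here: supplying the prompt and turning the model's output into a number the training loop can use.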
Key components of an RL environment include:
A rollout is a sequence of states, actions, and rewards generated as the LLM interacts with the environment. This process helps the model learn how to solve tasks in a general manner, much like conditioning through positive reinforcement.
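A rollout can be sketched as a list of recorded steps. The example below is a toy in plain Python, not verifiers code: `Step`, `env_step`, and `collect_rollout` are hypothetical names, and the counting task (move a counter until it reaches a target) is an assumption made only to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One recorded transition in a rollout."""
    state: int
    action: int
    reward: float


def env_step(state: int, action: int, target: int = 3) -> Tuple[int, float, bool]:
    """Toy task: add the action to a counter; reward 1.0 when it hits the target."""
    next_state = state + action
    done = next_state == target
    return next_state, (1.0 if done else 0.0), done


def collect_rollout(policy: Callable[[int], int], start: int = 0,
                    max_steps: int = 10) -> List[Step]:
    """Run the policy until the task ends or max_steps is reached, recording each step."""
    state = start
    steps: List[Step] = []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env_step(state, action)
        steps.append(Step(state, action, reward))
        state = next_state
        if done:
            break
    return steps


# A trivial policy that always increments by one.
rollout = collect_rollout(lambda state: 1)
total_reward = sum(step.reward for step in rollout)
```

In an LLM setting the states would be conversation histories and the actions would be model messages, but the recorded structure is the same.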
verifiers Framework
The verifiers framework is a powerful tool for building and evaluating RL environments. It provides essential primitives and hooks that make it easy to set up and manage your environment. One of its key advantages is the ability to convert any existing benchmark into an RL environment, streamlining the process of training and evaluation.
Key features of the verifiers framework include:
To create an RL environment using verifiers, you typically subclass one of two base classes:
vf.SingleTurnEnv: For single Q&A pairs.
vf.MultiTurnEnv: For multi-turn conversations, with hooks for generating responses and handling complex interactions.
AgentDojo is a benchmark designed to test the capabilities of LLMs in various scenarios. To create an RL environment for AgentDojo using verifiers, follow these steps:

Define the Dataset: Structure your data according to the verifiers format. This includes defining states, actions, and expected rewards.
Implement Multi-Turn Interactions: Use the vf.MultiTurnEnv class to handle multi-step conversations. Implement the env_response hook to generate responses from the environment based on the LLM's actions.
Tool Use Functionality: If your benchmark involves using tools or resources, implement the necessary functions to allow the LLM to interact with these tools.
Calculate Rewards: Define a reward function that evaluates the LLM's performance and provides appropriate feedback.
Resource Management: Ensure proper setup and teardown of any required resources, such as sandboxes or virtual machines.
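The tool-use, reward, and resource-management steps can be sketched in plain Python, independent of verifiers and of AgentDojo's real tool suite. Everything named here (`get_balance`, `TOOLS`, `call_tool`, `task_reward`, `sandbox`) is a hypothetical placeholder, and the substring-match reward is an assumption chosen for brevity.

```python
from contextlib import contextmanager
from typing import Callable, Dict, List


def get_balance(account: str) -> str:
    """Hypothetical tool the model is allowed to call."""
    return f"{account}: 100"


# Registry mapping tool names to Python callables.
TOOLS: Dict[str, Callable[..., str]] = {"get_balance": get_balance}


def call_tool(name: str, **kwargs) -> str:
    """Dispatch a tool call; unknown tools yield an error string instead of raising."""
    tool = TOOLS.get(name)
    if tool is None:
        return f"error: unknown tool '{name}'"
    return tool(**kwargs)


def task_reward(tool_outputs: List[str], expected: str) -> float:
    """Return 1.0 if any tool output contains the expected text, else 0.0."""
    return 1.0 if any(expected in out for out in tool_outputs) else 0.0


@contextmanager
def sandbox():
    """Stand-in for sandbox setup and teardown; cleanup runs even if the body raises."""
    resources = {"open": True}
    try:
        yield resources
    finally:
        resources["open"] = False


with sandbox() as box:
    outputs = [call_tool("get_balance", account="alice")]
    score = task_reward(outputs, expected="alice: 100")
```

Returning an error string for an unknown tool, rather than raising, lets the model see its mistake on the next turn and try again. The `try`/`finally` in `sandbox` is what guarantees teardown.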
Here’s a simplified example of how you might implement an AgentDojo environment:
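    import verifiers as vf


    class AgentDojoEnv(vf.MultiTurnEnv):
        # Simplified illustration. get_initial_state, apply_action, and
        # generate_response are placeholder dataset helpers, not verifiers APIs.
        # A real subclass must also match the hook signatures of the
        # installed verifiers version.

        def __init__(self, dataset):
            super().__init__(dataset=dataset)
            self.current_state = None

        def reset(self):
            # Start a new episode from the dataset's initial state.
            self.current_state = self.dataset.get_initial_state()
            return self.current_state

        def step(self, action):
            # Apply the model's action and advance the episode.
            next_state, reward, done = self.dataset.apply_action(self.current_state, action)
            self.current_state = next_state
            return next_state, reward, done

        def env_response(self, state, action):
            # Generate a response based on the current state and action.
            response = self.dataset.generate_response(state, action)
            return response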
RL environments are essential for training and evaluating LLMs, providing a structured
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
28 October 2025