
Six-figure buyout offers make building reinforcement learning environments tempting, but the reality is fraught with technical challenges and stiff competition from established players. Think carefully before diving in.
The first person who sold a reinforcement learning (RL) environment to a frontier AI lab probably felt like they stumbled upon an infinite money glitch. Today, it's no secret that these labs are willing to pay hundreds of thousands, and sometimes even millions, for clones of popular tools like Linear and Salesforce. If you're reading this, you've likely considered quitting your day job to start a company that builds these highly lucrative Next.js apps. In this post, I'll argue why you should think twice before jumping on the bandwagon.
For those unfamiliar, an RL environment is essentially a sandbox where AI models like Claude and GPT can learn from interaction. It maintains an internal state, prompts the AI to take actions to complete tasks, and assigns scores based on outcomes. These environments often mimic popular websites or enterprise software tools, such as DoorDash, Linear, or Amazon, teaching AIs to navigate these interfaces. They can also be text-only, like the TextArena project, which trains AIs to play games like Set and Blackjack.
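The three pieces described above (internal state, actions, outcome-based scoring) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any lab or vendor builds environments: the `TicketEnv` class, its two text commands, and the single "close ticket 2" task are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TicketEnv:
    """Toy text-only environment: a hypothetical issue tracker with one task."""

    target_id: int = 2  # hypothetical task: close this ticket and no other
    max_steps: int = 5
    tickets: dict = field(default_factory=dict)
    steps: int = 0

    def reset(self) -> str:
        """Restore the initial state and return the task prompt."""
        # Internal state: ticket id -> status.
        self.tickets = {1: "open", 2: "open", 3: "open"}
        self.steps = 0
        return f"Task: close ticket {self.target_id}. Commands: 'list', 'close <id>'."

    def step(self, action: str):
        """Apply one text action; return (observation, reward, done).

        Call reset() before the first step so the state is populated.
        """
        self.steps += 1
        parts = action.strip().split()
        if parts == ["list"]:
            obs = ", ".join(f"#{i}: {s}" for i, s in sorted(self.tickets.items()))
        elif (
            len(parts) == 2
            and parts[0] == "close"
            and parts[1].isdecimal()
            and int(parts[1]) in self.tickets
        ):
            self.tickets[int(parts[1])] = "closed"
            obs = f"Ticket {parts[1]} closed."
        else:
            obs = "Unknown command."
        done = self.tickets[self.target_id] == "closed" or self.steps >= self.max_steps
        # The score is assigned only when the episode ends.
        reward = self.score() if done else 0.0
        return obs, reward, done

    def score(self) -> float:
        """Outcome-based score: 1.0 only if the target is the sole closed ticket."""
        closed = {i for i, s in self.tickets.items() if s == "closed"}
        return 1.0 if closed == {self.target_id} else 0.0
```

A model interacting with this sketch would receive the prompt from `reset()`, send strings such as `"list"` or `"close 2"` to `step()`, and get a score once the episode ends. Closing the wrong ticket along the way yields 0.0, which is the sense in which the score depends on the outcome rather than on the actions taken.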
RL environments have gained prominence due to a new paradigm in LLM post-training that teaches models new skills based on verifiable rewards. OpenAI realized in 2023 that by asking a model to "think" before solving a math problem and reinforcing the correct thought processes, the model could become highly proficient at math. Since then, labs have been racing to generalize this approach, teaching AIs to use computers, conduct online research, and tackle long-horizon coding tasks.
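To make "verifiable rewards" concrete, here is a minimal sketch of a reward function for a math problem. It assumes a convention in which the model ends its output with a line such as `Answer: 42`; that convention and the function name `verifiable_reward` are illustrative assumptions, not a description of any lab's actual pipeline. Note that the sketch checks only the final answer, not the reasoning that precedes it.

```python
import re


def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the completion's final answer matches the reference, else 0.0."""
    # Assumed convention: the final answer appears on a line like "Answer: 42".
    matches = re.findall(r"Answer:\s*(.+)", completion)
    if not matches:
        return 0.0  # no parseable answer, so nothing to verify
    # Use the last match so answers quoted mid-reasoning are ignored.
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0


completion = "17 + 25: first 17 + 20 = 37, then 37 + 5 = 42.\nAnswer: 42"
reward = verifiable_reward(completion, "42")
```

Because the check is mechanical, it can be run on a very large number of model attempts without a human grader, which is what makes this kind of reward practical for training.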
This has led to a surge of startups whose sole purpose is to provide these labs with increasingly complex and challenging environments. The prevailing theory is that the machine learning algorithms are already advanced; the remaining challenge is to offer models more realistic "simulations" so they don't exhibit strange behaviors when deployed in real-world scenarios and tasked with economically valuable work.
If you're a solo researcher or have a job that gives you plenty of free time, there's virtually no downside to creating an RL environment and selling it to labs like OpenAI or Anthropic (or even Amazon and Meta). You might just hit the jackpot and retire early. Some individuals have made millions by developing and selling these environments.

However, if you're aiming to build a generational business that stands the test of time, I would advise against diving into the RL environment market. Here’s why:
Before RL became the dominant post-training paradigm, supervised fine-tuning was the norm. This involved showing models labeled examples and training them to predict the next word. It was simpler but less effective. As AI labs shifted towards RL, crowdworkers who were once in high demand for producing labeled data saw their roles diminish.
This shift highlights the volatility of the AI industry. What seems like a goldmine today could quickly become a dead-end tomorrow. If you're considering building an RL environment startup, weigh the potential short-term gains against the long-term risks and uncertainties.
While the allure of quick riches from building RL environments is tempting, it's crucial to consider the broader context. For solo researchers or those with flexible schedules, it might be a worthwhile gamble. However, for entrepreneurs looking to build sustainable businesses, there are better opportunities in the ever-evolving landscape of AI.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
9 September 2025