Beating Pokémon Red with a Lightweight RL Agent: An Open Source Milestone

The Engineer

6 Mar 2025 · 3 min read

This result shows how efficient modern reinforcement learning can be: a policy with fewer than 10 million parameters can finish a complex game without significant shortcuts or modifications.

Since 2020, our team has been working on a reinforcement learning (RL) agent capable of beating the classic 1996 game Pokémon Red. As of February 2025, we've achieved this goal using a policy with fewer than 10 million parameters (over 60,500 times smaller than DeepSeekV3) and with minimal simplifications to the game. This article covers the technical details and the significance of our approach.

What Changed?

Policy Size: We managed to train an RL agent with fewer than 10 million parameters.
Minimal Simplifications: The environment was kept as close to the original game as possible, without significant modifications.
Open Source Availability: All code is open-sourced and available on GitHub.

Why Pokémon Red?

Pokémon Red, a single-player JRPG, challenges players to become the "champion" by capturing and battling Pokémon. Here’s why this game is an excellent testbed for RL:

Complexity: On par with games like Go, StarCraft II, or Minecraft.
Decision Making: Requires intricate reasoning and multi-tasking.
Nonlinearity: The game's structure is not straightforward, adding to the challenge.
Duration: Takes around 25 hours on average for a new player to complete.

Why Use RL?

We explored several approaches before settling on reinforcement learning:

Supervised Learning: Would have required a well-labeled and plentiful dataset, which was impractical given our resources.
Behavioral Cloning: Attempted to imitate known speedrun routes but faced difficulties in creating an efficient data collection system.

Technical Details

Environment Setup

We relied on the work of the Pokémon Reverse Engineering Team (PRET) and on PyBoy, a Game Boy emulator written in Python, to inspect the running game and extract game data. These tools provided the infrastructure for our RL experiments.

Game State Representation: The state of the game was represented as a combination of observable in-game variables, such as player position, Pokémon stats, and inventory items.
Action Space: Actions included moving in different directions, interacting with objects, and using items or moves.
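
As a rough illustration of the two items above, the sketch below shows one way a game state and an action space could be written down in Python. It is not taken from our codebase: the `Action` members, the `GameState` fields, the `encode` helper, and the scaling constants are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Action(IntEnum):
    """Hypothetical discrete action set mirroring the Game Boy buttons."""
    UP = 0
    DOWN = 1
    LEFT = 2
    RIGHT = 3
    A = 4      # interact / confirm
    B = 5      # cancel / back
    START = 6  # open the menu


@dataclass(frozen=True)
class GameState:
    """Illustrative snapshot of the in-game variables named above."""
    map_id: int            # which map the player is on
    x: int                 # player tile coordinates
    y: int
    party_levels: tuple    # level of each Pokémon in the party
    party_hp_frac: tuple   # current HP / max HP for each party member
    item_count: int        # number of item slots in use


def _unit(value: float, scale: float) -> float:
    """Scale a value into [0, 1], clamping anything out of range."""
    return min(1.0, max(0.0, value / scale))


def encode(state: GameState, max_party: int = 6) -> list:
    """Flatten a GameState into a fixed-length list of floats in [0, 1]."""
    levels = [_unit(lv, 100.0) for lv in state.party_levels[:max_party]]
    hp = [_unit(h, 1.0) for h in state.party_hp_frac[:max_party]]
    # Pad empty party slots with zeros so the vector length never changes.
    levels += [0.0] * (max_party - len(levels))
    hp += [0.0] * (max_party - len(hp))
    position = [_unit(state.map_id, 255.0), _unit(state.x, 255.0), _unit(state.y, 255.0)]
    # 20.0 is an assumed scaling constant for the example, not a verified limit.
    return position + levels + hp + [_unit(state.item_count, 20.0)]
```

A fixed-length, normalized vector like this is convenient because it can be fed directly to a small dense network, and the integer values of `Action` can index that network's output.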

Training Process

The training process involved several key steps:

Reward Function: We designed a reward function to guide the agent towards the goal of becoming the champion. This included rewards for completing battles, acquiring new Pokémon, and progressing through the game's storyline.
Exploration vs. Exploitation: Balancing exploration (trying new actions) with exploitation (using known effective strategies) was crucial. We used techniques like epsilon-greedy and Boltzmann exploration to achieve this balance.
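
The two ideas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our actual training code: the counter names (`battles_won`, `pokemon_caught`, `badges`), the weights, and the function names are hypothetical. The reward is computed from increases in progress counters between two steps, and the two samplers show epsilon-greedy and Boltzmann (softmax) action selection over a list of per-action values.

```python
import math
import random


def event_reward(prev: dict, curr: dict, weights: dict) -> float:
    """Reward increases in progress counters between two consecutive steps."""
    total = 0.0
    for key, weight in weights.items():
        delta = curr.get(key, 0) - prev.get(key, 0)
        total += weight * max(delta, 0)  # only increases earn reward
    return total


def epsilon_greedy(values, epsilon, rng=random):
    """With probability epsilon pick a uniform random action, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])


def boltzmann_probs(values, temperature):
    """Softmax over action values; a higher temperature flattens the distribution."""
    top = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - top) / temperature) for v in values]
    norm = sum(exps)
    return [e / norm for e in exps]


def boltzmann_sample(values, temperature, rng=random):
    """Draw one action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]
```

With `epsilon=0.0` the first sampler is purely greedy, and as `temperature` approaches zero the second one behaves the same way; raising either parameter shifts the agent toward exploration.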

Policy Architecture

The policy network was designed to be lightweight yet effective:

Input Layer: Processed the game state.
Hidden Layers: Consisted of a few dense layers with ReLU activations.
Output Layer: Produced probabilities for each action in the action space.
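
To make the three layers concrete, here is a small pure-Python sketch of a dense network with ReLU hidden layers and a softmax output, plus a helper that counts parameters. The layer sizes, the initialization scheme, and the function names are assumptions made for the example; they do not describe the exact network we trained.

```python
import math
import random


def init_mlp(sizes, seed=0):
    """Create (weights, biases) for each dense layer given the layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def count_params(sizes):
    """Number of weights plus biases in a dense network with these sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


def forward(layers, x):
    """Run the input through the dense layers and return action probabilities."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only
    top = max(x)  # subtract the max before exponentiating, for stability
    exps = [math.exp(v - top) for v in x]
    norm = sum(exps)
    return [e / norm for e in exps]


# Example: 16 state features, two hidden layers of 64 units, 7 actions.
sizes = [16, 64, 64, 7]
policy = init_mlp(sizes)
probs = forward(policy, [0.5] * 16)
```

A helper like `count_params` makes it easy to check a candidate architecture against a parameter budget before training anything.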

Benchmarks and Performance

Training Time: The agent required several days of training on a single GPU to achieve competent performance.
Success Rate: The final policy completed the game and became champion with high reliability.
Generalization: Despite the small parameter count, the policy generalized robustly across different in-game scenarios.

Future Work

We are continuously improving the codebase and exploring new techniques to enhance the agent's performance. Contributions from the community are welcome, and we encourage readers to experiment with the provided code.