
As developers discovered limitations in using LLMs like GPT-4 for complex tasks, a new approach emerged: integrating reinforcement learning to empower AI agents with the autonomy and adaptability needed to excel beyond mere language generation.
In April 2023, just weeks after the launch of GPT-4, two ambitious projects, BabyAGI and AutoGPT, captured the attention of developers worldwide. These frameworks leveraged large language models (LLMs) like GPT-4 to create autonomous agents capable of solving complex tasks. The idea was simple: prompt GPT-4 with a goal (e.g., "create a 7-day meal plan"), have it generate a to-do list, and then tackle each task step by step.
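The plan-then-execute idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual BabyAGI or AutoGPT code: `run_task_loop`, the prompt wording, and the `fake_llm` stand-in are all hypothetical, and a real agent would call a model API where `fake_llm` appears.

```python
from collections import deque
from typing import Callable, List


def run_task_loop(goal: str, llm: Callable[[str], str], max_steps: int = 10) -> List[str]:
    """Ask the model for a to-do list once, then work through it in order."""
    # Step 1: turn the goal into a to-do list (one task per line).
    plan = llm(f"Goal: {goal}\nList the tasks needed, one per line.")
    tasks = deque(line.strip() for line in plan.splitlines() if line.strip())

    # Step 2: tackle each task, feeding earlier results back in as context.
    results: List[str] = []
    while tasks and len(results) < max_steps:
        task = tasks.popleft()
        context = "\n".join(results)
        results.append(llm(f"Goal: {goal}\nDone so far:\n{context}\nNow do: {task}"))
    return results


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, so the sketch runs offline."""
    if "List the tasks" in prompt:
        return "Pick recipes\nWrite shopping list\nAssign meals to days"
    return "done: " + prompt.rsplit("Now do: ", 1)[-1]


if __name__ == "__main__":
    for line in run_task_loop("create a 7-day meal plan", fake_llm):
        print(line)
```

Note that nothing in this loop checks whether a step actually advanced the goal; every decision is delegated to the model's next completion.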
However, the initial excitement quickly waned as it became evident that GPT-4, despite its impressive capabilities, wasn't designed for this kind of multi-step reasoning. While it could generate reasonable to-do lists and sometimes complete individual tasks, it often struggled to maintain focus and coherence over multiple steps.
LLMs like GPT-4 excel at generating text based on context but fall short when it comes to sustained, goal-directed behavior. This is where reinforcement learning (RL) enters the picture. RL is a type of machine learning that focuses on training agents to make decisions in complex environments through trial and error.
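Reinforcement learning involves an agent interacting with an environment to maximize a reward signal. The key components are the agent, the environment it acts in, the states it observes, the actions it can take, and the rewards it receives.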
The goal of RL is to learn a policy (a strategy that dictates what action the agent should take in any given state) that maximizes cumulative rewards over time. This approach is fundamentally different from supervised learning, where models are trained on labeled data, and unsupervised learning, which focuses on discovering patterns in data without explicit guidance.
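To make the agent, environment, reward, and policy concrete, here is a minimal sketch using tabular Q-learning on a toy five-cell corridor. Tabular Q-learning is a classic textbook algorithm chosen here for brevity; it is not how agentic language models are trained. The environment, the function names (`step`, `train`, `greedy_policy`), and the hyperparameters are all assumptions made for illustration.

```python
import random

N_STATES = 5      # corridor cells 0..4; cell 4 is the goal (terminal state)
ACTIONS = (0, 1)  # 0 = step left, 1 = step right


def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Learn action values by trial and error with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # cap episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                # Exploit the best-known action, breaking ties at random.
                action = max(ACTIONS, key=lambda a: (q[state][a], rng.random()))
            nxt, reward, done = step(state, action)
            # Move the estimate toward reward plus discounted future value.
            target = reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


def greedy_policy(q):
    """The learned policy: the highest-valued action in each non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    print(greedy_policy(train()))
```

The agent is never told that moving right is correct. It only receives a reward of 1 on reaching the goal, and the discount factor `gamma` propagates that signal back to earlier states, so the greedy policy ends up preferring the shorter path.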
Agentic models like Claude 3.5 Sonnet and o3 have emerged as a result of advancements in RL techniques. These models are designed to handle multi-step reasoning and maintain focus over extended periods. Here’s how they differ from traditional LLMs:

Training agentic models involves several technical challenges:
Agentic models have shown significant improvements in tasks that require multi-step reasoning:
The success of agentic models has opened up new avenues for research and development:
While LLMs like GPT-4 have revolutionized natural language processing, they fall short when it comes to sustained, goal-directed behavior. Reinforcement learning has emerged as a powerful technique for training agentic models that can handle complex tasks over
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
24 June 2025