Revisiting the Bitter Lesson: LLMs and the Case for Reinforcement Learning


The Engineer

3 Oct 2025

Richard Sutton’s “Bitter Lesson” warns against building human expertise into AI systems, yet Sutton himself doubts that LLMs follow it, and he makes the case for reinforcement learning through interaction instead.

01 Oct, 2025

I recently had a chance to listen to the Dwarkesh Podcast episode with Richard Sutton, which sparked some interesting thoughts on the state of AI research, particularly in the context of Large Language Models (LLMs) and the "Bitter Lesson."

For those unfamiliar, Richard Sutton's "The Bitter Lesson" has become a foundational text in the LLM community. The essay argues that, over the long run, general methods that scale with computation, chiefly search and learning, beat methods that build in human knowledge. The community's shorthand for this is being "bitter lesson pilled," meaning an approach is considered sound if it keeps improving as more compute is added. LLMs, which have shown remarkable performance gains as they grow in size and training data, seem to embody this principle perfectly.

However, Sutton himself isn't so convinced that LLMs are truly "bitter lesson pilled." This is a bit of a twist for the community, given how much we've leaned on his ideas. Here’s why:

Finite Human Data: LLMs are trained on vast datasets of human-generated text, which, while extensive, are finite. What happens when you run out of new data? How do you avoid perpetuating biases inherent in human language?
Supervised Learning vs. Reinforcement Learning: Sutton argues that supervised learning, where models are trained to reproduce labeled examples, is fundamentally different from how animals learn. Animals don’t receive direct supervision; they learn through interaction and reinforcement. This raises questions about the scalability and generalizability of LLMs.

In the podcast, Dwarkesh Patel, representing the LLM researcher perspective, and Sutton, a self-described "classicist," have a fascinating exchange. Sutton envisions an AI system more akin to Alan Turing’s idea of a "child machine": a system that learns through dynamic interaction with the environment rather than through static pretraining on human data.

Here are some key points from Sutton’s perspective:

No Giant Pretraining Stage: In Sutton's view, there shouldn’t be a massive initial training phase where the model imitates web content. Instead, the model should learn continuously through interaction.
No Supervised Finetuning: Animals don’t receive direct instruction; they observe and learn from their environment. This is a crucial difference from LLMs, which often undergo supervised finetuning.
Intrinsic Motivation: Sutton emphasizes the importance of intrinsic rewards like curiosity and fun. These motivate learning in animals and should be incorporated into AI systems.
Continuous Learning: Unlike LLMs, which are typically trained once and then deployed, Sutton’s ideal system would continue to learn and adapt during deployment.
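
To make the "intrinsic motivation" point a bit more concrete, here is a minimal sketch of one common stand-in for curiosity: a count-based novelty bonus that pays more for states the agent has rarely seen. This is an illustration only, not something Sutton proposes in the podcast; the class name CountBasedCuriosity and the beta parameter are hypothetical.

```python
import math
from collections import Counter


class CountBasedCuriosity:
    """Hypothetical helper: pays a bonus that shrinks as a state becomes familiar."""

    def __init__(self, beta=1.0):
        self.beta = beta          # scale of the intrinsic reward (assumed value)
        self.visits = Counter()   # how often each state has been seen so far

    def bonus(self, state):
        # Record the visit, then return beta / sqrt(visit count).
        # A never-seen state earns the full bonus; repeat visits earn less.
        self.visits[state] += 1
        return self.beta / math.sqrt(self.visits[state])


curiosity = CountBasedCuriosity(beta=1.0)
first = curiosity.bonus("kitchen")    # novel state, full bonus
second = curiosity.bonus("kitchen")   # seen before, smaller bonus
other = curiosity.bonus("garden")     # a different novel state, full bonus again
```

In a full agent this bonus would typically be added to whatever reward the environment provides, and because the counts keep updating, the signal keeps changing for as long as the agent is running.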

Sutton’s approach is fundamentally rooted in reinforcement learning (RL), where the agent learns by interacting with its environment. This contrasts sharply with the pretraining-finetuning paradigm of LLMs. He argues that even if pretraining is seen as a form of initialization, it still introduces human biases that can limit the model's potential.
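
For readers who haven't seen RL up close, the loop below is a rough sketch of what "learning by interacting with the environment" looks like in its simplest form: tabular Q-learning on a toy five-state corridor. The environment, the function names (train_chain_agent, greedy_policy) and every hyperparameter are assumptions made for illustration; nothing here comes from the podcast. Note that the value table starts at zero, with no pretraining and no labeled examples.

```python
import random


def train_chain_agent(n_states=5, episodes=1000, alpha=0.1, gamma=0.9,
                      epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy corridor: start at state 0, reward 1 at the far end."""
    rng = random.Random(seed)
    # q[s][a] is the estimated return; a=0 moves left, a=1 moves right.
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Epsilon-greedy action choice, with random tie-breaking.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            # The environment responds: the agent moves and may be rewarded.
            next_state = max(0, state - 1) if action == 0 else state + 1
            done = next_state == goal
            reward = 1.0 if done else 0.0
            # Q-learning update: nudge the estimate toward the observed outcome.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


def greedy_policy(q):
    """Best action per non-goal state (1 = right); ties default to right."""
    return [0 if left > right else 1 for left, right in q[:-1]]


q_table = train_chain_agent()
policy = greedy_policy(q_table)
```

The only teaching signal is the reward the agent runs into at the end of the corridor. Nobody tells it which action was correct, which is the contrast with supervised finetuning that Sutton keeps returning to.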

One of Sutton’s favorite examples is AlphaZero vs. AlphaGo. AlphaZero, which starts with no human game data and learns purely through self-play, went on to outperform AlphaGo, which was initialized from human games. This suggests that starting with a clean slate and learning through interaction might be more powerful than relying on pre-existing data.

Sutton’s vision aligns more closely with the animal kingdom, where learning is continuous and driven by intrinsic motivations. He believes that if we can understand how simpler animals like squirrels learn, we would be much closer to solving AI.

My Take

While Sutton's ideas are compelling, they also raise significant practical challenges. LLMs have demonstrated impressive performance on a wide range of tasks, and their scalability with compute is hard to ignore. However, the issues of data finiteness and bias are real and need addressing. Perhaps a hybrid approach that combines elements of both worlds, leveraging large datasets while incorporating continuous learning and intrinsic motivations, could be the way forward.

In any case, Sutton’s critique serves as a valuable reminder to keep questioning our assumptions and exploring different paths in AI research.