Reinforcement Learning Won’t Give Us AGI — But It May Be Good Enough

Author: Aswin Anil

In a recent interview, Andrej Karpathy made a statement that surprised some and reassured others: artificial general intelligence, or AGI, is likely still a decade away. More importantly, he argued that reinforcement learning is not the key that will get us there.

Ilya Sutskever has echoed a similar sentiment in separate conversations. Coming from two of the most influential minds in modern AI, this perspective deserves attention.

Yet there is a contradiction that is hard to ignore.

Today’s large language models rely heavily on reinforcement learning to browse the web, write production-grade code, manage workflows, and even simulate running small businesses. If reinforcement learning is not the path to AGI, why does it sit at the center of so many recent capability gains?

Before anyone starts panic-selling GPU stocks, it helps to understand what is actually wrong with scaling reinforcement learning for language models.

The Core Problem With Reinforcement Learning

The biggest limitation of reinforcement learning for language models is not compute. It is information.

Traditional pretraining uses next-token prediction. Every generated token receives feedback. The signal is dense, continuous, and specific. The model learns not just the final answer, but every intermediate step that leads to it.
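To make the density of that signal concrete, here is a minimal PyTorch sketch with made-up tensor shapes (not any lab's actual training code): every position in the sequence contributes its own loss term.

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch of 2 sequences, 16 tokens each, vocabulary of 1000.
# `logits` stands in for a language model's output; here it is random.
batch, seq_len, vocab = 2, 16, 1000
logits = torch.randn(batch, seq_len, vocab)
targets = torch.randint(0, vocab, (batch, seq_len))

# Cross-entropy at every position: one feedback signal per generated token.
per_token_loss = F.cross_entropy(
    logits.reshape(-1, vocab), targets.reshape(-1), reduction="none"
).reshape(batch, seq_len)

print(per_token_loss.shape)  # (2, 16): 32 separate learning signals
```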

Reinforcement learning works very differently.

In most language model reinforcement learning setups, the model generates an entire sequence—sometimes hundreds of tokens long—and only receives a single scalar reward at the end. Right or wrong. Pass or fail.
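Contrast that with a bare-bones policy-gradient update, sketched below in the spirit of REINFORCE rather than any specific lab's recipe: the entire sequence shares one scalar reward.

```python
import torch

# Log-probabilities of each sampled token under the policy (toy values).
batch, seq_len, vocab = 2, 16, 1000
log_probs = torch.log_softmax(torch.randn(batch, seq_len, vocab), dim=-1)
sampled = torch.randint(0, vocab, (batch, seq_len))
token_log_probs = log_probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)

# One scalar reward per *sequence*: pass (1.0) or fail (0.0).
rewards = torch.tensor([1.0, 0.0])

# REINFORCE-style loss: every token in a sequence is scaled by the same reward,
# so the model never learns which individual step helped or hurt.
loss = -(rewards.unsqueeze(-1) * token_log_probs).sum(dim=-1).mean()
print(loss.item())
```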

A useful analogy is learning math with no step-by-step correction. You submit a full solution and only get told whether the final answer is correct. You might understand the method but make a small arithmetic error. The feedback gives you no clue where things went wrong.

This sparse signal introduces noise. A lot of it.

Karpathy’s criticism points directly at this issue. Reinforcement learning throws away most of the information that made pretraining so effective in the first place.

Why We Use Reinforcement Learning Anyway

If reinforcement learning is so noisy, why does the field keep using it?

The answer is exploration.

When a model no longer receives token-by-token supervision, it can optimize for outcomes instead of exact sequences. This freedom allows it to explore multiple reasoning paths, test alternatives, and occasionally discover strategies that supervised learning struggles to surface consistently.
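One common way this exploration is operationalized, sketched here with placeholder helpers (`sample_completions` and `check_answer` are hypothetical, and the centering step is in the spirit of group-relative methods rather than any particular implementation), is to sample several candidate solutions for the same prompt and reinforce whichever ones happened to succeed:

```python
import random
import torch

def sample_completions(prompt, k):
    # Placeholder for sampling k reasoning paths from the current policy.
    return [f"{prompt} -> candidate solution {i}" for i in range(k)]

def check_answer(completion):
    # Placeholder verifier: 1.0 if the final answer checks out, else 0.0.
    return float(random.random() < 0.3)

def group_relative_advantages(prompt, k=8):
    completions = sample_completions(prompt, k)
    rewards = torch.tensor([check_answer(c) for c in completions])
    # Center rewards within the group: paths that beat their siblings get a
    # positive advantage and are reinforced; the rest are pushed down.
    # Exploration pays off whenever one sampled path finds a better strategy.
    advantages = rewards - rewards.mean()
    return completions, advantages

completions, advantages = group_relative_advantages("Solve 17 * 24", k=8)
print(advantages)
```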

DeepSeek’s well-known “self-correction” behavior emerged through reinforcement learning. Pretraining contains the knowledge needed for this behavior, but reinforcement learning makes it reliable.

Anything behavioral—tool use, planning, self-verification, agentic workflows—benefits from reinforcement learning.

That is why reinforcement learning keeps showing up, even if it is not elegant.

The Hidden Cost: Reduced Generalization

There is another issue that receives less attention.

Reinforcement learning sharpens the model’s sampling distribution. It pushes probability mass toward actions that maximize reward on a specific task.

This improves performance where you measure it, but it often reduces diversity and generalization elsewhere.

In other words, reinforcement learning makes models better specialists and worse generalists.
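A toy illustration of that sharpening, using made-up numbers rather than measurements from any real model: reward-weighted updates concentrate probability on the rewarded answer, and the entropy of the sampling distribution collapses.

```python
import torch

# A model's distribution over four candidate answers before RL (toy numbers).
before = torch.tensor([0.40, 0.30, 0.20, 0.10])

# After many reward-weighted updates that favor answer 0, the mass concentrates.
after = torch.tensor([0.94, 0.03, 0.02, 0.01])

def entropy(p):
    return -(p * p.log()).sum().item()

print(f"entropy before RL: {entropy(before):.2f} nats")  # ~1.28
print(f"entropy after RL:  {entropy(after):.2f} nats")   # ~0.29
# Lower entropy means better performance on the rewarded answer,
# but less diversity left over for everything else.
```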

This tradeoff is acceptable for practical systems. It is far less acceptable if the goal is true general intelligence.

Supervised fine-tuning still lays the foundation. Reinforcement learning builds on top of it. It does not replace it.

Why Reinforcement Learning Alone Won’t Produce AGI

If AGI requires broad, transferable intelligence, reinforcement learning struggles to provide it at scale.

It optimizes behavior without deeply expanding knowledge. It improves what the model already knows how to do, rather than teaching it fundamentally new concepts.

At its worst, reinforcement learning risks turning into a massive collection of conditional behaviors. The system feels intelligent, but only because it has been trained to respond correctly to an enormous number of situations.

That illusion can be powerful. It is not the same as general intelligence.

A Crucial Observation: Reinforcement Learning Barely Changes the Model

Recent research reveals something unexpected.

Reinforcement learning fine-tuning often modifies only a small fraction of a language model’s parameters. In some cases, as little as five percent.

Meta researchers and independent labs have observed that reinforcement learning traverses a structured optimization landscape. It makes targeted updates rather than sweeping changes.
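One way to probe this kind of claim yourself, as a rough diagnostic rather than the methodology of the papers cited here, is to diff the weights before and after an RL run and count how many of them actually moved:

```python
import torch

def fraction_updated(state_before, state_after, threshold=1e-6):
    """Rough diagnostic: share of weights that moved by more than `threshold`."""
    changed, total = 0, 0
    for name, w_before in state_before.items():
        w_after = state_after[name]
        changed += (w_after - w_before).abs().gt(threshold).sum().item()
        total += w_before.numel()
    return changed / total

# Usage sketch with toy "checkpoints"; real state dicts would come from
# model.state_dict() before and after the RL run.
before = {"layer.weight": torch.zeros(4, 4)}
after = {"layer.weight": torch.zeros(4, 4)}
after["layer.weight"][0, 0] = 0.5  # pretend RL only touched one weight
print(fraction_updated(before, after))  # 0.0625 -> ~6% of parameters changed
```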

This observation raises an important question.

If reinforcement learning only needs small, efficient updates, why fine-tune the entire model?

Enter LoRA: Low-Rank Adaptation

Low-Rank Adaptation, commonly known as LoRA, offers a compelling answer.

In standard fine-tuning, the model learns a massive update matrix with the same dimensionality as the original weights. This process is expensive and memory-intensive.

LoRA takes a different approach.

Instead of learning a full update matrix, it decomposes that update into two much smaller matrices. When multiplied together, they approximate the original update.

The base model stays frozen. Only the small LoRA matrices train.

The result is a compressed representation of the learned behavior.
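A minimal sketch of that decomposition, with toy dimensions and loosely following the original LoRA formulation rather than any specific library's internals:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A."""

    def __init__(self, d_in, d_out, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)           # base model stays frozen
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))   # zero init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # A full update would be a d_out x d_in matrix; B @ A approximates it
        # with only (d_out + d_in) * rank trainable parameters.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d_in=4096, d_out=4096, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable parameters vs 16,777,216 in the frozen base weight
```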

Is LoRA as Good as Full Fine-Tuning?

This question matters.

If LoRA fundamentally limits performance, it cannot replace full fine-tuning for serious reinforcement learning.

A recent analysis from Thinking Machines addressed this directly in a post titled “LoRA Without Regret.”

The conclusion was clear: under the right conditions, LoRA can match full fine-tuning.

It struggles when the task resembles pretraining, where large amounts of new knowledge must be absorbed. But reinforcement learning does not operate in that regime.

Reinforcement learning delivers sparse information.

Why Reinforcement Learning Fits LoRA Perfectly

From an information theory perspective, reinforcement learning carries very little signal.

Each episode provides roughly one bit of information: success or failure.

DeepSeek R1, for example, trained on approximately 5.3 million reinforcement learning episodes. That amounts to about 5.3 million bits of signal.

Even a rank-one LoRA adapter contains millions of parameters.

The capacity is more than sufficient.
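The back-of-the-envelope arithmetic, using the figures above plus an assumed 70B-class model configuration (the layer counts and hidden size below are illustrative, not taken from any published spec):

```python
# Information available from the RL signal (the article's estimate).
episodes = 5_300_000
bits_of_signal = episodes * 1            # ~1 bit per pass/fail episode
print(bits_of_signal / 8 / 1e6)          # ~0.66 MB of raw signal

# Capacity of a rank-1 LoRA adapter on an assumed 70B-class model.
hidden = 8192            # hidden size (illustrative)
layers = 80              # transformer blocks (illustrative)
matrices_per_layer = 7   # q, k, v, o projections + 3 MLP matrices (illustrative)
rank = 1

# Each adapted d_out x d_in matrix adds (d_out + d_in) * rank parameters;
# treating them all as hidden x hidden keeps the arithmetic simple.
adapter_params = layers * matrices_per_layer * (hidden + hidden) * rank
print(adapter_params)                    # ~9.2 million parameters
```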

Thinking Machines tested this hypothesis empirically. They applied reinforcement learning using LoRA adapters and compared it to full fine-tuning.

The learning curves overlapped.

Even at rank one, performance was indistinguishable.

The Rules for Making It Work

This equivalence holds only under specific conditions.

First, LoRA must be applied to all relevant layers, including the MLP blocks, not just attention. Limiting adapters to the attention layers alone significantly underperforms.

Second, learning rates matter. Optimal LoRA learning rates tend to be about ten times higher than those used for full fine-tuning.

Third, LoRA is less tolerant of extremely large batch sizes. Pushing batches too high destabilizes optimization.

For reinforcement learning, these constraints are manageable.
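A configuration sketch in that spirit, assuming the Hugging Face `peft` library and a LLaMA-style module naming scheme (the model identifier and learning-rate numbers are placeholders; adapt them to your setup):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder id

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    # Rule 1: adapt the attention *and* MLP projections, not attention alone.
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Rule 2: if full fine-tuning would use ~1e-6, start LoRA around ~1e-5.
learning_rate = 1e-5
# Rule 3: keep batch sizes moderate; very large batches can destabilize LoRA.
```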

The computational benefits are significant. LoRA-based reinforcement learning can reduce floating-point operations by roughly one third.

What This Means for the Future of AI

Reinforcement learning combined with LoRA changes the economics of experimentation.

Researchers can iterate faster. Companies can personalize models cheaply. Providers can swap behaviors modularly.

Imagine toggling advanced reasoning on demand by activating a specialized adapter. The core model remains untouched.
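A sketch of how that toggling might look in practice, again assuming the `peft` library and hypothetical adapter paths:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder id
model = PeftModel.from_pretrained(
    base, "adapters/general-chat", adapter_name="chat"          # hypothetical path
)
model.load_adapter("adapters/advanced-reasoning", adapter_name="reasoning")  # hypothetical path

# The frozen base model never changes; behavior switches with the active adapter.
model.set_adapter("reasoning")   # advanced reasoning on demand
model.set_adapter("chat")        # back to the default behavior
```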

This is not AGI.

But it may be good enough.

AGI or Not, This Path Still Matters

Reinforcement learning may never unlock true general intelligence.

That does not make it irrelevant.

It boosts productivity, enables personalization, and pushes practical systems forward at a pace that matters today.

If 2025 was the year of AI agents, 2026 may be the year of performant, personalized agents—powered by efficient reinforcement learning and modular adaptation.

AGI can wait.

The impact cannot.


Sources:

  • Andrej Karpathy interviews and public talks
  • Ilya Sutskever interviews on AI scaling
  • Thinking Machines Blog – “LoRA Without Regret”
  • Meta AI research on reinforcement learning fine-tuning
  • DeepSeek R1 technical reports