The Illusion of Thinking

By Franco on June 15, 2025

AI models are getting pretty good at sounding smart. But are they actually thinking?

That’s the question behind a recent paper from Apple researchers, who compared standard large language models (LLMs) to newer models designed for reasoning, so-called Large Reasoning Models (LRMs). These models don’t just spit out answers. They talk through their thinking step by step, often using chain-of-thought and self-reflection.

But does all that extra “thinking” actually help? Turns out, the answer is more complicated than it seems.

Rethinking How We Test Model Reasoning

A lot of benchmarks used to test reasoning—like math problems or coding tasks—are probably in the model’s training data. This makes it hard to tell if the model is actually reasoning or just remembering something it saw before.

To get around this, the researchers came up with a new puzzle-based approach. The puzzles are easy to scale in difficulty but hard to memorise. These puzzles test planning, logical rules, and step-by-step execution.

Puzzles are a great fit for this kind of evaluation because they offer a high degree of control. Researchers can easily adjust the difficulty by tweaking simple parameters. The rules are also crystal clear, which removes ambiguity from the setup. And best of all, puzzles let us look inside the model’s reasoning process. By examining each step it takes, we can get a clearer sense of how (or whether) it's actually thinking through the problem.
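To make this concrete, here is a minimal sketch of what such a controllable puzzle environment can look like, using Tower of Hanoi (one of the puzzles the paper uses) as the example. Difficulty scales with a single parameter, the number of disks, the rules are unambiguous, and every move a model proposes can be checked individually. The function and its interface below are illustrative, not taken from the paper.

# Illustrative sketch (not from the paper): a Tower of Hanoi checker.
# Difficulty is set by num_disks; each proposed move is verified against the rules.

def check_solution(num_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Return True if `moves` (a list of (from_peg, to_peg) pairs, pegs 0-2)
    legally transfers all disks from peg 0 to peg 2."""
    pegs = [list(range(num_disks, 0, -1)), [], []]  # peg 0 holds disks n..1, smallest on top

    for src, dst in moves:
        if not pegs[src]:
            return False                      # moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                      # placing a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())

    return pegs[2] == list(range(num_disks, 0, -1))  # all disks end up on peg 2


# A model's proposed move sequence can be scored step by step:
print(check_solution(2, [(0, 1), (0, 2), (1, 2)]))  # True: optimal 3-move solution
print(check_solution(2, [(0, 2), (0, 2)]))          # False: illegal second move

Because the checker only encodes the rules, it can grade any move sequence a model produces, which is what makes this kind of evaluation transparent.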

How Model Performance Changes with Puzzle Complexity

The researchers ran both LRMs and regular LLMs through these puzzles at different levels of difficulty. The results were surprising:

For the easier puzzles, it turned out that regular LLMs were more efficient and effective than the more sophisticated LRMs. They were quicker, more accurate, and didn’t waste time overcomplicating the task.

On medium-difficulty puzzles, LRMs began to show their strength. Their ability to reason step by step gave them the edge in navigating problems that left simpler models stuck.

But once the puzzles got really hard, both types of models collapsed. Neither could maintain accuracy, and the LRMs did something unexpected—they started using fewer reasoning tokens as the challenges increased, almost like they gave up halfway through.

Claude 3.7 Sonnet vs. Claude 3.7 Sonnet (thinking)

Source: “The Illusion of Thinking,” Apple, 2025.

When Reasoning Backfires and Overthinking Takes Over

One of the more curious behaviours the researchers noticed was something they described as overthinking. On simple puzzles, LRMs often started out strong, quickly landing on the correct solution. But instead of stopping there, they kept going—eventually veering off track and talking themselves into the wrong answer. With moderately challenging puzzles, they still found the solution, but only after a series of false starts and dead-end ideas. When faced with complex puzzles, however, the models simply stalled. They couldn't figure out a viable approach and never reached the correct answer.

What’s even more concerning is that these models don’t seem to learn from their own missteps. They don’t go back, reassess, or revise their thinking once they’ve taken a wrong turn. Once they’re off course, they stay off course.

This kind of overthinking and failure to self-correct is something many developers have experienced firsthand when using these models for coding. The model might start with a promising idea, but as the conversation continues, it can easily spiral into confusion, contradict earlier logic, or repeat unnecessary steps. It's like watching it get lost in its own train of thought, without ever retracing its steps.

Struggles with Following Instructions Even When Given the Answer

At first glance, it seems fair to assume that even if these models struggle to figure out solutions on their own, they should at least be able to follow a given algorithm. After all, executing a clear set of steps should be simpler than discovering them from scratch.

But that’s not what happened. In one of the experiments, the researchers provided the model with a correct and complete algorithm—for example, how to solve the Tower of Hanoi puzzle—and simply asked it to carry out the steps. Surprisingly, the model still failed, and at roughly the same level of complexity as when it was left to solve the problem on its own.
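For reference, the standard recursive solution to Tower of Hanoi looks like the sketch below. It is only a stand-in for the kind of procedure the researchers handed to the model, not the exact prompt from the paper. Even pure execution means producing an exact sequence of 2^n - 1 moves, which grows quickly with the number of disks.

# The textbook recursive Tower of Hanoi procedure, shown here as an example of
# the kind of algorithm the model was asked to execute (illustrative, not the
# paper's exact prompt).

def hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Return the full move list for transferring n disks from src to dst."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, src, dst, aux)      # park the top n-1 disks on the spare peg
        + [(src, dst)]                   # move the largest disk
        + hanoi(n - 1, aux, src, dst)    # stack the n-1 disks back on top of it
    )


# Executing the algorithm is mechanical, but the output grows as 2^n - 1 moves:
for n in (3, 7, 10):
    print(n, len(hanoi(n)))  # 3 -> 7, 7 -> 127, 10 -> 1023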

This result suggests a deeper limitation. It’s not just that the models have trouble discovering solutions; they also struggle with faithfully following through even when the path is laid out for them.

Even when Claude 3.7 Sonnet is given the correct Tower of Hanoi algorithm to execute, performance collapses at the same point as when solving it from scratch.

Source: “The Illusion of Thinking,” Apple, 2025.

Narrow Scope of the Evaluation and Its Implications

While the paper offers valuable insights, it focuses on a rather narrow view of AI reasoning. Most of the experiments centre on clean, symbolic, rule-based puzzles: domains where LLMs are traditionally weaker. Real-world reasoning problems often involve ambiguity, nuance, or context that doesn’t translate neatly into puzzles.

Another limitation is that the study emphasises execution rather than ideation. A model might struggle to carry out a specific algorithm step by step, but it may still be quite capable of generating the algorithm in the first place. That kind of conceptual creativity—coming up with good ideas even if they can’t be fully executed—isn’t really captured here.

Why More Thinking Does Not Always Mean Better Reasoning

This paper makes a strong case against the assumption that more reasoning steps lead to better outcomes. It turns out that generating longer explanations or thinking traces often just creates more opportunities for models to get off track.

The key takeaway here is that just because a model appears to be thinking, laying out detailed reasoning or engaging in step-by-step logic, doesn’t mean it’s actually reasoning effectively. What really matters is consistency and reliability, especially when a task becomes more complex.

At this point, we’re still far from having models that can reason in a robust and generalisable way. The gap between looking smart and being smart remains wide.

Sources: