Do AI Models Really Think? Apple, Claude, and the Truth About Reasoning LLMs
- Patrick Law
- Jun 15
- 2 min read
Can today’s AI models really think, or are they just good at guessing? A recent paper from Apple argues they can’t, and it sparked a rebuttal co-authored by an AI model itself. Here’s what engineers need to know.
Apple’s paper “The Illusion of Thinking” raised an important question: are reasoning-focused large language models (LLMs) like GPT-4, Claude, or Gemini actually thinking, or are they just matching patterns? To find out, Apple tested them on classic logic puzzles like:
Tower of Hanoi
River Crossing
Blocks World
Checker Jumping
These tasks were chosen for their controllable complexity and long history in AI research. The models had to plan several steps ahead and explain their work using chain-of-thought reasoning. Apple observed that as the tasks got harder, the models failed: accuracy collapsed and their reasoning traces actually got shorter, as if they were “giving up.”
The paper gained major attention. On the surface, it suggested that today’s most advanced models weren’t truly capable of reasoning, and certainly weren’t on the path to AGI.
However, backlash came quickly—and credibly.
A new paper, “The Illusion of the Illusion of Thinking,” co-authored by independent researcher Alex Lawsen and Anthropic’s Claude Opus 4, argued that Apple’s tests were flawed. Critics pointed out:
No human baseline was provided. We don’t know if people also fail these tasks without tools.
Token limits caused models to “run out of space” when outputting long solutions.
Correct reasoning was often marked wrong due to rigid evaluation criteria.
Some puzzles were mathematically unsolvable, yet still scored.
When allowed to answer differently, using compressed formats like code or logic functions, the same models solved more complex versions of the problems (see the sketch below). This showed the models could reason; they were simply constrained by how the test was designed.
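To make the “compressed format” point concrete: listing every one of the 2^n − 1 moves for an n-disk Tower of Hanoi exhausts a model’s output token budget long before its reasoning runs out, while a short generator function does not. The sketch below is illustrative only, not the rebuttal’s exact prompt or code:

```python
# A compressed Tower of Hanoi "answer": instead of writing out all 2**n - 1
# moves as text, return a short function that generates them on demand.
def hanoi(n, source="A", target="C", spare="B"):
    """Yield (disk, from_peg, to_peg) moves solving n-disk Tower of Hanoi."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)   # move n-1 disks out of the way
    yield (n, source, target)                        # move the largest disk
    yield from hanoi(n - 1, spare, target, source)   # stack the n-1 disks back on top

moves = list(hanoi(10))
print(len(moves))   # 1023 moves, expressed in a few lines instead of pages of text
print(moves[:3])    # first few moves
```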
What It Means for Singularity’s Engineering Workflow
This debate isn’t academic—it impacts real-world engineering workflows.
If you’re building AI into systems for:
Troubleshooting or diagnostics
Multi-step planning
Control system logic
Process simulation agents
You might see failures that aren’t actual reasoning errors. The model may simply lose track of previous steps, run out of context or output tokens, or get penalized for returning the answer in the wrong format.
What helps?
Break tasks into smaller prompts
Use function-based outputs or code instead of long text
Store prior reasoning steps externally
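Here is a minimal sketch of what that looks like in practice. `call_llm` is a stand-in for whatever client you actually use (OpenAI, Anthropic, a local model), and the JSON schema is purely illustrative:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client call; swap in your real API here."""
    raise NotImplementedError

def run_diagnostic(steps: list[str]) -> list[dict]:
    """Run a multi-step task as small prompts, storing prior reasoning externally."""
    scratchpad: list[dict] = []  # external memory of prior findings
    for i, step in enumerate(steps, 1):
        prompt = (
            f"Prior findings: {json.dumps(scratchpad)}\n"
            f"Step {i}: {step}\n"
            'Respond only with JSON: {"finding": str, "confidence": float}'
        )
        result = json.loads(call_llm(prompt))     # compact, structured output
        scratchpad.append({"step": i, **result})  # persist for later steps
    return scratchpad
```

Each step sees only a compact summary of prior findings rather than the full transcript, which keeps prompts short and makes the intermediate reasoning easy to audit.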
At Singularity, this reinforces our approach: don’t assume model failure equals model weakness. Design the prompt, memory, and outputs to match the job. Good engineering tools require good engineering inputs.
Want to build smarter AI tools for real-world tasks? Advance your AI skills with our Udemy course – Singularity AI for Engineers
