In early July 2025, Elon Musk presented Grok 4 to the world, a language model that, he said, reached "scientist-level reasoning" after being trained on xAI's Colossus supercomputer. More than generating fluent text, Grok 4 exposes its internal process: before delivering an answer, it breaks the problem into steps, tests alternative hypotheses, and only then states its conclusion, like a researcher sharing each phase of an experiment.
This staged reflection seeks to imitate the slow, deliberate side of human thinking, allowing the model to tackle complex challenges instead of relying only on immediate, instinctive answers. But how much of it is just for show?
Reflection or Illusion of Depth?
Recently, an article by Noam Brown, formerly at Meta and now at OpenAI, rekindled the discussion by pointing out that chains of thought, however elaborate, do not guarantee real understanding. Brown recalls that behind current reasoning models (such as Grok 4 and OpenAI's o1) there are still statistical correlations, not a mental model of the world. He points out that these AIs remain susceptible to hallucinations (stating false information with conviction) precisely because they "think" by leaning on patterns in data, not on lived experience or common sense.
In practice, reasoning models like Grok 4 incorporate chains of thought (chain-of-thought) that guide each answer they produce. It is as if, before answering "x = 2" for an equation, the system asked itself internally, "What if x were 3? How would that affect y?", and only then presented the correct solution. Experiments suggest that giving an AI a little more time to reflect can yield performance gains equivalent to multiplying the volume of training data by 100,000.
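To make the idea concrete, here is a minimal sketch of chain-of-thought prompting in Python. It assumes an OpenAI-compatible chat endpoint of the kind xAI exposes for Grok; the base URL, the model identifier "grok-4", and the prompts themselves are illustrative assumptions, not a description of the model's actual internal procedure.

```python
# Minimal sketch: the same question asked directly and with an explicit
# step-by-step instruction, via an OpenAI-compatible chat API.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.x.ai/v1",  # assumption: xAI's OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",          # placeholder credential
)

question = "If 2x + 4 = 8, what is x?"

# Direct prompt: the model answers immediately, with no visible reasoning.
direct = client.chat.completions.create(
    model="grok-4",  # hypothetical model identifier
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought prompt: the model is asked to lay out intermediate steps,
# testing candidate values before committing to a final answer.
cot = client.chat.completions.create(
    model="grok-4",
    messages=[{
        "role": "user",
        "content": (
            question
            + "\nThink step by step: write out the equation, check what "
              "happens for a candidate value of x, and only then give "
              "the final answer."
        ),
    }],
)

print("Direct answer:\n", direct.choices[0].message.content)
print("Step-by-step answer:\n", cot.choices[0].message.content)
```

The only difference between the two calls is the instruction appended to the prompt; in practice this small change is what elicits the visible "reasoning" the article describes.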
Limits of Artificial Reasoning
This advance has real implications: increasingly reliable AIs may suggest treatments in medical diagnosis, financial reports now incorporate structured risk analyses, and scientific research gains assistants capable of proposing hypotheses based on recent publications. A virtual tutor with reasoning can unpack quantum theories and adjust mathematical problems to each student's level. In the business world, analysts have an AI that not only collects data but structures future scenarios and weighs probabilities, almost like a human strategist.
However, the same reflection that raises Grok 4's potential exposes its weaknesses. Activating the deep reasoning mode consumes time and computing power; on very complex problems a response can take seconds or even minutes, quite unlike the speed of the human brain. And although the model shares its "thoughts", it has no consciousness, empathy, or values of its own, and it repeats biases present in its training data. If on the surface it shows each step, underneath it continues to "guess" the statistically most plausible output, without understanding the real meaning of the information.
This brings us to a decisive moment. We can use this capacity to reason to accelerate discoveries and democratize knowledge, but it is urgent to teach our AIs to reflect under ethical principles, with transparency and quality control, as critics such as Brown warn. Teaching machines to think, albeit differently from humans, is also an invitation to improve our own critical thinking: questioning not only the answers we receive, but also the form and limits of the reasoning we accept. In the end, perhaps the greatest lesson is this: by creating AIs that simulate scientific reasoning, we are challenged to refine our own critical eye, balancing technological ambition with social responsibility.