What Is the Clever Hans Effect in Math?

The Clever Hans effect describes a situation where someone (or something) appears to solve math problems correctly but is actually picking up on subtle cues rather than doing real math. The name comes from a horse in early 1900s Germany that seemed to perform arithmetic, and the concept now applies broadly to classrooms, standardized testing, and artificial intelligence.

The Horse That Could “Do Math”

Clever Hans was a horse whose owner claimed it could add, subtract, multiply, and even work with fractions. The horse would tap its hoof the correct number of times to answer math questions, and audiences were astonished. For years, no one could figure out the trick, partly because the owner genuinely believed the horse understood mathematics.

In 1907, psychologist Oskar Pfungst ran a series of careful experiments and cracked the mystery. He discovered two critical things. First, Hans couldn’t answer any question when the person asking it didn’t already know the answer. Second, the horse failed completely when a screen blocked its view of the questioner’s face. Hans wasn’t doing math at all. He was reading tiny, involuntary changes in the questioner’s body language: a slight lean forward, a shift in posture, a subtle change in facial tension as the hoof taps approached the right number. The questioner had no idea they were giving these signals.

How It Shows Up in Math Education

The same dynamic plays out in classrooms and tutoring sessions every day. A teacher asks a student to solve a problem, and the student starts working through it out loud. As the student gets closer to the right answer, the teacher nods slightly, leans in, or changes their tone of voice. If the student heads in the wrong direction, the teacher’s expression tightens or they pause. The student learns to read these cues and arrives at the correct answer without genuinely understanding the underlying math.

This creates what educators sometimes call pseudo-competence. The student appears to grasp the material during guided practice, but falls apart on independent work, standardized tests, or when a different teacher asks the same type of question in an unfamiliar way. The pattern is especially common in one-on-one tutoring, where the student has a single person’s body language to focus on, and in oral problem-solving, where the back-and-forth creates plenty of opportunities for unintentional signaling.

It also happens at a structural level. Students learn that the answer to a word problem is almost always found by using the operation they just learned that week. They don’t need to understand the problem; they just apply whatever technique was covered in the current chapter. The textbook’s organization becomes the cue, much like the questioner’s posture was the cue for Hans. When these students encounter mixed-format tests or real-world problems that don’t come pre-labeled by topic, their performance drops sharply.

The Clever Hans Effect in AI and Machine Learning

The term has taken on a second life in artificial intelligence research, where it describes AI systems that appear to solve math problems (or other tasks) by latching onto patterns in the data that have nothing to do with actual reasoning. A 2019 study published in Nature Communications found a striking example: an image-recognition model trained to identify horses in photographs wasn’t actually recognizing horses. Instead, it had learned that a small copyright watermark in the corner of the image reliably appeared on horse photos. The watermark was an artifact of how the dataset was assembled, and the model exploited it as a shortcut. It performed well on the benchmark but would fail in real-world use where that watermark wouldn’t be present.

The researchers described a “surprisingly rich spectrum” of these behaviors across different AI systems, all of which went unnoticed by standard performance metrics. In other words, the AI looked competent on paper but was doing the equivalent of reading the questioner’s body language.

This matters for AI math performance specifically. A 2024 study on large language models found that several major model families, including those from OpenAI, Meta, and Mistral AI, performed measurably better on benchmark questions that could be predicted by simple text patterns alone. The implication is that some portion of what looks like mathematical reasoning may actually be pattern-matching on superficial features of the question: the way it’s worded, the format of the answer choices, or statistical regularities in training data. When those surface cues are removed or scrambled, performance drops.

How Researchers Detect It

In AI, detecting the Clever Hans effect requires looking beyond overall accuracy scores. One common approach is subgroup analysis: breaking the test data into slices and checking whether the model performs well across all of them or only on certain types of questions. A model that aces one subset but fails on a closely related one is likely relying on a cue specific to that subset rather than genuine understanding.
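The idea behind subgroup analysis can be sketched in a few lines of Python. This is a minimal illustration, not code from any particular study: `results` is a hypothetical list of (question type, was-correct) records, and the function simply reports accuracy per slice instead of one overall number.

```python
from collections import defaultdict

def subgroup_accuracy(results):
    """Return per-subgroup accuracy from (subgroup, correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # subgroup -> [num_correct, num_seen]
    for subgroup, correct in results:
        totals[subgroup][0] += int(correct)
        totals[subgroup][1] += 1
    return {g: c / n for g, (c, n) in totals.items()}

# Hypothetical evaluation records: familiar word problems vs. the same
# math reworded into an unfamiliar phrasing.
results = [
    ("word_problem", True), ("word_problem", True), ("word_problem", True),
    ("same_math_reworded", False), ("same_math_reworded", True),
]
print(subgroup_accuracy(results))
# A model that aces "word_problem" but stumbles on "same_math_reworded"
# is likely keying on surface cues rather than the underlying math.
```

Overall accuracy here would be a respectable 80%, which is exactly how the effect hides: the aggregate metric looks fine while one slice quietly collapses.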

Another method is out-of-distribution testing, which gives the model data that looks different from what it trained on. If the model truly learned mathematical reasoning, it should handle novel problem formats. If it learned shortcuts, its performance collapses. Researchers also use attribution maps, which visualize which parts of the input the model pays attention to. When a math-solving AI focuses on irrelevant features of a problem (like formatting or position of numbers) rather than the mathematical relationships, that’s a Clever Hans signal.
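Out-of-distribution testing is easiest to see with a deliberately flawed toy solver. The sketch below (purely illustrative; the "solver" is a stand-in for a trained model) has learned only the exact surface format of its training questions, so it scores perfectly in-distribution and collapses the moment the wording changes:

```python
import re

def shortcut_solver(question):
    """Hypothetical solver that only handles the exact training format."""
    m = re.fullmatch(r"(\d+) \+ (\d+) = \?", question)
    if m:
        return int(m.group(1)) + int(m.group(2))
    return None  # fails on anything outside the familiar format

def accuracy(solver, dataset):
    """Fraction of (question, answer) pairs the solver gets right."""
    return sum(solver(q) == a for q, a in dataset) / len(dataset)

in_distribution = [("2 + 3 = ?", 5), ("10 + 4 = ?", 14)]
out_of_distribution = [("What is 2 plus 3?", 5), ("Add 10 and 4.", 14)]

print(accuracy(shortcut_solver, in_distribution))       # 1.0
print(accuracy(shortcut_solver, out_of_distribution))   # 0.0
```

The gap between the two scores is the diagnostic: genuine arithmetic ability would survive the rewording, while format-matching does not.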

Counterfactual testing offers another window: researchers make small, mathematically irrelevant changes to a problem (rewording, reordering answer choices) and watch whether the model’s answer changes. A system doing real math wouldn’t be affected by cosmetic changes. One relying on surface cues often is.
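A counterfactual test of this kind can be sketched as follows. Everything here is hypothetical: the harness shuffles the answer choices (a mathematically irrelevant change) and checks whether the solver's answer stays stable, and the example solver has a position bias, always picking the third choice:

```python
import random

def counterfactual_check(solver, question, choices, trials=20, seed=0):
    """Ask the same question with shuffled answer choices; a solver doing
    real math should return the same answer text every time."""
    rng = random.Random(seed)
    answers = set()
    for _ in range(trials):
        shuffled = choices[:]
        rng.shuffle(shuffled)
        answers.add(solver(question, shuffled))
    return answers  # more than one distinct answer = Clever Hans signal

# Hypothetical position-biased solver: always picks the choice in slot "C".
position_biased = lambda question, choices: choices[2]

answers = counterfactual_check(position_biased, "What is 7 * 8?",
                               ["54", "56", "48", "64"])
print(len(answers) > 1)  # the "answer" changed with a cosmetic reordering
```

A solver computing 7 × 8 would return "56" on every trial; the position-biased one returns whatever lands in slot C, which is exactly the instability the test is designed to expose.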

Why It Matters for Learning and Assessment

Whether the “student” is a child in a classroom or a neural network on a benchmark, the Clever Hans effect points to the same fundamental problem: correct answers don’t always mean correct understanding. The effect is particularly insidious because it’s invisible to the person giving the cues. Hans’s owner wasn’t trying to cheat. Teachers who nod at the right moment aren’t trying to give answers away. AI training pipelines don’t intentionally include shortcut-friendly artifacts. The cues emerge naturally and go unnoticed unless someone specifically designs a test to catch them.

For math specifically, this has practical consequences. A student who has learned to read social cues rather than solve equations will hit a wall when the math becomes complex enough that cue-reading can’t keep up. An AI system that exploits formatting patterns will produce confident, wrong answers when deployed on real problems. In both cases, the fix is the same principle Pfungst used in 1907: remove the cues and see if the ability survives. For students, that means independent problem-solving with unfamiliar formats. For AI, it means testing on data that strips away the statistical shortcuts the model might lean on.