Ask Claude how to steer your sales strategy with five new clients. The answer will be structured and well-argued, with credible recommendations. Every so often, you tell it what's happening and you ask whether you should change course. Generally, if things were going well before, it'll tell you to change nothing.
This tendency has a name in cognitive science research: the absence of directed exploration. Directed exploration is the ability to deliberately try an uncertain option in order to reduce your uncertainty about it, and it's probably one of the skills that makes human genius what it is. At a moment when AI agents are making more and more decisions on our behalf, it's important that you understand the fundamental bias of an AI when it faces uncertainty.
Why adapting to uncertainty becomes critical in 2026
The use of AI agents to steer trade-offs exploded in 2025-2026: allocating an advertising budget, prioritizing support tickets, managing a sales pipeline, choosing suppliers, automatically shortlisting candidates in an ATS. AI no longer replaces only administrative tasks; it also makes micro-decisions. It ingests your data and recommends a decision. It looks sure of itself, and after all, it's your data that it ingested, so why not trust it?
Because there's an asymmetry that most product teams never put into words. LLMs were trained on frozen data. The majority of their reasoning benchmarks cover stable, well-defined situations. What they can't do reliably is handle an environment that drifts. A market that turns. A client whose needs evolve. A team whose strengths transform.
Marcel Binz and Eric Schulz, of the Max Planck Institute, were the first to show this rigorously. In a paper published in PNAS in 2023, they put GPT-3 through a battery of canonical cognitive-psychology tasks. Protocols validated on humans for fifty years. The verdict: on tasks in a stable environment, GPT-3 matches or beats humans. But it shows no trace of directed exploration. And small perturbations throw it off spectacularly.
Three years later, at Smarter Than AI, we re-confirmed the gap even though the models have improved considerably.
What AI knows, and what it doesn't do
The most complete study to date comes from a Toronto team (Zhang, Wang, Chen, Mansur, Sarhangian), published in May 2025. They compared GPT-4, Gemini 1.5, DeepSeek-V3 and human participants on a task well known to cognitive science. A setup of several slot machines, where each one pays out an uncertain reward. The player decides each turn which machine to pull. Two versions of the test: a stationary one (each machine has a fixed quality), and a non-stationary one (the quality changes continuously).
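To make the two versions concrete, here is a minimal Python sketch of such an environment. The class name `SlotMachines`, the Gaussian payouts and the `drift_sd` parameter are illustrative assumptions, not the protocol used by Zhang et al. or by our test.

```python
import random


class SlotMachines:
    """Hypothetical slot-machine environment, for illustration only."""

    def __init__(self, means, drift_sd=0.0, noise_sd=1.0, seed=0):
        self.means = list(means)   # current quality of each machine
        self.drift_sd = drift_sd   # 0.0 = stationary, > 0 = non-stationary
        self.noise_sd = noise_sd   # spread of a single payout
        self.rng = random.Random(seed)

    def pull(self, arm):
        # The payout is uncertain: a noisy draw around the machine's quality.
        reward = self.rng.gauss(self.means[arm], self.noise_sd)
        # Non-stationary version: every machine's quality takes a small
        # random step after each turn (an assumed drift model).
        if self.drift_sd > 0:
            self.means = [m + self.rng.gauss(0.0, self.drift_sd)
                          for m in self.means]
        return reward


stationary = SlotMachines([0.3, 0.5, 0.7])
drifting = SlotMachines([0.3, 0.5, 0.7], drift_sd=0.1)
for turn in range(100):
    stationary.pull(turn % 3)
    drifting.pull(turn % 3)
```

After the loop, `stationary.means` is unchanged while `drifting.means` has wandered away from its starting values, so a ranking learned early may no longer hold.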
That's the test we replicated. We can't publicly reveal exactly what the change is, because candidates who read this blog would gain a real advantage.
The most important point is that our result reproduces Binz & Schulz, in finer detail.
In a stationary environment, GPT-4 with explicit reasoning ("thinking" mode) reaches human level. The model explores, learns, and does no worse than a near-optimal strategy.
In a non-stationary environment, the picture collapses. The models, even in thinking mode, don't reach human level. The authors conclude that LLMs struggle to match human adaptability when the environment changes. We humans understand uncertainty in a finer way than LLMs do.
Two other studies in 2025-2026 confirmed how robust the finding is. One team had GPT-4, Gemini 1.5 and DeepSeek-V3 play against an opponent who changes their strategy. The models over-commit prematurely, and freeze quickly (Adversarial Testing, May 2025). More recent work shows a near-mechanical rigidity of LLMs on the multi-armed bandit (the scientific name for the slot-machine test): an early lock-in on one option, and a transformation of random noise into a persistent bias (Rigidity in LLM Bandits, early 2026).
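The lock-in described above can be reproduced in a toy setting. The sketch below is an illustration under strong assumptions, not the setup of either study: payouts are noiseless, arm 1 quietly improves at turn 50, and `bonus_weight` is a hypothetical knob that adds an uncertainty bonus growing with the time since an option was last tried (one simple way to model directed exploration).

```python
import math


def payout(arm, turn, change_at=50):
    # Hypothetical noiseless payouts: arm 0 is steady,
    # arm 1 starts worse and becomes better at `change_at`.
    if arm == 0:
        return 0.6
    return 0.5 if turn < change_at else 0.9


def run(bonus_weight, turns=200):
    last_reward = [0.0, 0.0]   # most recent payout seen on each arm
    last_pulled = [0, 0]       # turn at which each arm was last tried
    choices, total = [], 0.0
    for turn in range(turns):
        if turn < 2:
            arm = turn         # try each arm once to start
        else:
            # Score = last observed payout + bonus for staleness.
            scores = [
                last_reward[a]
                + bonus_weight * math.sqrt(turn - last_pulled[a])
                for a in (0, 1)
            ]
            arm = 0 if scores[0] >= scores[1] else 1
        reward = payout(arm, turn)
        last_reward[arm] = reward
        last_pulled[arm] = turn
        choices.append(arm)
        total += reward
    return choices, total


greedy_choices, greedy_total = run(bonus_weight=0.0)
directed_choices, directed_total = run(bonus_weight=0.05)
```

With `bonus_weight=0.0` the agent never revisits the option it wrote off, so it never learns that the option improved. With a small positive bonus it probes periodically, switches shortly after the change, and ends with a higher total.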
The pattern is the same everywhere. AI loves what has worked and thinks it'll keep working forever. AI would have poured all its gold into Nokia and Kodak.
The use case: your best strategist is the one who knows when to ignore AI
Now picture the scene. Your marketing manager walks up to a dashboard showing that LinkedIn Ads, their best channel for eight months, just doubled its CPM. TikTok, which they'd written off, became profitable overnight. They have twenty minutes to propose a reallocation. They open ChatGPT and describe the situation.
With a non-negligible probability, the model will suggest sticking with the current setup. The recommendation is cautious but doesn't value exploration. That decision is costly, because it ignores the signal the marketer has right in front of them: the channel they knew is eroding while the other takes off. They might not see it yet, but they could sense it. The AI doesn't see it, and can't see it.
Now let's go back to the moment you hired your marketing manager. You assessed them in an interview. The vocabulary was probably on point and they had solid professional experience. But you may have missed a profile who could read the numbers in a way AI can't read them. And at a time when more and more of the analysis is delegated to AI, this skill becomes crucial for certain roles.
How do you tell them apart?
| Without the right assessment | With the right assessment |
|---|---|
| You test in the interview with a static case study. Both candidates answer well. | You put the candidate in front of an environment that changes while they decide. You measure how they adjust their choices when the context flips. |
| You hand them the acquisition budget. It'll take you six months to discover they stuck with the day-1 strategy even though the market changed. | You measure in fifteen minutes their ability to explore an uncertain option when the exploited option loses performance. |
| The gains never made are hard to estimate: you get suboptimal performance without knowing it. | You hire them with full awareness, with their coaching areas identified from the first month. |
How we assess this skill
The literature on exploration-exploitation as a test of human-AI complementarity is mature. The paradigms used have been cognitive-science standards since Wilson et al. 2014. They're calibrated on decades of human data. Our approach rests on two foundations.
First, we start from a proven paradigm. The Slot Machines test is adapted from the protocol published by Daw et al. in 2006 and recently replayed by Zhang et al. in 2025. It puts the candidate in front of a sequential-decision environment where the quality of the options changes continuously, and where the change does follow a human rule, but one that's nearly impossible to guess. The format has been validated on dozens of configurations in academic research.
Second, we confront this paradigm with recent models. Before a test is presented to a human candidate, we replay it against Claude, GPT and Gemini under documented and controlled conditions. It's the only way to know where the human-AI gap appears on the models that recruiters and their candidates have on hand. The benchmarks for Slot Machines are currently being measured (calibration May 2026).
Our test doesn't claim to be enough to measure a candidate's overall performance. You could rely on a personality test to spot stable profiles, or run case studies in the interview. These tools keep their value. We propose adding an essential building block that lets you test your candidate's added value compared with an AI-generated analysis.
Going further
Want to try this test?
You can also integrate this test into your next hire: Create a campaign →
On a related theme
The other skill where AI falls apart the moment it leaves theory behind: sales negotiation. It recites and talks like a shark but acts like a rabbit. Read our article on the Sales negotiation skill.
Frequently asked questions
Why do LLMs struggle in an unstable environment?
Because they were pre-trained on frozen data. They haven't developed a native mechanism to detect that a signal has changed. When you feed them new data that contradicts the story they've built, they tend to fold in the novelty half-heartedly rather than revise their belief. That's what Binz and Schulz described in 2023 as the absence of directed exploration in GPT-3. The phenomenon persists in recent models in a non-stationary environment.
Does the "thinking" mode of recent models increase their exploration?
Only partly. The 2025 study by Zhang et al. shows that turning on explicit reasoning, meaning explicitly asking the models to justify their answers, brings LLMs closer to human behavior in a stable environment (efficient exploitation). But in an unstable environment, thinking isn't enough to close the human-AI gap. The model reasons better about what it's doing, but it keeps downplaying the value of exploration.
Which roles are affected as a priority?
Any role that involves sequential decisions in a changing environment. Top of the list: marketer, trader, buyer, account manager, product manager, strategist, consultant. At the opposite end, a role with a stable process and fixed rules calls on this skill very little.
Is the assessment robust? What happens if a candidate who took the assessment talks with another candidate who hasn't taken it yet?
We designed a set of sequences, each tested and mathematically validated, that behave randomly without the absurd edge cases that pure randomness can produce. Each session randomly draws a sequence, and we never reveal the number of the sequence drawn. No winning path can be learned in advance or shared between candidates. The underlying paradigm is known to the specialized academic community but almost unknown to the general public, and the precise environment each candidate sees is unique.
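As an illustration only, and not our actual procedure, one generic way to get sequences that are random but not absurd is rejection filtering: generate candidates from seeds, discard those that fail a sanity check, then draw one per session. Every name, threshold and the drift model below is a hypothetical assumption.

```python
import random


def make_sequence(seed, arms=4, turns=100, drift_sd=3.0):
    """Generate one drifting quality path per arm (assumed drift model)."""
    rng = random.Random(seed)
    means = [rng.uniform(30, 70) for _ in range(arms)]
    path = []
    for _ in range(turns):
        path.append(list(means))
        # Each arm takes a random step, clamped to the 0-100 range.
        means = [min(100.0, max(0.0, m + rng.gauss(0.0, drift_sd)))
                 for m in means]
    return path


def best_arm_switches(path):
    # Count how many times the identity of the best arm changes.
    best = [row.index(max(row)) for row in path]
    return sum(1 for a, b in zip(best, best[1:]) if a != b)


def is_usable(path, min_switches=1, max_switches=30):
    # Reject edge cases: a best arm that never changes, or one that
    # flips so often that no strategy could track it.
    return min_switches <= best_arm_switches(path) <= max_switches


def build_pool(size, first_seed=0):
    pool, seed = [], first_seed
    while len(pool) < size:
        path = make_sequence(seed)
        if is_usable(path):
            pool.append(path)
        seed += 1
    return pool


def draw_for_session(pool, rng=None):
    # SystemRandom keeps the draw unpredictable from one session to the next.
    rng = rng or random.SystemRandom()
    return rng.choice(pool)


pool = build_pool(5)
session_sequence = draw_for_session(pool)
```

The filter here checks a single property (how often the best option changes); a real validation would likely check more.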
Can someone cheat on your test by using AI on another device?
Using AI, whatever the model, will sharply degrade your performance.
Sources
- Binz, M. & Schulz, E. (2023). Using cognitive psychology to understand GPT-3. PNAS, 120(6), e2218523120.
- Zhang, Y., Wang, X., Chen, S., Mansur, R., Sarhangian, V. (2025). Comparing Exploration-Exploitation Strategies of LLMs and Humans. arXiv:2505.09901.
- Wilson, R. C., Geana, A., White, J. M., Ludvig, E. A., Cohen, J. D. (2014). Humans use directed and random exploration to solve the explore-exploit dilemma. Journal of Experimental Psychology: General, 143(6), 2074–2081.
- Daw, N. D., O'Doherty, J. P., Dayan, P., Seymour, B., Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441, 876–879.
- Adversarial Testing in Cognitive Tasks (2025). arXiv:2505.13195.
- Google DeepMind. (2025). LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities. arXiv:2504.16078.
