Finding rules from a handful of examples: the skill that still resists AI

AI is famous for being extremely good at logic, yet it fails certain logic tests. Why?

In 1936, a Briton, John C. Raven, wanted a fast way to measure his patients' mental abilities without running into cultural barriers. So he built a set of visual tests that turned out to be remarkably good at measuring people's eductive ability: the knack for grasping relationships, making sense of disorder, or deducing rules. These tests became Raven's Progressive Matrices. His first client was... the British army. In 1942, in the thick of the Second World War, every soldier sat this 20-minute test. It let the army figure out very quickly who was cut out for technical tasks, and who had the potential to become an officer. A manager, we'd say today.

More recently, a Frenchman, François Chollet, adapted the idea for AI. His test is called ARC-AGI (the Abstraction and Reasoning Corpus), and it draws on Raven's matrices to pose puzzles that AI struggles to pass.

It measures a skill you call on all the time, whenever you run into a problem: your ability to spot a complicated pattern that repeats.

Here's one puzzle, in its original form. Two pairs show you a transformation. It's up to you to guess what the third grid becomes.

[Interactive puzzle: Test · Abstraction · Easy · #ARC-6150A2BD]
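For readers who want to see what such a puzzle looks like as data: in the public ARC-AGI repository, each task is a JSON file with a "train" list of demonstration pairs and a "test" list, and every grid is a list of rows of integers from 0 to 9 (one integer per color). The sketch below uses an invented toy task, not the actual content of task 6150a2bd. The helper name fits_all_examples and the candidate rule mirror_left_right are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]

# Toy task in the ARC JSON layout ("train" / "test" lists of grid pairs).
# The grids are invented for illustration; they are NOT task 6150a2bd.
toy_task: Dict[str, list] = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 3]], "output": [[0, 0], [3, 2]]},
    ],
    "test": [{"input": [[4, 0], [0, 5]]}],
}


def mirror_left_right(grid: Grid) -> Grid:
    """Candidate rule (a guess): reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def fits_all_examples(rule: Callable[[Grid], Grid], task: Dict[str, list]) -> bool:
    """True if the rule reproduces the output of every demonstration pair."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


# Only apply the guessed rule to the test grid if it explains all the examples.
if fits_all_examples(mirror_left_right, toy_task):
    prediction = mirror_left_right(toy_task["test"][0]["input"])
    print(prediction)  # [[0, 4], [5, 0]]
```

The point of the format is that nothing names the rule: the solver, human or machine, only ever sees the pairs and has to come up with the transformation.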

Why does abstraction ability become critical in 2026?

Abstraction ability is inferring a new rule from a few examples:

Birds sing loudly in the morning
A client who's slow to reply isn't very interested
When your partner goes quiet, something's bothering them

You scrape together scraps of information here and there, and you turn them into rules. And in 2026, this skill changes everything — because AI already does all the rest.

Writing up a report, summarizing a file, coding something that's been done a thousand times before: the models do it, often better than you. What's left to the human is the zone where there's no model to copy yet. Filling in a pre-formatted spreadsheet, AI can handle. Adding the column that's missing, far less so.

This is the thesis François Chollet — AI researcher and creator of the Keras library — defended in his 2019 paper On the Measure of Intelligence. In his view, intelligence isn't measured by performance on known tasks, but by how efficiently you acquire a new skill when facing the unknown. A model that has read everything isn't intelligent by that standard. It's well-read. The distinction is subtle!

What the models can do, and what resists them

On abstraction from a handful of examples, the models long posted a score close to zero where humans succeeded effortlessly. ARC-AGI was published in 2019. For nearly five years, classic language models stayed stuck below 5%, while a human panel solved the vast majority of the puzzles.

The contrast is documented. In the H-ARC study (Solim LeGris and colleagues, New York University, 2024), 1,729 people took the test. On average, they solved 64.2% of the puzzles in the public evaluation set, and 790 of the 800 tasks were solved by at least one person. The test is easy for us, hard for the machine.

In late 2024, one model broke through that ceiling. OpenAI's o3 system reached 87.5% on the test, against a few percent for its predecessors. The catch is in the bill. According to a tally by engineer Simon Willison, it took nearly $6,700 of compute for 400 puzzles in the model's most frugal configuration, and on the order of a million dollars for the most powerful configuration, which consumes 172 times more. Above all, o3 doesn't guess the rule at a glance: according to the ARC Prize analysis, it appears to work by program search, exploring thousands of possible lines of reasoning until it finds one that fits. A kind of brute force disguised as intuition.
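To make "program search" concrete, here is a deliberately tiny sketch: enumerate short chains of grid operations and keep the first chain that reproduces every example pair. It is a toy brute-force enumeration over four hand-picked primitives, written for illustration only; it is not o3's actual mechanism, whose details are not public. The names PRIMITIVES, run and search, as well as the example grids, are assumptions.

```python
from itertools import product
from typing import Dict, Callable, List, Optional, Tuple

Grid = List[List[int]]


def identity(g: Grid) -> Grid:
    return [row[:] for row in g]


def flip_h(g: Grid) -> Grid:
    """Mirror left-right."""
    return [row[::-1] for row in g]


def flip_v(g: Grid) -> Grid:
    """Mirror top-bottom."""
    return [row[:] for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


# Hypothetical mini-language: four primitives, chosen by hand.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
}


def run(program: Tuple[str, ...], grid: Grid) -> Grid:
    """Apply the named primitives one after another."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def search(pairs: List[Tuple[Grid, Grid]], max_depth: int = 2) -> Tuple[Optional[Tuple[str, ...]], int]:
    """Try every chain up to max_depth; return the first that fits, plus the number tried."""
    tried = 0
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            tried += 1
            if all(run(program, x) == y for x, y in pairs):
                return program, tried
    return None, tried


# Invented example pairs: the hidden rule is a 180-degree rotation.
examples = [
    ([[1, 2], [3, 4]], [[4, 3], [2, 1]]),
    ([[5, 0], [0, 0]], [[0, 0], [0, 5]]),
]
program, tried = search(examples)
print(program, tried)
```

With four primitives and depth 2 there are only 20 candidate chains. The number of chains grows exponentially with the depth and the size of the vocabulary, which gives an intuition for why a search-based approach can get expensive quickly.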

And what followed confirmed the pattern. In 2025, Chollet and his team published ARC-AGI-2, a version designed to stay easy for humans and hard for machines. At its release, humans solved 75% of the puzzles on average, in 2 to 3 minutes each, while the most advanced systems dropped back to a few percent. We're now on a third version, harder still for machines. Our test, for its part, builds on the original corpus, the one the models stumbled over for five years. Will your candidates be able to solve what AI still struggles to intuit?

The use case: hiring for problems with no instruction manual

For an employee, the risk isn't that AI stays silent. It's that it answers even when it isn't sure. Faced with a genuinely new problem, the model reaches into its vast library for the page that looks closest, and serves it to you with confidence. When the right page doesn't exist, it makes one up, in the same assured tone. It's often right, and sometimes wrong. You won't be able to tell, because the AI will have the same confidence either way. A test on "overconfidence" is coming soon to Smarter Than AI.

Take a concrete situation. You're an IT services firm hiring a junior consultant for an assignment with a client. Part of the work is well mapped out, and AI helps with that nicely. The company is fairly standard and runs into ordinary difficulties, the kind the consultant saw during their coursework or apprenticeship. But when confronted with new situations, your consultant has no instruction manual to consult. So they'll analyze the little data they have, draw on their knowledge, read the emotions (two tests on emotional understanding are being finalized and will be available soon): in short, they'll try to understand what's going on, and to make sense of the disorder they observe. And that's where this test is critical.

Two candidates applied for the role. In the interview, both have the vocabulary down, cite the right frameworks, and have credible experience. The difference doesn't show up in conversation, but one of them can rebuild the logic of a situation they've never seen from three clues. The other copies the AI's confident answer, without noticing that the AI doesn't grasp what's happening.

Without the right assessment	With the right assessment
You ask the usual questions ("tell me about a hard problem you solved"). Both candidates answer well.	You put the candidate in front of a fresh problem, with no procedure, with only a few examples, and you watch whether they infer the rule.
You don't see them forcing a learned pattern when the situation doesn't fit it.	In twenty minutes you see whether they rebuild a brand-new logic or force a known model onto a case it doesn't fit.
You hire them with a blind spot. It's the first assignments that will reveal whether they can handle the unexpected or not.	You hire them (or not) knowing where they stand, and you target their support from the very first month.

How to assess abstraction ability, and why it's hard

The usual HR tools measure this skill poorly. A degree certifies acquired knowledge; an interview often reflects how well the candidate has rehearsed; a standard logic test rewards recognizing formats already seen. None of them puts the candidate in front of a genuinely new rule to infer from almost nothing. And yet that's THE situation where the human keeps the edge over AI: analyzing what has never been analyzed before.

Our approach starts from a published foundation. The abstraction test builds on ARC-AGI, the corpus created by Chollet and studied by an entire research community since 2019.

We extend this work by replaying the puzzles against recent models (Claude, GPT, Gemini), under documented conditions, before presenting them to a human candidate: it's the only way to know where the human–AI gap sits. We've also calibrated these exercises, originally meant to be played by AI, so that they're playable by humans. Our data dates from spring 2026 and will be rerun with each new generation of models.
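As a rough illustration of what "where the human–AI gap sits" can mean in practice, the sketch below compares solve rates puzzle by puzzle. It is only one possible way to do it, not a description of our actual pipeline. Every identifier and number in it (results, gap_report, puzzle_a, model_x and so on) is hypothetical, and it assumes each attempt is simply scored 1 for an exactly correct output grid and 0 otherwise.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical results: 1 = solved, 0 = failed. All names and numbers are invented.
results: Dict[str, dict] = {
    "puzzle_a": {"humans": [1, 1, 0, 1], "models": {"model_x": 0, "model_y": 1}},
    "puzzle_b": {"humans": [1, 1, 1, 1], "models": {"model_x": 0, "model_y": 0}},
    "puzzle_c": {"humans": [0, 1, 0, 0], "models": {"model_x": 1, "model_y": 1}},
}


def gap_report(data: Dict[str, dict]) -> List[dict]:
    """Per-puzzle human and model solve rates, sorted by human-minus-model gap."""
    rows = []
    for puzzle_id, r in data.items():
        human_rate = mean(r["humans"])
        model_rate = mean(list(r["models"].values()))
        rows.append({
            "puzzle": puzzle_id,
            "human": human_rate,
            "model": model_rate,
            "gap": human_rate - model_rate,
        })
    # Largest positive gap first: puzzles where humans keep the clearest edge.
    return sorted(rows, key=lambda row: row["gap"], reverse=True)


for row in gap_report(results):
    print(f'{row["puzzle"]}: human {row["human"]:.2f}, model {row["model"]:.2f}, gap {row["gap"]:+.2f}')
```

In this invented data, puzzle_b is the kind of item worth putting in front of a candidate (humans succeed, models fail), while puzzle_c would tell you little about the human advantage.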

The test doesn't claim to measure a candidate's general intelligence, nor to predict their success in the role. It gives a measure of one precise thing: their ability to make sense of disorder, when there is sense to be made. Personality tests, situational exercises, and interviews keep their full place alongside it.

Going further

Want to see the test from the inside?

The first option: take it yourself. A few grids to transform, the rule to guess, and a comparison of your score with that of the AI models. To keep candidates from practicing before their assessment, access to the tests is reserved for recruiters who are already subscribed. Ask to see the tests →

The second: let's talk! Book a meeting and ask all your questions. Ask your questions →

On a related theme

If this angle speaks to you, here are two other skills AI handles poorly: sales negotiation, where the models know the theory and still lose money (see our article why AI loses at negotiation), and adapting to uncertainty, where they stay frozen when the context shifts (see why AI stays frozen when the market moves).

Frequently asked questions

Is AI really incapable of abstract reasoning?

No, but with two caveats. In late 2024, the o3 model reached 87.5% on the first version of the test, but at the cost of colossal compute and, according to the ARC Prize analysis, through program search rather than intuition. And as soon as a more demanding version of the test came out, the gap with humans reopened. On genuinely new cases, the human advantage still holds as of this writing (June 2026).

Isn't a classic IQ or logic test enough?

Not for this skill. Classic tests reward format recognition and practice. Abstraction ability in the ARC-AGI sense measures something else: inferring a brand-new rule from a few examples, without being able to lean on a familiar type of exercise.

Should AI be banned from these tasks?

That's up to you, but here's our take: no, provided a human oversees it. AI stays useful everywhere the problem resembles something already seen, and that's the majority of the work. The challenge is to spot the people who know when to follow it and when to take back the wheel, at the precise moment the problem becomes new.

Sources

Chollet, F. (2019). On the Measure of Intelligence. arXiv:1911.01547.
LeGris, S., Vong, W. K., Lake, B. M., & Gureckis, T. M. (2024). H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark. arXiv:2409.01374.
Chollet, F., et al. (2025). ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems. arXiv:2505.11831.
ARC Prize (2024). OpenAI o3 Breakthrough High Score on ARC-AGI-Pub.
Willison, S. (2024). OpenAI o3 breakthrough high score on ARC-AGI-PUB.
ARC Prize. ARC-AGI-1 (benchmark overview).