RP-2026-0001 · Vision

The hiring test is broken

Traditional coding interviews and behavioral panels were built for a world where the candidate worked alone. That world ended.

Published: May 23, 2026
Reading time: 8 min read
Author: Acta Research

For thirty years, the structure of a hiring interview was a deal between you and the candidate: leave your tools at the door, and we will measure you. The whiteboard interview said so explicitly. The take-home was a softer version of the same contract: write this code yourself, and we will treat it as evidence. Behavioral panels traded the deal sideways: tell us a story about you, alone, doing the thing.

That contract is broken.

Not because candidates have started cheating. Because the work itself has changed. The job they are interviewing for is no longer the same job the interview measures. The deliverable a Q3 analyst now ships is a document the candidate could not have produced alone in 2022 (and could not produce alone in 2026 either). The interview is auditioning for a role that no longer exists.

McKinsey’s 2025 State of AI survey found 88% of knowledge-worker respondents now use generative AI in at least one workflow each week, up from 33% two years earlier.

What the old test actually measured

The whiteboard interview was a good proxy for one thing: whether the candidate could hold a complex problem in their head and reason about it without external scaffolding. That trait still matters. But it was always a proxy. The signal a hiring manager wanted was not "can you write a balanced BST on a wall in twenty minutes." It was "can you ship the kind of work this team needs to ship." The wall was a stand-in for a workstation. The marker was a stand-in for an IDE. The forty-minute clock was a stand-in for a sprint.

When the workstation was the IDE and the IDE was a text editor, the proxy worked. When the workstation now includes a colleague-in-a-text-box who completes your work in 4-second turns, the proxy collapses. You are now measuring whether the candidate can hold a complex problem in their head without their colleague. And then hiring them for a job where that colleague never goes home.

The same collapse holds for the analyst's take-home, the PM's spec exercise, the marketer's brief. The proxies were good enough for the world they were built for. They are now testing the absence of the most consequential tool in the candidate's actual workday.

The "no-AI interview" is a category error

The standard response, when an interview proxy stops predicting performance, has historically been to tighten the proxy. Forbid Google. Forbid Stack Overflow. Forbid the calculator. Each one earned its ban. Each one is now ridiculous in retrospect: the proxy was tightened against a tool, not against the work the tool let you do.

The 2024–2025 wave of "AI-free" interview tooling (IDE lockdown, screen pinning, AI-watermark detectors) is the same move. It is an attempt to preserve a proxy by banning the tool that broke it, rather than asking what work the tool now makes possible. It is not measurement. It is theater that produces the comforting illusion that the test still measures what it once measured.

The literature on hiring validity has been telling us this since before LLMs were called LLMs: a structured task that resembles the actual job is a stronger predictor of performance than a structured task that resembles the interview tradition. The job has changed. The interview has not.

What changed in 2023 (and why hiring did not catch up)

Three things happened in fast succession.

First, AI assistance became universal in knowledge work. McKinsey's State of AI series shows a step-function adoption curve between Q2 2023 and Q1 2025, with the share of knowledge workers using AI in at least one weekly workflow rising from a third to roughly seven in eight.

Second, the productivity dispersion across workers using AI widened. The same study found that the variance in output quality increased as raw adoption rose. Some workers got dramatically better; others got measurably worse.

Third, the productivity gain itself turned out to be sharply user-dependent. Brynjolfsson, Li & Raymond's field experiment in customer-support work found a 15% average productivity gain, concentrated almost entirely in the lower half of the skill distribution, with senior workers showing essentially no gain and, in some sub-tasks, a regression.

The implication is uncomfortable: AI does not uniformly help. It amplifies whoever is using it, in whatever direction they were already going. A literate user with good instincts gets a force multiplier. An over-trusting user with weak instincts gets a faster way to produce confidently wrong work. The hiring test that cannot tell the two apart is now writing checks the team will cash for the next twelve months.

The new test has to be about collaboration

If the work has become a collaboration, the test has to be about the collaboration. That is not a marketing line. It is a measurement claim with specific implications.

It means the candidate has to have the tool in the room.

It means the test has to be a real workflow (the candidate's actual job in miniature), not a contrived puzzle that incidentally permits AI. A whiteboard problem with a chatbot bolted on is a whiteboard problem.

It means the work has to be genuinely demanding. Real AI is wrong sometimes, in the ways production AI is wrong: fabricated citations, off-by-a-year data, math that almost works. A test worth running puts the candidate in front of that reality. If the candidate cannot tell when to push back, the test is not measuring AI collaboration. It is measuring delegation.

And it means the measurement has to capture the decisions, not just the artifacts. What the candidate accepted, what they rejected, what they asked for again, what they let go. The deliverable is the candidate's score on a workflow. The signal is the candidate's score on the interaction.
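One way to picture "capturing the decisions" is a per-turn log kept alongside the final artifact. The sketch below is illustrative only: the `Turn` record, the `Action` labels (borrowed from the four verbs in the paragraph above), and the `summarize` helper are hypothetical names chosen for this example, not a description of any platform's actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Hypothetical labels, mirroring "accepted, rejected, asked for again, let go".
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    ASKED_AGAIN = "asked_again"
    LET_GO = "let_go"


@dataclass(frozen=True)
class Turn:
    index: int          # position of the AI turn in the session
    action: Action      # what the candidate did with that turn's output
    note: str = ""      # optional free-text reason, if one was recorded


def summarize(turns):
    """Count how often each action occurred across a session."""
    counts = Counter(t.action for t in turns)
    # Report every action, including those that never occurred.
    return {a.value: counts.get(a, 0) for a in Action}


# A made-up four-turn session, for illustration only.
session = [
    Turn(0, Action.ACCEPTED),
    Turn(1, Action.REJECTED, "citation could not be found"),
    Turn(2, Action.ASKED_AGAIN),
    Turn(3, Action.ACCEPTED),
]
print(summarize(session))
```

The point of the structure is that the artifact alone cannot be reconstructed into this log after the fact; the decisions have to be recorded as they happen.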

Hire for AI collaboration, not AI avoidance.

The framing Acta hires for · Vision · 2026

What good tests of AI literacy actually look like

We will be the first to admit the literature here is younger than the problem. The AICOS (AI Competency Objective Scale) is the most validated published instrument for AI literacy as of this writing, with documented construct validity against MAILS and convergent validity against task performance. Acta's scoring rubric is anchored against AICOS by design, not because the scale is perfect, but because anchoring against a published instrument is the only way to talk about validity at all.

The shape of a good AI-literacy test, distilled:

  • Realistic. The task is a workflow the team actually runs.
  • Tool-positive. The candidate brings the tool, and the work surfaces the kinds of mistakes production AI makes.
  • Decision-rich. What the candidate accepts, rejects, and questions is captured turn-by-turn, not just at the artifact level.
  • Calibrated. The trust composite has to fall out of the data: over-trust and over-rejection have to be visible and equally accountable.
  • Auditable. Every claim about predictive validity has to be checkable against retest, AICOS short-form, and a bias audit. If the platform cannot show you the audit hooks, the platform does not have them.

This is not a description of an Acta session by accident. It is the test we built once we accepted that the test we used to run was testing the absence of the job.
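To make the "Calibrated" criterion concrete, here is a minimal sketch of how over-trust and over-rejection could be computed from logged decisions and weighted equally. It assumes the test designer knows, for each AI output, whether it was actually correct. The `Decision` record, the `calibration` function, and the equal-weight composite formula are assumptions made for illustration; they are not Acta's scoring rubric and are not taken from AICOS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    ai_was_correct: bool  # ground truth, assumed known to the test designer
    accepted: bool        # whether the candidate accepted the AI output


def calibration(decisions):
    """Return over-trust rate, over-rejection rate, and an equal-weight composite."""
    wrong = [d for d in decisions if not d.ai_was_correct]
    right = [d for d in decisions if d.ai_was_correct]

    # Over-trust: share of wrong AI outputs the candidate accepted anyway.
    over_trust = sum(d.accepted for d in wrong) / len(wrong) if wrong else 0.0
    # Over-rejection: share of correct AI outputs the candidate threw away.
    over_rejection = sum(not d.accepted for d in right) / len(right) if right else 0.0

    # Hypothetical composite: both failure modes are penalized equally.
    composite = 1.0 - (over_trust + over_rejection) / 2
    return {
        "over_trust": over_trust,
        "over_rejection": over_rejection,
        "composite": composite,
    }


# Made-up session: four correct AI outputs (three accepted), two wrong (one accepted).
decisions = [
    Decision(ai_was_correct=True, accepted=True),
    Decision(ai_was_correct=True, accepted=True),
    Decision(ai_was_correct=True, accepted=True),
    Decision(ai_was_correct=True, accepted=False),
    Decision(ai_was_correct=False, accepted=True),
    Decision(ai_was_correct=False, accepted=False),
]
result = calibration(decisions)
print(result)
```

A candidate who accepts everything scores zero on over-rejection but is fully exposed on over-trust, and the reverse holds for a candidate who rejects everything. Reporting both rates next to the composite keeps either failure mode from hiding inside a single number.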

The cost of running the old test for one more year

A mis-hire in 2019 was expensive. A mis-hire in 2026 is more expensive, because the cost is now compounded by what the mis-hire produces with the tool. The hire who over-trusts a fabricated SEC reference does not produce nothing. They produce a confident, well-formatted report that goes to the board with the wrong number. By the time the wrong number is caught, the cost is the report's blast radius, not the salary.

In the next article, we look at the corollary: why the most common response to this collapse (banning AI from the interview) is the worst response.

On the methodology page, we walk through the academic work the Acta score sits on top of.

The old test ended in 2023. Most hiring teams are still running it. The cost of running it for one more year is one more year of the wrong hires looking right, on paper, while the work they will actually do remains untested.

References

  1. McKinsey & Company. The State of AI in 2025. McKinsey Global Survey, 2025.
  2. Brynjolfsson, E., Li, D., & Raymond, L. Generative AI at Work. Quarterly Journal of Economics, 140(2), 889–942, 2025.
  3. Long, D., & Magerko, B. What is AI Literacy? Competencies and Design Considerations. CHI ’20: Proceedings of the 2020 CHI Conference on Human Factors, 2020.
  4. Markus, J., Carolus, A., & Wienrich, C. Objective Measurement of AI Literacy: Development and Validation of the AI Competency Objective Scale (AICOS). arXiv:2503.12921, 2025.
  5. Carolus, A., Koch, M., Straka, S., Latoschik, M. E., & Wienrich, C. MAILS, Meta AI Literacy Scale: Development and testing of an AI literacy questionnaire. Computers in Human Behavior: Artificial Humans, 1, 100014, 2023.
  6. Bansal, G., Nushi, B., Kamar, E., Lasecki, W. S., Weld, D. S., & Horvitz, E. Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance. Proceedings of the AAAI HCOMP 2019, 2019.