Elicitation Theater in AI

Post-benchmark era has a measurement problem

May 27, 2026

Erik Hoel’s “Dumbo Could Already Fly” shows OpenAI’s Erdős result was already sitting inside public ChatGPT 5.5 if you knew how to ask. The new model didn’t produce new capability. It removed friction. Hoel names the moment: the post-benchmark era. Capability claims nobody’s bothering to replicate, dressed as scientific reports.

His diagnosis opens a question worth taking a step further. We have no measure of what an architecture can do that’s independent of how skillfully it got asked. Benchmarks score outputs. They can’t score the substrate. That’s where Dumbo’s feather actually lives: in the absence of any instrument that tells the elephant from the trick.

Anthropic’s “Mythos” cybersecurity disclosure has the same shape as the OpenAI Erdős claim: output behavior under one prompting regime, sold as new latent capability. Neither is falsifiable without independent replication, and the field’s stopped doing that work seriously. “Capability improved” means “we got better at asking, in a setting we didn’t show you.”

The cure may already exist. Mechanistic interpretability is the only program asking what circuits can architecturally do, independent of how they got elicited on the day. It’s still treated as a boutique safety subspecialty. It’s the only honest capability metric on the table. Flip that, and “Mythos” becomes falsifiable. Leave it, and every safety claim rests on the same sand as every capability claim. They’re the same problem.

The move underneath is substrate-independence. A real capability measure has to describe the system as an information-processing object: what it integrates, what it differentiates, what it dissipates, without caring what any particular output says. If the measure depends on the model’s answer, you’re measuring elicitation. If it depends on the circuit, you’re measuring capacity. The line is sharp. Nobody’s forcing the field to draw it.
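A toy sketch makes the elicitation side of that line concrete. Everything here is hypothetical and invented for illustration (`ToyModel`, `benchmark_score`, the prompts, the cue phrase); it is not how any real model or benchmark works. The point is only that a frozen system can score differently on an output-based measure when nothing but the prompt changes.

```python
# Toy illustration only: a fixed "model" whose measured score moves with the prompt.
# ToyModel, benchmark_score, and the cue phrase are hypothetical, not a real API.

class ToyModel:
    """A frozen system that can always add two integers, but only shows it
    when the prompt contains a cue it responds to."""

    def answer(self, prompt: str, a: int, b: int) -> int:
        if "step by step" in prompt:
            return a + b  # the capability surfaces
        return 0          # the capability stays hidden


def benchmark_score(model: ToyModel, prompt: str, cases: list[tuple[int, int]]) -> float:
    """Output-based score: the fraction of cases answered correctly."""
    correct = sum(model.answer(prompt, a, b) == a + b for a, b in cases)
    return correct / len(cases)


CASES = [(1, 2), (3, 4), (10, 5), (7, 7)]
model = ToyModel()  # the same object is used for both runs; nothing about it changes

naive = benchmark_score(model, "What is the sum?", CASES)
skilled = benchmark_score(model, "Work step by step. What is the sum?", CASES)

print(naive, skilled)  # prints: 0.0 1.0
```

The score jumps from 0.0 to 1.0 while the system stays identical, so the benchmark here is reporting on the asking. A capacity measure in the sense above would have to return the same value for both runs.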

This is the question I work on. The Dynamic Existence Threshold (DET) framework measures substrate capacity as an integration-differentiation balance, and the math doesn’t care what the substrate is. It’s certainly not the only answer. It’s just one that takes the demand seriously. Mech interp does this work inside neural networks. DET does it across substrates. Hoel just named the problem. The coalition’s been in the room.

Until measuring architectures is foundational instead of boutique, capability papers and safety papers are nothing but hand-waving and stage magic: a result the audience can’t tell from the apparatus. Falsifiability collapse isn’t a benchmark problem. It’s a substrate problem we keep solving with prompts.


** Cover image: AI generated **
