What We Can't Tell You Yet
A series about accepting work you can't fully verify owes you the same honesty about itself. So here is what we looked for and didn't find — and why the gaps don't dissolve the argument. They are the argument.
Synthesized from the research corpus's unusually rich 'What This Research Cannot Tell You' findings across all five domains, turned from disclaimer into subject. Every gap named here was verified as a gap — confirmed absent from the literature, not merely unlooked-for. This closer needs no contested statistic to stand; it is built from what the corpus could not find. Source-reviewed, fact-reviewed, and gap-reviewed before publication.
There is no rule for verified enough. There was never going to be.
I want to start there because it’s the floor under everything that follows, and because a series about accepting work you can’t fully verify owes you the same honesty about itself. Five pieces of this collection argued that the agent hands you legacy code at birth and that an old trade answers it. This one is the bill for that argument — the part where I tell you, plainly, what I went looking for and couldn’t find. Not to undercut the case. Because the gaps turn out not to be holes in the case. Several of them are the case.
A note before the list, because the list is the easy thing to misread. “We couldn’t find it” is not the same as “we didn’t look,” and it is not the same as “therefore the claim is shaky.” Each gap below sits next to a claim that is well-supported — and the gap is a specific, bounded thing the evidence can’t yet say about that claim. The shape that keeps recurring is: the direction is solid, the magnitude is missing. We know which way the arrow points. We can’t tell you how long it is. That distinction is the whole discipline, so I’ll hold it the whole way down.
What I told you in each piece, and what I still can’t
The surface
The first break argued that your at-a-glance read of code quality was never reading quality — it was reading the author, through signals like churn history and a familiar hand, the substrate machine authorship erases while manufacturing the clean surface for free. The mechanism is well-evidenced: fluency pulls trust, steady reliability breeds complacency, the validated review signals are metadata.
What I can’t tell you is whether the polish itself moves your judgment. The clean version of that argument needs a study that holds correctness constant, varies only the formatting and naming, and measures whether a reviewer rates the tidier code higher. Nobody has run it. The cleanest dissociation anyone can point to is a single indentation study from 1983 — forty-three years old. The modern headlines claiming reviewers’ perceptions have “decorrelated” from real quality are mostly vendor content, and they stay claims, not findings. So: the surface reads as quality because polish, for most of your career, tracked a careful author — and I cannot prove that polish alone, with the author subtracted and correctness fixed, moves the judgment. I believe it would. Belief isn’t the experiment.
The absence
The second break argued that review is near-blind to absences. You can catch the wrong line; you slide right past the missing one — the permission check never written, the guard for the case nobody put in front of you — because an absence has no surface for your eye to land on. That direction is one of the best-supported things in the whole project: a replicated cognitive mechanism (omission neglect, observed across species), a defect taxonomy that makes “missing” a top-level class, and agents with a documented lean toward exactly those omissions.
What I can’t tell you is the rate. There is no measured figure for how often a reviewer misses an absence versus a wrong line — no study that put reviewers in front of code, unprompted, and counted. I went looking; it isn’t there. And I have to guard one number especially hard, because it’s the one that would love to fill the hole: Fry and Weimer’s finding that people are “over five times more accurate” at locating extra statements than missing ones. That is a localization floor — how much harder it is to point at a missing statement once you’ve been told one exists. It is not a review-detection rate, and I will not let it become one. Review is near-blind to absences, and we still can’t tell you the rate at which it misses them. A piece about the danger of missing numbers had no business inventing its own.
The volume
The third break argued there’s a size past which review stops working by construction — attention has a ceiling, larger changes get less effective review, and the agent hands you that size on demand. The answer the trades supply is triage: decide in advance and in writing what gets read in full and what gets sampled, keyed to the cost of being wrong. The auditor’s materiality, aviation’s Required Inspection Items, the 100% examination of individually significant items — that decision structure is verbatim-verified and centuries old.
What I can’t tell you is where to draw the line for code. There is no published materiality threshold for software, no validated sampling rate, no evidence-backed rule for where the structural tier ends and the routine one begins. Audit materiality, as far as the literature shows, has never been ported to code at all. The structure transfers; the constants don’t. You set the constants yourself, for your system, and you own them — which means the most useful thing the trades gave us comes with the calibration deliberately blank.
The bench
The fourth break argued that the agent ran your code in a warm sandbox — state seeded, intent fresh, the run still in the buffer — then handed you a flat diff and threw the world away, leaving you to rebuild it on the coldest desk in the building. The asymmetry is real and the handback contract answers it: demand that the work arrive runnable, with provenance, with an auditable run-log, because the agent already built all of that and discarded it at the boundary.
What I can’t tell you is whether arriving warm makes you a better acceptor. There is no study isolating whether a runnable, warm-start handback improves what a reviewer catches over a flat diff — not in favor, and not against. It’s data-not-found in both directions. Worse: there’s a coherent argument it could hurt — a runnable, one-command-green surface is one more fluent thing inviting you to relax, and fluency is exactly what triggers the over-trust reflex the first break was about. Warm-start helping is a hypothesis I’d like someone to test. The contract stands without it, because it rests on a fact that needs no study — the artifacts existed and were thrown away — but the headline cognitive benefit is unproven and might invert.
The cut
There’s one more, and it’s the gap I find most instructive, because it looks at first like evidence and turns out to be a lesson about evidence.
The intuition the whole series leans on is that cutting the independent acceptance layer raises error — pull out the second set of eyes and more defects ship. It’s the foundation under “independence is load-bearing.” So I went looking for the clean before-and-after: cut the layer, watch error climb. The closest natural experiment anyone has run is in journalism, when newsrooms cut their copy desks — and it does not say what you’d expect. Hettinga and Smith’s content analysis of New York Times corrections around the 2017 desk cut (1,149 corrections) found the paper published more corrections before the cut than after (Hettinga & Smith, 2021). Martin and Martins, comparing five papers before and after outsourcing copyediting, found “no uniform increase” — corrections fell at one paper, rose at another, unchanged at three (Martin & Martins, 2016, via Poynter).
Read naively, that’s a refutation: cut the catchers, errors go down. Read correctly, it’s the sharpest point in this piece. “Printed corrections” counts errors that were caught and logged after publication — and the people who caught and logged them were the ones you cut. Remove the catchers and the metric goes quiet, not the errors. The dashboard you’d watch for a spike is the dashboard the layer was feeding. So whether cutting the independent layer raises actual, uncaught error is data-not-found — the only thing the literature offers is nulls on a proxy that goes silent for exactly the reason you’d most want it to be loud. A team that drops the human acceptance pass and watches its bug count for a spike may be reading precisely the wrong signal.
The gap the whole project sits in
Those five are the seeded ones — each piece left its own number on the table, on purpose. There’s a sixth I’ve left for last because it’s larger than any of them, and the honest thing is to say it sits under the entire collection.
There is no clean study comparing how senior engineers read agent-authored code against human-authored code, matched for difficulty. That’s the experiment this whole project would lean on, and it doesn’t exist. The comprehension canon I drew on — what understanding code requires, how reviewers build a mental model, where attention fails — was measured on human-written, pre-agentic code, mostly with student subjects, mostly on short programs. Every cost figure, every detection rate, every “this is how review works” finding in the corpus comes from that world. I’ve spent six pieces arguing that agent code is a different animal at the handback, and the measurements I argued from were taken before the animal existed, on a different one, by people who weren’t senior.
I’m not going to dress that up. It means the quantitative spine of this series is borrowed from an adjacent regime and pointed at a new one on the strength of mechanism and analogy, not direct measurement. The mechanisms are real and the analogies are disciplined. But the matched experiment — seniors, agent code versus human code, same difficulty, measured — is the thing the field most needs and the thing I most wanted and the thing that isn’t there.
That’s not a footnote to the argument. For a series whose first piece called itself an act of accepting agent output under uncertainty, it’s the argument running its own thesis on itself.
The thing we tried that doesn’t work
Before the turn, one honest dead end — because what failed is as load-bearing as what held, and this particular failure is the reason this project’s whole method exists.
If review slides past polished code because the polish goes down too easy, the tempting fix is to make the diff harder to read. Degrade the legibility, force slower processing, and the reasoning supposedly gets deeper. There was a famous result behind that — present material in an ugly, hard-to-read font and people reason more carefully. It’s the kind of clever lever you’d want to reach for here.
It doesn’t replicate. When the large coordinated replication effort tried it across thousands of subjects, the disfluent-font effect came back at effects indistinguishable from zero — point estimates around −0.01, −0.05, −0.03 across sites (Many Labs 2) — and a separate careful study found “no effect on solution rates” (Meyer et al.). The clever lever is a famous-but-failed finding, and reaching for it would have been exactly the mistake the methodology behind this corpus was built to catch. I’m naming it because the temptation was real and the evidence killed it: you do not fix the surface problem by degrading the surface. The structure that actually works has to surface intent — make the reasoning the agent exercised visible and checkable — not make the artifact harder to read. Re-derivation, not friction. The dead end is part of the map.
When the rules run out
So gather it up. No proven mechanism for polish fooling you. No measured rate for missed absences. No materiality threshold for code. No evidence that a warm bench sharpens your eye. No clean signal that cutting the independent layer raises real error. No matched experiment on how seniors read what the machine writes. That’s a lot of “we don’t know” for a series that spent five pieces sounding sure.
Here’s the turn, and I’m going to refuse to let it land as either a confession or a victory lap, because it’s neither. When you assemble every gap, what you’re left holding is not a weaker argument. It’s the actual one, stripped of the pretense that a rule was ever coming to save you. There is no threshold that reads the diff for you. There is no checklist that converts into certainty. The borrowed trades all knew this; it’s why they’re worth borrowing from. The auditor signs the opinion knowing the sampled ledger could hide a fraud. The inspector signs the airframe knowing the regime catches what it was routed to catch and nothing more. The attending signs the chart knowing the only signature worth anything is the one that re-performed the work. None of them is waiting for a rule that eliminates the judgment. The judgment is the job.
That’s what remains when the rules run out: a person’s risk judgment under irreducible uncertainty, owning a named decision. Not the gesture of acceptance — the substance of it. “An agent wrote it” is exactly as exculpatory as “the fabricator proposed the detail” was for the engineer who sealed the Hyatt walkway (Duncan v. Missouri Bd., 744 S.W.2d 524). Which is to say, not at all. He didn’t draft the fatal connection; he owned it, because his seal was on it. The accountability traced through the signature, never through authorship. That is the trade in one sentence, and it does not get easier or softer because the thing you’re signing for came out of a model instead of a junior’s hands.
And here is the part that keeps this from being a clean inspirational landing, because nothing about this is clean. Every one of these rituals decays. The professions admit it themselves: the strongest acceptance ritual anyone ever built, the structural engineer’s seal, was tested against volume and bent — the board first ruled it unethical to seal work you hadn’t reviewed in detail (NSPE Case 86-2), then relaxed the standard, in its own words, “to reflect actual practices” (NSPE Case 90-6), because the volume made the ideal impossible to hold. Medicine found that merely adding a supervising signature, without re-deriving the work behind it, bought no measurable reduction in error (Finn et al. 2018). Under enough volume and deadline, every ritual drifts toward the cheap signature — the name applied without the reconstruction it’s supposed to certify. How often that actually happens, in any of these trades, is itself data-not-found; the prevalence of the rubber stamp has never been cleanly measured anywhere. We know the decay exists because the professions confess it. We can’t tell you its rate either.
The work is refusing that decay. Not abolishing it — you can’t, the pressure is structural — but refusing it, transaction by transaction, the way you refuse it every time you re-derive a change instead of admiring its surface. That refusal is not a gap in the craft. It is the craft. The whole trade, across all five professions and twenty-five years and a thesis I’ve labeled interpretation from the first paragraph, comes down to a person declining to let the signature go cheap on the one diff in front of them, knowing they’ll have to decline it again on the next one, with no rule promising they’ll get it right.
I should close where the opener opened, because the two bookends are the same honesty pointed in opposite directions. The first piece named the trade and made a claim — agent output is legacy code at birth — and I told you then it was the interpretation this series sets out to defend, not a measured finding, and that I’d mark the line every time it mattered. I’ve marked it. It’s still interpretation. Six pieces of argument do not promote a claim to a fact by repeating it, and I’d be doing the exact thing I warned against if I let the volume of my own words stand in for evidence.
What I can tell you is the thing I could tell you on the first page, and it hasn’t moved. This series was written by agents and accepted, line by line, by a human who put his name on it — the thesis enacted on itself. Where the number my own argument wanted didn’t exist, I told you, plainly, six times. That wasn’t a confession and it wasn’t humility for its own sake. It was the trade, run on this artifact, in front of you. I accepted work I couldn’t fully verify, owned the decision to publish it, and marked every place I was unsure — because the alternative was to argue that you should treat agent output as legacy code at birth while quietly pretending mine was something better.
The agent hands back clean, passing, authorless code, and the relief when the diff goes green is real and means nothing. You read past it. You re-derive what the machine couldn’t hand you. You decide, in advance and in writing, what you’ll verify and what you’ll let go. And then — knowing there’s no rule for enough, knowing the warm bench might not help, knowing the matched experiment doesn’t exist, knowing the metric goes quiet right where you’d want it loud — you sign. Not because you read it all. Because the name on the line is yours, it was always going to be yours, and the whole of the trade is in signing anyway, on purpose, with your eyes open.
This is the last piece. The frame it closes is Legacy Code at Birth — the opener that named the trade and made the claim this series spent six pieces defending and never once promoted to a fact. The opener names the trade; this closer admits its limits. Same honesty, opposite ends of the bench.