You Can't Read It All

There's a size past which review stops working by construction, and the agent hands you that size on demand. So stop pretending you'll read it all. Decide in advance, and in writing, what gets read deeply and what gets sampled — keyed to the cost of being wrong, not the length of the diff.

EP 4/6 · June 12, 2026 · ~13 min read


Synthesized from the research corpus's Verification Economics domain (inspection-history, review-at-scale, and triage-strategies lenses) and the Acceptance Rituals inspection-and-audit lens. The borrowed masters — audit materiality and aviation's Required Inspection Items — are described as they actually operate; the exact line-count thresholds carry their single-vendor flag, and the calibration for code is named as data-not-found. Source-reviewed, fact-reviewed, and gap-reviewed before publication.

There’s a number of lines past which review stops being thorough and starts being theater. You already know this in your gut — you’ve felt attention go slack somewhere in the third file of a large pull request, the eyes still moving while the comprehension quietly clocked out. The agent doesn’t have that ceiling. It will hand you eight hundred lines across nine files as easily as eight, and it will do it again an hour later. So the honest question at handback isn’t how do I read all of this carefully. It’s which parts am I going to read carefully, and which parts am I deciding, on purpose, not to.

That decision is the whole piece. And the trades that solved it long ago made it the same way every time: in advance, in writing, before the work showed up — keyed to what it would cost to be wrong, never to how big the thing was.

A worn brass-and-steel balance scale on a graphite bench, one pan loaded low and heavy with weights and the other empty and high, lit by a single cool inspection lamp — weighing consequence, not size.

The auditor never tried to read it all

Materiality isn't a size cutoff. The scale weighs consequence — would getting this wrong change what someone does about it.

Start with the profession that faces your exact problem at the largest scale and is the least sentimental about it. An auditor signing off on a public company’s financials cannot re-examine every transaction — there are millions — and nobody pretends otherwise. So the audit is built, from the first page, around not looking at everything. The discipline isn’t in the looking. It’s in the deciding-what-to-look-at, and that decision is made before the fieldwork starts and written down.

The word for the threshold is materiality, and the thing worth stealing is what it’s keyed to. Materiality isn’t a size cutoff. It’s a decision-impact test: a misstatement is material if there is “a substantial likelihood that the … fact would have been viewed by the reasonable investor as having significantly altered the ‘total mix’ of information made available” (PCAOB AS 2105.02). The auditor asks not “how big is this number?” but “would getting it wrong change what someone does about it?” And the threshold is tiered: sensitive accounts get lower thresholds, because the consequence of an error there is higher.

What the audit does next is the move I want you to take to the desk. It does not sample uniformly. Items that are individually significant don’t get a representative draw — they get full scrutiny (PCAOB AS 2315.21):

The auditor should examine those items for which, in his judgment, acceptance of some sampling risk is not justified. For example, these may include items for which potential misstatements could individually equal or exceed the tolerable misstatement.

Examine those items — not sample them. A hundred percent, no draw at all, for the things where being wrong would individually breach what you can tolerate. And the areas where the auditor expects trouble — fraud-prone domains — don’t get a denser random sample. They get targeted procedures aimed at the specific way that domain goes wrong, like journal-entry testing built “to test the appropriateness of journal entries recorded in the general ledger” (PCAOB AS 2401.58). The auditor’s answer to “what might be wrong in here?” is never “look harder at the whole pile.” It’s “decide in advance which categories of wrong would matter most, and aim directly at those.”

And — this is the part that gives the whole approach its spine — the profession names what it’s giving up. Sampling has a stated failure mode, written into the standard with no euphemism: the risk of incorrect acceptance, defined as “the risk that the sample supports the conclusion that the recorded account balance is not materially misstated when it is materially misstated” (PCAOB AS 2315.12). The auditor doesn’t pretend the sampled portion is verified. The auditor accepts a known, named, bounded risk on it — on purpose, in writing — so that the full attention goes where the cost of being wrong is highest.
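To make “known, named, bounded” concrete, here is a minimal sketch of one way to put a number on a spot-check. It is not the audit standard’s own procedure. It assumes simple random sampling without replacement and asks only how likely the sample is to contain none of the bad items. The function name and the figures (200 routine hunks, 2 of them wrong, 20 spot-checked) are hypothetical.

```python
from math import comb


def miss_probability(population: int, bad: int, sample: int) -> float:
    """Chance that a random sample, drawn without replacement, contains zero bad items.

    Hypergeometric case: C(population - bad, sample) / C(population, sample).
    """
    if not 0 <= bad <= population or not 0 <= sample <= population:
        raise ValueError("bad and sample must be between 0 and population")
    return comb(population - bad, sample) / comb(population, sample)


# Hypothetical figures: 200 routine hunks, 2 of them wrong, 20 spot-checked.
risk = miss_probability(population=200, bad=2, sample=20)
print(f"chance the spot-check sees neither bad hunk: {risk:.1%}")
```

With these made-up figures the chance of seeing neither bad hunk comes out at about 81 percent. A ten percent spot-check is a real control, but it is nowhere near verification, which is the reason the standard gives the residual risk a name.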

That is the entire technique. Decide what’s significant before you start. Examine the significant fully. Sample the rest, and say out loud that you’re sampling it. Triage isn’t the auditor cutting corners. Triage is the craft.

A worn heavy stamped-steel aircraft inspection hang-tag on a short wire loop resting on a graphite bench, its single struck inspection roundel catching the one warm bronze note, lit by a cool inspection lamp.

Aviation runs the same logic on a single question

The Required Inspection Item. The tier is assigned by consequence — what happens if it's wrong — decided up front and written into the program.

The auditor keys the tiers to financial consequence. Aviation keys them to a sharper one — whether the work, done wrong, kills people — and the structure it built around that question is the cleanest template I’ve found for a handback.

A commercial maintenance program sorts work into tiers, and the top tier has a name: Required Inspection Items. An RII is any task that, in the regulation’s words, covers work that could result in “a failure, malfunction, or defect endangering the safe operation of the aircraft, if not performed properly” (14 CFR §121.369(b)(2)). That’s the whole test. Not how long the job takes, not how many parts it touches — what happens if it’s wrong. A two-bolt control linkage can be an RII. A panel replacement may not be. The tier is assigned by consequence, decided up front, and written into the program.

What sits on top of an RII is exactly the property the agentic handback destroys. The work and its inspection cannot be the same hands:

No person may perform a required inspection if he performed the item of work required to be inspected.

The one who did the work cannot be the one who signs that it’s right (14 CFR §121.371(c)). And the inspector who signs has a protected veto — the airline cannot route around a “no,” the way a developer can dismiss a reviewer’s comment and merge anyway. Independence here isn’t a courtesy. It’s load-bearing structure.

Aviation also did something the binary “critical or not” framing usually misses, and it’s the part most worth porting: it built a middle tier. Beyond the RII list and the routine work, there’s a category for tasks that aren’t life-or-death but have a track record of going wrong — the Maintenance Verification standard’s bucket for “tasks that have over time proven prone to human error but do not rise to the level of a required inspection item” (ATA Spec 108, via FAA Safety). Those get a lighter second look — a second signature — not the full RII treatment. Three tiers, not two: full re-inspection for the consequential, a lighter check for the error-prone, and a sign-off for the routine. The depth of acceptance is matched to the risk, and matched in advance.
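The three tiers and the independence rule can be sketched as a small data model. This is an illustration under stated assumptions, not a rendering of any airline’s maintenance program: the field names, the two boolean tests, and the example tasks are all hypothetical, and the independence check is applied only to the top tier because that is the only place the regulation quoted above requires it.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    REQUIRED_INSPECTION = "full re-inspection by someone other than the performer"
    VERIFICATION = "lighter second look with a second signature"
    ROUTINE = "sign-off"


@dataclass(frozen=True)
class Task:
    name: str
    performed_by: str
    endangers_safety_if_wrong: bool  # consequence test (hypothetical field)
    known_error_prone: bool          # track-record test (hypothetical field)


def assign_tier(task: Task) -> Tier:
    """Tier follows consequence first, then track record. Job size is never consulted."""
    if task.endangers_safety_if_wrong:
        return Tier.REQUIRED_INSPECTION
    if task.known_error_prone:
        return Tier.VERIFICATION
    return Tier.ROUTINE


def may_inspect(task: Task, inspector: str) -> bool:
    """Whoever performed a required-inspection item may not inspect it."""
    if assign_tier(task) is Tier.REQUIRED_INSPECTION:
        return inspector != task.performed_by
    return True


linkage = Task("two-bolt control linkage", "alice", True, False)
panel = Task("panel replacement", "alice", False, False)
print(assign_tier(linkage).name, may_inspect(linkage, "alice"))
print(assign_tier(panel).name, may_inspect(panel, "alice"))
```

The design point is that `assign_tier` takes no size argument at all. There is no way to pass it the length of the job, so there is no way for length to influence the answer.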

Where the tiering breaks is the line itself

Here’s the warning that comes attached, and it’s the one I keep closest, because it’s the failure mode a senior under deadline is built to commit.

The aviation system, designed exactly right, has failed — and when it failed, it usually wasn’t because the inspection was sloppy. It was because the work never got routed to the inspection that already existed. The case most often pointed to is Continental Express flight 2574 in 1991: the NTSB account describes maintenance on the horizontal stabilizer where the carrier “maintained that the deice boot/leading edge assembly was a ‘non-structural’ item, and therefore not subject to the more rigorous inspection requirements” (AVM, “Shift Change Tragedy”). (I’ll flag that this account rests on reconstructed, image-only primary documents — I’m using it as an illustration of the failure mode, not as load-bearing proof of a number.) The inspection regime was intact. The plane still came apart in flight. What broke was the classification step — the decision about which tier the work belonged in.

That is the whole risk of triage, and naming it is what keeps triage honest. The danger isn’t that you’ll sample the cosmetic tier too lightly. It’s that you’ll mis-file a consequential change as cosmetic — that the eight-hundred-line diff will be ninety-five percent formatting and you’ll wave the whole thing through, never noticing the one buried hunk that changes who’s allowed to do what. A one-line permission change is structural-tier. A five-hundred-line reformat is not. Diff size tells you almost nothing about which tier a change belongs in, and the instinct to read by size is exactly the instinct that mis-classifies.
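One way to encode “diff size tells you almost nothing” is to let a diff inherit the tier of its most consequential hunk, whatever share of the lines that hunk represents. The sketch below assumes each hunk has already been assigned a tier by a written rule. The paths, line counts, and tier names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    ROUTINE = 1
    ERROR_PRONE = 2
    STRUCTURAL = 3


@dataclass(frozen=True)
class Hunk:
    path: str
    lines_changed: int
    tier: Tier  # assigned by a written rule, never derived from lines_changed


def diff_tier(hunks: list[Hunk]) -> Tier:
    """A diff is as consequential as its most consequential hunk."""
    return max((h.tier for h in hunks), default=Tier.ROUTINE)


# Hypothetical 800-line handback: almost all reformatting, one permission change.
hunks = [
    Hunk("src/report/format.py", 500, Tier.ROUTINE),
    Hunk("src/report/tables.py", 299, Tier.ROUTINE),
    Hunk("src/auth/permissions.py", 1, Tier.STRUCTURAL),
]

total = sum(h.lines_changed for h in hunks)
structural = sum(h.lines_changed for h in hunks if h.tier is Tier.STRUCTURAL)
print(f"{structural} of {total} lines are structural; diff tier = {diff_tier(hunks).name}")
```

A size-weighted view would call this diff 99.9 percent routine. The maximum calls it structural, and then directs the full read at the one line that earned the label rather than at all eight hundred.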

Which is why aviation’s deepest defense isn’t the inspection. It’s that the list is standardized and written down — the designation of what counts as an RII isn’t re-litigated per-job by whoever happens to be on shift. The triage rule has to be decided once, owned, and written, precisely so that the developer rationalizing a structural change as routine at the end of a long day doesn’t get to make that call fresh, alone, under the pressure that makes mis-classification feel reasonable.

Now the numbers, because they only confirm what the trades already knew

I’ve led with the auditor and the inspector on purpose, because the structure is the durable part and the figures are the perishable part. But the verification research backs the shape, and it’s worth seeing where it’s solid and where it’s thin.

The solid part first: review has a ceiling that isn’t a matter of discipline. Past a certain size, defect-detection effectiveness drops — larger changes get less effective review, not just slower review, and the effect is a property of human attention, not of how hard you try. And the changes that actually get reviewed well are small. At Google, “the median number of lines modified is 24” (Sadowski et al., 2018; a single-firm figure, but a real one). That’s the regime where review functions: small units, an engaged reader, context in hand.

Now the part to handle carefully. The most-cited exact thresholds — best detection below roughly two hundred lines under review, not to exceed about four hundred, in sessions under sixty to ninety minutes before reviewers wear out — come from a single vendor’s 2006 study of its own tooling (SmartBear/Cisco, Cohen 2006). I’ll say that plainly every time the numbers appear: those specific cutoffs are one company’s data on one company’s product. What’s been independently replicated is the direction — detection collapses as size and time climb — not the exact line where it collapses. So treat “two hundred lines” as a useful order of magnitude, not a law. The law is the slope, not the intercept.

And here the agent-era data lands the point. In one observational study of a little over forty thousand pull requests, agent-authored PRs received human review at 8.08 percent against 25.21 percent for human-authored ones — agent code getting roughly a third the review, not more (AIDev dataset, arXiv 2605.02273). In that same dataset, “each additional reviewer comment increases merge odds by 2.7% for human PRs, but decreases merge odds by 2.8% for agentic PRs” (arXiv 2601.18749) — review activity reading as a rejection signal for agent code rather than a refinement of it. (That under-review figure should stay distinct from a separately reported, much higher rubber-stamp rate from a different 2026 preprint; I’m citing the forty-thousand-PR observation, not fusing it with the other.) The machine produces more code, faster, in larger units — and the larger units are getting less scrutiny at exactly the moment the size makes scrutiny harder. The ceiling and the volume are moving in opposite directions.

The one number that names both the payoff and the limit

There’s a single finding from the triage research that I want to set down slowly, because it’s where the optimistic and the sober readings of this whole approach turn out to be the same sentence.

Researchers modeled an effort-aware way of ordering a review queue — spend your attention where the validated risk signals point — and found that “using only 20 percent of the effort it would take to inspect all changes, we can identify 35 percent of all defect-inducing changes” (Kamei et al., TSE 2013). Read one way, that’s the triage payoff: a fifth of the work, better than a third of the danger. A real return on deciding where to look instead of looking everywhere.

But it is the same finding, seen from the other side, that says: even spent well, that effort leaves about sixty-five percent of the defect-inducing changes uncaught. That isn’t a second, scarier statistic. It’s the same number’s honest face. Triage doesn’t close the gap — it lets you choose which part of the gap you’re accepting, the way the auditor chooses which population to sample and says so. The thirty-five you catch and the sixty-five you don’t are one decision, made on purpose. The point of tiering was never to verify everything. It was to put your full attention on the consequential changes and accept a named, bounded risk on the rest — knowing the rest is most of it.
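Here is a minimal sketch of what an effort-aware ordering can look like, loosely in the spirit of the finding above and not a reproduction of the paper’s model. It assumes you already have some risk score per change (the `risk` values here are invented), uses changed lines as a stand-in for review effort, ranks by risk per line, and stops at a fixed share of total effort. Everything it does not reach is returned explicitly, so the unread portion is named rather than implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    ident: str
    lines: int    # proxy for review effort (assumption)
    risk: float   # hypothetical risk score in [0, 1]


def pick_within_budget(changes: list[Change], effort_fraction: float = 0.20):
    """Rank by risk per line and take changes until the effort budget is spent.

    Returns (chosen, deferred). Deferred changes are the named, accepted gap.
    """
    budget = effort_fraction * sum(c.lines for c in changes)
    ranked = sorted(changes, key=lambda c: c.risk / max(c.lines, 1), reverse=True)
    chosen, deferred, spent = [], [], 0
    for change in ranked:
        if spent + change.lines <= budget:
            chosen.append(change)
            spent += change.lines
        else:
            deferred.append(change)
    return chosen, deferred


# Invented queue of routine-tier changes: 1000 lines in total, so a 200-line budget.
queue = [
    Change("A", 10, 0.6),
    Change("B", 400, 0.7),
    Change("C", 50, 0.5),
    Change("D", 40, 0.1),
    Change("E", 500, 0.2),
]
chosen, deferred = pick_within_budget(queue)
print("read:", [c.ident for c in chosen])
print("deferred:", [c.ident for c in deferred])
```

Notice what lands in `deferred`: change B carries the highest raw risk score and still goes unread, because it costs too much of the budget. That is the sixty-five percent in miniature, and it is why an ordering like this belongs below the structural tier, which gets read in full regardless of budget.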

That’s not a flaw in the method. It’s what the method is for. The auditor’s “risk of incorrect acceptance” and this sixty-five percent are the same admission in two vocabularies: you are signing for a thing you did not fully verify, you knew it when you signed, and you decided in advance where the unverified part would be allowed to live.

One more honesty note, because it cuts both ways. The thresholds aren’t the only thing worth distrusting. So are the headline volume claims, from both directions. The vendor numbers saying AI now writes forty-odd percent of code are as owned-by-the-seller as the alarmist ones; the strongest independent estimate of AI’s share rose “from around 5% in 2022 to nearly 30% in the last quarter of 2024” (Daniotti et al., Science 2026, via TechXplore). That is roughly two-thirds of the brochure figure, and itself only a reported result. A vendor claim of PRs up three-hundred-odd percent is exactly as much a sales document as any doom statistic. Triage from the sober independent facts. Distrust the brochure whether it’s selling you a boom or a crisis.

A worn metal sorting tray on a graphite bench divided into three plain compartments of different depths, steel discs sorted unevenly across the three tiers, lit by a single cool inspection lamp.

What this asks of you

Three tiers, decided in advance and in writing: full re-inspection for the consequential, a lighter check for the error-prone, a sign-off for the routine.

So the discipline is unglamorous and it is written down before the diff arrives, not improvised when it does.

You decide, in advance, what your structural tier is — the changes where being wrong is expensive enough that you read them in full, every line, no sampling: anything touching auth, money, data integrity, deletion, the public contract of an interface, the security boundary. You decide what your routine tier is — the changes a passing build and a spot-check can carry. And you give yourself a middle tier for the error-prone-but-not-catastrophic, the way aviation did. Then you key the tier to consequence of an undetected error, not to the line count, and you write the rule down so that the version of you that’s tired at the end of a long session doesn’t get to quietly re-file a structural change as routine.
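“Write the rule down” can be as literal as a small table checked into the repository. The sketch below is one possible shape, not a recommended calibration: every path pattern is hypothetical, and the choice to send unlisted paths to the middle tier instead of the routine one is my assumption, made so that nothing becomes routine by omission.

```python
from fnmatch import fnmatchcase

# Hypothetical written rule. Order matters: the first matching tier wins,
# so the most consequential tier is listed first.
TIER_RULES = [
    ("structural", ["*/auth/*", "*/billing/*", "*/migrations/*", "*/api/public/*"]),
    ("routine", ["docs/*", "*.md", "*/fixtures/*"]),
]

# Assumption: anything the rule does not name gets the middle tier.
DEFAULT_TIER = "error_prone"


def tier_for_path(path: str) -> str:
    """Look up a changed file's tier from the written rule, ignoring diff size."""
    for tier, patterns in TIER_RULES:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            return tier
    return DEFAULT_TIER


for changed in ["src/auth/permissions.py", "docs/setup.md", "src/util/strings.py"]:
    print(f"{changed}: {tier_for_path(changed)}")
```

Because the table lives in version control, re-filing a structural path as routine requires an edit that someone else can see, which is the software version of not re-litigating the list per job. Path patterns are only a first cut: a permission check can live in a file no pattern names, so a rule like this narrows the classification problem without solving it.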

There’s a seam in this. The trades give you the structure — the tiering, the consequence test, the hundred-percent examination of significant items, the named acceptance of sampling risk. They do not give you the calibration. There is no published materiality threshold for code, no validated sampling rate, no evidence-backed rule for where the structural line falls. Audit materiality has, as far as the literature shows, never been ported to software at all — it’s a data-not-found, and I won’t manufacture a number the evidence won’t carry. The structure transfers. The constants don’t. You set the constants yourself, for your system, and you own them.

Which is, in the end, the trade again. There was never going to be a rule that read the diff for you. The auditor signs the opinion knowing a sampled ledger could still hide a fraud. The inspector signs the airframe knowing the regime catches what it was routed to catch and nothing else. They sign anyway — not because they read it all, but because they decided, in writing and in advance, what reading it all would have meant, did the part that counted, and named the part they let go. You stop pretending you’ll read it all. You decide what you won’t. And the name on the line is yours for both halves of that decision — the part you verified and the part you chose not to.

The ceiling is real, and triage is how you live inside it. But triage assumes you can actually re-derive the consequential changes when you decide to read them deeply — and the agent handed you a flat diff on a cold desk, with the environment it ran in thrown away. The next break is the bench itself. That’s The Cold Desk, and the answer is a handback contract: the older trades simply refuse to accept work that doesn’t arrive in inspectable condition. (For the break before this one — why an absence has no surface for your eye to land on — see The Thing That Isn’t There.)