The Surface Lies

Your at-a-glance read of code quality was never reading quality — it was reading the author. Strip the author out and the instinct keeps firing, confidently, on nothing. The answer is the oldest one in the trade: own the production, not the appearance of it.

EP 2/6 June 12, 2026 ~10 min read

Listen:

Synthesized from the research corpus's Broken Instrument domain (automation-bias, fluency-and-trust, and review-signal lenses) and the Acceptance Rituals engineering-stamp lens. Every statistic carries its verified qualifier; the one bridge the evidence does not support is named as missing rather than implied. Source-reviewed, fact-reviewed, and gap-reviewed before publication.

You can tell good code from bad at a glance. Twenty-five years sands that instinct down to something that feels like perception — you open a file and you know, before you’ve read a line closely, whether you’re in safe hands. Except the glance was never reading the code. It was reading the author.

That sounds like a clever inversion. It isn’t meant to be clever; it’s meant to be exact. The signals your fast read actually ran on were signals about the person and the history behind the diff — and the agent strips every one of them out while leaving the surface that fooled you into thinking you’d read the work.

A worn round gauge in a heavy pitted steel bezel on a cool graphite bench, its needle resting confidently mid-dial — but the dial face is a swatch of woven upholstery fabric instead of a calibrated scale, raked by a single cool inspection lamp.

What the glance was actually reading

The instrument reads the upholstery, not the substrate. The needle sits confidently mid-dial over a face that was never a scale.

When researchers go looking for which review signals genuinely predict where the defects are, the answer is not the ones that feel like judgment. It’s the ones that are really metadata — properties of the code’s history and its authorship, not of the static diff in front of you.

Relative churn — how much a file has been rewritten over its life — discriminated fault-prone code at 89.0% in the study that anchors the finding (Nagappan & Ball, Microsoft Research, 2005). Author familiarity did similar work: peripheral contributors to a file had their changes scrutinized 2 to 19 times longer than the regulars (Bosu, 2014), and a reviewer’s prior experience with a file lifted the density of useful review comments from 66% to 74% (Bosu et al., 2015). Those are the instruments that fired when you opened a file and knew. Not “this code is clean.” More like: I know this file, I know who touched it, and the history isn’t screaming at me.

The signals that feel like quality assessment — low cyclomatic complexity, the absence of code smells — turn out to carry almost no independent predictive weight. Complexity adds nothing beyond raw size — it tracks line count in a “stable practically perfect linear relationship” and “has no (or very little) explanatory power of its own” (Landman et al., 2014); smells, once you control for size, move fault-proneness by a margin one large study put “always under 10 per cent” (Hall et al., 2014). The reassuring surface was never the instrument. It was the upholstery.

Here is the trap. Machine authorship erases the metadata — there is no churn history on a file the agent just generated, no familiar hand to recognize, no contributor reputation to weigh — and it manufactures the upholstery for free. Clean structure, consistent naming, no smells. The agent produces exactly the surface that your instinct misread as quality, and removes exactly the substrate your instinct was actually reading.

The verified part is which signals predict defects; that the machine erases precisely those is the most defensible reading of it — interpretation, and I’ll keep it labeled.

The instinct fires on nothing

Strip out the author and the instinct doesn’t stop. It keeps firing — on the surface, confidently, on nothing.

We have decent evidence for the confidence half. People over-trust fluent, assured machine output independent of whether it’s correct. In one study participants endorsed a chatbot’s incorrect first answer in 9 of 13 cases — roughly 70% — and fluent explanations raised acceptance of wrong answers (arXiv 2502.08554). Reliance tracked the linguistic confidence of the output rather than its accuracy; the combined incorrect-reliance figure in that line of work reached 54.95% (arXiv 2507.06306) — and that number is for one model–language pairing (Llama-3.1-8B, Japanese), not a general rate of how often anyone trusts anything. It travels with that qualifier or it doesn’t travel.

The deeper mechanism comes from forty years of aviation and medical human-factors research, and it’s the most useful thing in this piece. Complacency toward automation is driven not by how reliable the automation is but by how consistent its reliability is. Monitors watching a system whose reliability varied caught its errors about 75% of the time. Monitors watching a system that was reliable at a steady, even rate caught them about 30% of the time (Bagheri & Jamieson, via PMC4221095). The steadiness itself disengaged the supervisor — from exactly the errors the supervisor was there to catch.

That puts the danger zone in the high-and-consistent range — and a coding agent, useful most of the time at a steady rate, appears to sit right in it. Though “sit” is interpretation laid on the measured 75-versus-30 result, not a number anyone recorded at a real desk. The mechanism is solid; the placement is a reading.

So the agent is the textbook setup for both failures at once: fluent and confident enough to trigger the trust reflex, consistent enough to trigger the complacency reflex. “It’s usually right” is not the all-clear. It’s the warning.

The author you can’t even see

There’s a final turn, and it’s the one I find hardest to sit with.

Humans cannot reliably tell that code was written by a machine. Asked to distinguish AI-authored code from human-authored code, people perform at roughly chance — about 50% (arXiv 2411.04704). A trained classifier, on the same task, reaches an AUC of 0.91 (Idialu et al., arXiv 2403.04013). The detection signal is real and strong; it just isn’t available to the human eye. (That 0.91 is an authorship classifier — whether the code was AI-written. It is not a measure of whether the code is any good, and not to be confused with any other 0.91 floating around the literature.)

It gets worse than not noticing on demand. In one study around 80% of developers didn’t realize the code in front of them was AI-generated at all — “almost 80% (11/14) of them reported they did not realize it” (Tang et al., 2024, arXiv 2405.16081) — provenance-blindness, not provenance-misjudgment. The thing your fast read most depended on — who wrote this — isn’t just degraded by machine authorship. It’s invisible. You can’t read the author through the surface because the surface no longer has one, and you can’t even tell that’s the situation you’re in.

This is the part to laugh at rather than mourn, because the alternative is insufferable. The sharpest tool I owned — the read that felt like seeing the quality of a thing directly — turns out to have been an author-detector the whole time, running on churn and familiarity and the faint signature of a careful colleague. Of course it was. It worked for twenty-five years because for twenty-five years there was always an author. Now there isn’t, and the tool fires anyway, on the upholstery, telling me everything’s fine.

That “your glance read the author, not the quality” framing is itself interpretation — the best available reading of the metadata-signal and surface-null findings, not settled cognitive science. Good enough to act on, though, and the action is the rest of the trade.

A heavy worn engineer's embossing seal in long iron tongs resting on a sheet of paper, the sheet carrying a single raised embossed roundel struck into its corner that catches the one warm bronze note, lit by a cool inspection lamp on a graphite bench.

The first master: own the production, not the artifact

The first master's seal. What it attests is responsible charge — control exercised throughout the work — not a stamp applied at the end.

So what replaces a read that no longer reads anything? The answer is older than software, and the profession that worked it out most rigorously is structural engineering — where someone has to put a name on a drawing and be accountable when a building falls down.

The structural engineer’s seal looks, from the outside, like a stamp of approval applied at the end. It isn’t. What the seal legally attests is responsible charge, and the regulation is specific about what that requires (NCEES Model Rules §240.20 E). The licensee in responsible charge must:

Have and exercise the authority to review and to change, reject, or approve both the work in progress and the final work product, through examination, evaluation, communication, and direction throughout the development of the work.

Read the operative phrase: throughout the development of the work — not a sign-off at the end, but control exercised while the thing is being made. The profession states the converse just as plainly: reviewing drawings after preparation, without involvement in their development, does not satisfy responsible charge. By the seal’s own logic, then, “I reviewed the PR” is not ownership. Reading the finished artifact — which is exactly what the broken glance was doing — is the thing the strongest acceptance ritual any profession ever built explicitly rules insufficient. Ownership is steering the production: setting the design concept, the constraints, the rejection criteria, before the work is generated — and being able to answer detailed questions about it afterward in enough depth to demonstrate you actually controlled it.

The case that made this concrete is the 1981 Kansas City Hyatt Regency walkway collapse. The engineer of record, Jack Gillum, hadn’t personally drafted the fatal connection detail — a revised hanger-rod design that left the beams able to carry only about 30% of the load the building code required. The project engineer who handled the drawings, Daniel Duncan, never reviewed the shop drawings either. Gillum was held liable anyway. The court found him “vicariously liable and responsible for the acts and omissions of Duncan which liability and responsibility he assumed by affixing his professional engineering seal on the structural drawings,” and “grossly negligent in failing to himself review or assure that someone had reviewed” the drawing before he sealed it (Duncan v. Missouri Bd., 744 S.W.2d 524). Liability traced through the seal, not through authorship. Gillum didn’t draw the detail, and didn’t personally review it. He owned it anyway, because his seal was on it. That was the whole point of the seal.

Note the precise shape of what the seal punishes, because it matters for the analogy. The enforced offense is sealing work you did not actually review — “plan stamping,” in the profession’s term. It is not strict liability for every latent flaw that survives an honest review. Gillum was sanctioned for grossly negligent sealing of un-reviewed defective work, not for the bare act of stamping something that later proved imperfect despite genuine control. The stamp attests accountability, not perfection — and not authorship.

The profession has already pointed this rule directly at our problem. The American Society of Civil Engineers adopted Policy Statement 573 in July 2024, holding that AI “cannot serve as a replacement for the professional judgement of a licensed Professional Engineer” and, more pointedly, that “AI cannot be held accountable.” Which is the line to carry out of this piece: an agent wrote it is exactly as exculpatory as the steel fabricator proposed it. Which is to say — not at all. The fabricator proposing a detail doesn’t move the accountability off the engineer who sealed it. The agent generating a diff doesn’t move it off you.

The bridge the evidence doesn’t build

I owe you one honest gap before the close, because the cleanest version of this argument rests on a study that doesn’t exist.

The tidy claim would be: polished code specifically fools you — that clean formatting and consistent naming, with correctness held constant, raise your judgment of quality on their own. It’s a believable mechanism. It’s the one this whole piece leans toward. And no controlled experiment has demonstrated it. Nobody has manipulated code polish while holding correctness fixed and measured the effect on a reviewer’s quality judgment. The cleanest dissociation anyone can point to is a single indentation study from 1983 — Miara et al., forty-three years old — and the modern formatting literature that surrounds it is a thin, “null and contradictory” systematic review of just fifteen studies (Oliveira et al., arXiv 2208.12141). The modern headlines claiming reviewers’ perceptions have “decorrelated” from actual quality are mostly vendor content, and they remain claims, not findings.

So I’ll say plainly what I can and can’t support. The mechanism — fluency pulls trust, consistency breeds complacency, the validated signals are metadata the machine erases — is well-evidenced. The specific bridge from clean-looking code to misjudged quality is believed, not demonstrated. We think the experiment would confirm it. The experiment hasn’t been run. A piece arguing that surfaces lie has no business pretending its own weakest link is solid.

And the answer survives the gap regardless. Re-derive the intent the agent was supposed to be exercising — rebuild the design and the constraints yourself, because no author’s head holds them anymore — and own the production rather than the artifact. That discipline doesn’t depend on the missing study. It depends on the part that is solid: the glance reads a surface, the surface no longer has an author behind it, and the name on the line is still yours.

The surface lied about what’s present. The next break is worse: the surface can’t show you what’s absent — the permission check never written, the edge case never handled. That’s The Thing That Isn’t There, and the answer comes from two trades built entirely around hunting what’s missing.