The Cold Desk

The agent ran your code in a warm sandbox with the state seeded and the intent fresh, then handed you a flat diff and threw all of it away — leaving you to rebuild its world on the coldest desk in the building. The answer is a handback contract: the older trades refuse to accept work that doesn't arrive in inspectable condition. This is the most hedged piece in the series, and it says so.

EP 5/6 June 12, 2026 ~15 min read

Listen:

Synthesized from the research corpus's Reviewer's Workbench domain (feedback-latency, acceptance-tooling, and isolated-environments lenses) and the Acceptance Rituals medical-countersignature lens. This is the project's heaviest vendor zone; every breakage claim and latency figure carries its vendor flag, the central remedy is named as a discipline-and-hypothesis rather than a proven outcome, and the causal link the whole domain leans on is reported as data-not-found. Source-reviewed, fact-reviewed, and gap-reviewed before publication.

Generation got fast environments. Acceptance got a flat diff and a stale branch.

When the agent wrote the change, it had a whole world to work in. A container booted to your project, the dependencies built, the database seeded with state it could poke at, the test it just ran still warm in the buffer — and, underneath all of it, the intent: the plan it was working from, the constraints it was holding, the reasons it picked this approach over the other one. It ran the code. It watched the code run. Then it serialized the result down to a unified diff, handed you that, and threw the rest away. You inherit the artifact and none of the workshop. You are reconstructing the agent’s whole world on the coldest desk in the building.

That is the break this piece is about — the bench, not the code on it. And I’ll tell you now, because the plan I’m working from is blunt about it and I’d rather be blunt too: this is the most hedged piece in the series. The asymmetry is real and I can show it to you. The fix I’m going to argue for rests, in part, on a claim nobody has measured in either direction. I’m going to keep those two things separate the whole way through, because the moment they blur is the moment I’ve sold you something the evidence doesn’t carry.

A cold, bare inspection workbench empty except for a single thin flat folded sheet lying alone in the middle — the flat diff handed over and abandoned — with a worn bench vise clamped at one edge, lit by a single cool inspection lamp.

The tax is real, and it lands before any test runs

The coldest desk in the building. The agent kept the warm workshop and handed you a flat diff on a stale branch.

Start with the part that’s solid, because it’s solid and it’s worth seeing clearly. The cost of bringing a real system up to a state where you can run it — not write code, not read code, just get the thing breathing so you can exercise the change — is large, and it’s mostly setup, not execution.

The cleanest public measurement comes from Shopify’s CI team, who decomposed where their pipeline’s wall-clock actually went. Preparing the build agents took around 31% of CI time; building dependencies took 37%; combined, in their words, “68% of the time in CI was spent just on overhead before we actually ran any test” (Shopify Engineering, 2021). Roughly two-thirds of the clock, gone before verification even starts. That’s a self-reported, unaudited practice report — take the exact percentage as one team’s accounting, not a law. But the shape is corroborated everywhere you look: the expensive part of a real system isn’t the test, it’s getting to the test.

And the floor stays high even after years of elite investment. After dedicated work by teams with infrastructure staff most of us don’t have, the realistic settling point for a mature monolith sits in the tens of minutes: Shopify’s CI p95 landed at about 18 minutes (Shopify, 2021); Uber’s Go monorepo post-rollout p95 wait came to 18.64 minutes (Uber Engineering, 2025); a well-tended Spring Boot averaged 30 minutes of CI even after a tenfold improvement in local build time (Gradle, 2023). Two of those three are vendor content, one is an engineering-brand blog, and they’re all survivorship — only the wins get written up; nobody blogs the caching project that regressed. So read them as existence proofs of a floor, not as benchmarks. The durable claim is the order of magnitude: bringing a serious stateful system up cold is a tens-of-minutes proposition, and most of that is setup you pay before you learn anything.

The link I can’t draw for you

Here’s where I have to stop and mark a line, because the obvious next sentence is the one I’m not allowed to write.

The obvious sentence is: and so, under that tax, the reviewer skips running it. You feel that’s true. I feel that’s true. The deadline is real, the boot is twenty minutes, the diff is green, and the temptation to just read it and approve is exactly the temptation a tired senior is built to give in to. But “latency causes skipped verification” is not a thing anyone has measured. What the literature actually has is the inner-loop tax degrading engagement — developers wait, lose context, fragment their attention, switch tasks (getDX, 2026, a framework vendor describing its own dimension) — plus a single-organization correlation that the people who flagged long builds as a “bad day” factor had meaningfully longer mean build times (49.74 vs 36.65 minutes, p << 0.05) (arXiv 2410.18379). That’s a correlation with worse days. It is not a measurement of verification being dropped because the environment was cold.

So I’ll say it flat: the link from a slow environment to a skipped check is practitioner consensus and adjacent correlation, not a measured causal finding. It’s the load-bearing assumption of this entire problem area, and it’s data-not-found. I believe it. I won’t pretend that believing it is the same as knowing it. A piece that’s about reconstructing a missing world has no business being sloppy about what’s missing from its own evidence.

The asymmetry: one party got the warm bench

Set the absolute numbers aside, then — vendor-chosen, mostly — and look at the relative shape, because that’s the durable part and it’s the part that should bother you.

The agent did not pay the tax you pay. Its execution context is engineered for exactly the opposite of your cold start. The sandboxes these platforms run on boot in well under a second: E2B advertises sandboxes that “start in less than 200 ms,” Daytona claims “under 90ms from code to execution” (e2b.dev; daytonaio/daytona). Both of those are vendors selling agent infrastructure, with no independent benchmark behind the digits, so don’t anchor on 90 milliseconds versus 200 — anchor on the category difference: the agent works in a disposable, near-instant, self-contained box, and you work against a system that takes tens of minutes to wake up. The reason isn’t that the fast engineering was withheld from you out of spite. It’s that the agent’s workload is small, stateless-at-start, and self-contained — the easy case — while the reviewer of any real stateful system inherits the seed-data problem, the services-times-open-PRs cost, the migrations and event flows that resist being stood up on demand. Same milliseconds-fast substrate; entirely different problem on each side of the handback.

The sharper edge of the asymmetry isn’t speed, though. It’s that the agent’s environment was already specified — and then discarded. OpenAI’s own documentation describes Codex cloud creating a container, checking out your repo, running your setup script, and caching the resulting container state to resume later (OpenAI Codex docs). That’s a vendor documenting its own product, so weigh it as such — but the structural point is hard to argue with: the agent already constructed a container image, a setup sequence, and a state snapshot. It had the runnable world. It threw it away at the boundary and handed you a diff. Spring-2026 tooling has started exporting these — but author-to-author, back to the same person mid-task, not to an independent acceptor doing the cold post-handback read (Codex docs). The world existed. It just wasn’t handed to you.

And it isn’t only the environment that gets dropped. The provenance goes too — the plan, the prompt, the requirement the change was meant to satisfy — and it was already going missing long before agents arrived. In a study of 80,000 pull requests across 156 projects, the purpose of the change appeared in only about 55% of PR descriptions and a link to the motivating issue in under half (45.16%); a separate finding the same work cites puts the share of PRs with no description at all around a third (Pirouzkhah, Wurzel Gonçalves & Bacchelli, arXiv 2602.14611). Issue-to-commit links fare no better — one study of six open-source projects found only about 60% of commits linked to an issue, leaving roughly 40% with no trace to what motivated them (Rath et al., 2018). So the reviewer routinely can’t trace a change back to its intent even when he wants to. The agent had the intent in hand the entire time it was working. You’re reverse-engineering it from the diff, on the cold desk, with the lights off.

A worn heavy fountain pen resting on a graphite bench beside a document bearing a single abstract looping signature mark struck in the one warm bronze note, lit by a cool inspection lamp.

The master of this trade re-performs the work

The countersignature catches errors only to the degree it re-performs the work. The signature that re-derived the judgment, not the one that admired the surface.

The break is the bench. The answer is to refuse to accept work that arrives in un-inspectable condition — and the trade that worked this out most rigorously is medicine, in the act of the countersignature.

The naive reading of a countersignature is that it’s a second name on the chart: the attending signs under the resident, presence added, safety improved. The evidence says that reading is wrong, and it says so with unusual clarity. Where the supervising physician genuinely re-performs the work — re-derives the judgment on the same inputs — the review catches errors. The cleanest case is the radiology over-read: an attending re-examines the same images the resident read and issues the binding final report. In a study of 9,072 preliminary resident reads, the attending re-read surfaced a major discrepancy rate of 1.7% — a small but real fraction of consequential misses, caught before they propagated (PMC10415655). A large VA surgical cohort points the same way: across 862,425 teaching encounters, residents supervised by a scrubbed attending — hands in the case, not signing off on a description of it — had fewer 30-day deaths, by 14.2% (PMC10735144).

Now the part that makes it a real finding instead of a slogan. Mere presence — a supervisor added but not re-deriving — bought nothing measurable. A randomized trial put attendings into work rounds versus merely available, and found the medical error rate not significantly different between the two arms (107.6 versus 91.1 per 1000 patient-days, P = .21), while the residents reported feeling less autonomous (Finn et al. 2018, PMC6145715). More sign-off did not equal fewer errors. The signature catches things only to the degree the act behind it reconstructs the judgment.

A sign-off catches defects only to the degree it forces re-performance of the work. The radiology over-read catches errors because the attending independently re-derives the read on the same images and issues a binding report. An LGTM that attests to having read a fluent artifact is the documented failure mode — not the success case.

There’s a caveat welded to this evidence. Every one of those “re-derivation works” findings comes from a procedural or perceptual setting — radiology, surgery — where the supervisor re-does an identical, observable task. That’s the case that least resembles reviewing a finished narrative artifact, which is what code review actually is; the one randomized test in a cognitive, document-heavy setting is the Finn trial — where added supervision did not significantly help. So the medical evidence supports re-derivation exactly where the work looks least like ours, and goes quiet where it looks most like ours. I’m carrying the master’s lesson — the safeguard is the reconstruction, never the signature — while being honest that it’s strongest in the setting furthest from mine.

A worn metal handover strongbox on a graphite bench, lid open, holding a folded document with a struck wax seal in the one warm bronze note, a sealed canister, and a coiled tape spool — the work arriving runnable, with provenance, with an auditable run-log — lit by a cool inspection lamp.

The handback contract

The handback contract: arrive runnable, arrive with provenance, arrive with an auditable run-log. The agent built all of it and threw it away at the boundary.

Here’s the move the countersignature teaches when you bring it back to the desk. The attending who re-performs catches errors; the one who merely signs doesn’t. But re-performance is expensive — and a cold desk makes it expensive enough that, under deadline, you do the cheap thing instead and just sign. A warm environment is what makes re-derivation affordable enough to actually happen. So the answer to a cold handback isn’t heroics. It’s a contract: a refusal to accept the work unless it arrives in a condition you can re-perform against.

Three demands, each the back-end counterpart of a discipline you’d impose at intake.

Arrive runnable. The change comes with a one-command-bootable, reproducible container definition and a seeded-state snapshot, so acceptance starts warm instead of paying the setup tax cold. The artifacts to do this already exist — devcontainers, reproducible builds — and, as we saw, the agent platform already built most of them per task. This isn’t asking anyone to invent machinery. It’s asking that the machinery the agent already constructed be exported with the diff instead of discarded at the boundary. (It isn’t free — a state snapshot can carry secrets or production-derived data, which is a real governance cost the contract has to price in, not wave away.)

Arrive with provenance. The plan, the prompt, the requirement link travel with the diff — PR templates, required issue links, the attached plan file — because the data above shows this is routinely absent and the reviewer is left reverse-engineering intent he was never handed. This is the unglamorous, available-today lever: it needs no new research, only the discipline the numbers show is missing.

Arrive with a run-log — kept auditable, not trusted. A record of what the agent actually exercised: which tests ran, which paths were touched, what passed. And here’s the load-bearing adjective, because it’s the whole difference between a tool and a trap: the run-log is evidence to be replayed and checked, not a green badge to be believed. There’s a thread of support for this — the one thing that fixed an over-rejecting AI reviewer wasn’t better prompting, it was making it run the tests; executable evidence cut its false negatives from 54.8% to 16.3% on one benchmark (arXiv 2603.00539, a single HumanEval/MBPP study, so hold it lightly). Executable evidence disciplines an unreliable judge. But a run-log you trust instead of re-run is just another fluent surface — which brings me to the hedge I promised at the top.

What I cannot tell you about my own remedy

The honest center of this piece is a question I can’t answer: does arriving warm actually make you a better acceptor?

I don’t know. Nobody does. There is no controlled or observational study isolating whether a runnable, warm-start handback improves what a reviewer catches over a flat diff — not in favor, and not against. It’s data-not-found in both directions. And it gets worse than merely unproven, because there’s a coherent argument that a warm bench could hurt. A one-command-green, pre-warmed, runnable environment is itself a fluency signal — and fluency, as the first break in this series established, is exactly what triggers the over-trust reflex. A runnable surface is one more confident, polished thing inviting you to relax. The warm bench that was supposed to make re-derivation affordable could just as easily make rubber-stamping easier. Running it is also structurally blind to the thing the third break was about: an environment replays what was done and can’t surface what was never done, so a passing run tells you nothing about the permission check that was never written. Warm-start helping is a hypothesis. It might be a vigilance cost dressed as a convenience. I’m proposing it as something to test, not something I’ve shown.

So why argue for the contract at all, if its headline benefit is unproven and might invert?

Because the contract doesn’t actually rest on the warm-start benefit. That’s the part worth sitting with. Arrive runnable, arrive with provenance, arrive with an auditable run-log stands as a demand on agent output regardless of whether warm-start is ever proven to help your vigilance — for a reason that needs no study: the artifacts existed on the agent’s side and were thrown away at the boundary. The container, the seeded state, the plan, the record of what ran — the agent had all of it, constructed in the course of doing the work, and discarded it at handback. Insisting it be exported instead of destroyed is not a bet on an unmeasured cognitive effect. It’s a refusal to accept work in deliberately worse condition than the condition it was produced in. The auditor refuses “trust me.” The attending won’t sign without re-performing. Neither of them is running an experiment on whether inspectable conditions improve their hit rate — they simply will not accept work they cannot inspect. The contract is how you refuse a cold handback. Whether the warm desk also sharpens your eye is a real and open question, and I’d genuinely like the study. The refusal doesn’t wait on it.

I work in an isolated, single-purpose environment I built for one kind of discovery, and the thing I notice most is not how fast it boots — it’s how much I trust a result when I can see exactly what produced it, and how little I trust the same result handed over as a clean summary with the workings removed. That’s the entire instinct the contract is trying to make into a habit. Not “make it warm so review is pleasant.” Make it inspectable, so the signature you put on it can mean what a signature is supposed to mean — that you reconstructed the judgment, not that you admired the surface.

The agent will hand you a flat diff on a stale branch and call it done. It had the warm world and kept it to itself. You can accept that, reconstruct what you can from the artifact alone, and sign on the cold desk — that’s the default, and the default is how the cheap signature wins. Or you can refuse the handback until it arrives in a condition you can actually re-perform against, knowing the warm bench is a discipline you’re imposing and a hypothesis you’re testing, not a proven cure. Either way the name going on the line is yours. The only question the contract settles is whether you’ll have rebuilt the world before you sign it, or just trusted the diff that someone else’s workshop produced and abandoned.

Re-deriving the change — re-performing the judgment rather than admiring the artifact — is the move under every answer in this series, and it’s the gold standard precisely because it’s the expensive one. Which raises the question the whole collection has been circling: there’s no rule that tells you when you’ve re-derived enough*. That’s What We Can’t Tell You Yet, the closer — where the gaps stop being caveats and become the point. (For the break before this one — why you can’t read it all, and how the trades decide what to read deeply — see You Can’t Read It All.)*