~/color-code

The House Always Catches Up

The platform is shipping the community's best plays as built-ins — so what happens to the team whose signature move just became a button?

Listen:

Synthesized from a 16-document research corpus across five cascaded domains (15 lenses). The method here is archival and forward-looking: the absorption cadence was read off dated platform primaries — shipped flags, version notes, and vendor docs — and corroborated against independent community primaries (the open feature-request thread, retrieved as API JSON with dates, states, and reaction counts verbatim). The standardization spine was blind-re-derived against two independent primaries (a Linux Foundation release and a vendor post); the darkhorse metrics were live-pulled via the GitHub API; the frontier preprints were existence-checked against their abstracts. Forecasts are tagged as forecasts. Source-reviewed, fact-reviewed, and gap-reviewed; the cross-domain synthesis passed an independent fidelity review before publication.

The league’s best plays keep vanishing off the board — and not one of them lost. They got shipped. A worktree flag here, a memory tool there, a multi-agent orchestrator the next quarter, and a team’s whole hand-built playbook is suddenly just the field everyone stands on. This is the back half of the season, where you stop asking which team ran the smartest move and start asking what happens to a team after the house turns its signature move into a button.

One line on the vantage, because the forecasts start here and a forecast should announce itself. We’re at the turn of the season now, still on the same field — the one building tools that take a written spec and shape it into the prompt sequence a coding agent runs on — and the home team in this booth is Superpowers, the framework I work in myself. Which means it gets the hardest call in this segment too, and the hardest call is this: the house is catching up to the home team, right now, in the open. That’s not a defeat. Watch why.


A worn hand-built brass lever switch wired onto a scuffed metal panel, beside a smooth flush machined push-button set into the same panel, under a broad warm booth light falling off to navy.

The home team’s playbook, retrofitted to a button

The hand-built switch becomes a flush native button. The mechanism commoditizes; the judgment of when it bites does not.

Here is absorption happening to the team I run, logged by the people it’s happening to.

For most of this season the strongest move in multi-agent work was hand-built: a suite of skills — subagent-driven-development, dispatching-parallel-agents — that taught one orchestrating agent how to fan helpers out and pull their work back. Then the platform shipped Agent Teams: a native team lead, native teammates, each its own session, a shared task list, peer-to-peer messaging. The hand-built orchestration became a primitive underneath the hand-built skills. And the community didn’t have to be told — it filed the retrofit request itself. From the issue thread (created 2026-02-05, still open, 112 reactions as of access):

Superpowers currently has skills for subagent-driven development (subagent-driven-development, dispatching-parallel-agents) but no awareness of or support for the newer Agent Teams primitives … Skills like executing-plans and subagent-driven-development only know about the Task tool (subagents), not Teammate, SendMessage, or the team-aware TaskCreate/TaskUpdate/TaskList tools.

Read what that’s conceding. The skills only know about the Task tool — the hand-built suite speaks the old vocabulary, and the platform just shipped a new one underneath it. A stopgap fork appeared inside three weeks to bridge the gap until the main line catches up. That’s the absorption in one frame: a team’s signature move becoming a native primitive, and the team’s own users asking it to please learn the words the house now ships.

And here’s the part that turns “the house caught up” from an obituary into the actual finding — the button arrived with the community’s caveat printed on it. The Agent Teams docs don’t just expose the mechanism; they encode the discipline the field paid for in merge conflicts. The vendor’s own best-practice text reads:

Two teammates editing the same file leads to overwrites. Break the work so each teammate owns a different set of files.

Single writer, partition the files. That is the exact rule the home team forbade parallel implementers to protect, and the exact rule a swarm violates — now shipped as the platform’s own advice. The mechanism got commoditized. The judgment of when the rule bites did not, and could not, because a primitive can carry a warning label but it can’t carry the decision about when the label applies. Hold that line; the whole segment turns on it.


Two tracks, one consolidation

Step back from the one case and the cadence is unmistakable, and it runs on two tracks at once.

On the first track, the host vendor is pulling the community’s method-plays down into native primitives, on a dated clock anyone can check: native subagents (mid-2025), then a 1M-token window and a file-based memory tool the same generation (August–September 2025), then OS-level sandboxing reporting “84% fewer permission prompts” (2025-10-20), then native git-worktree isolation, then experimental Agent Teams (early 2026). The two plays this booth graded as necessity back in the centerpiece — write-isolation and externalize-durable-state — are both native now. The move I’d call physics became the field itself. A careful note on causation, because it’s interpretive: the primitive maps onto the community move, shipped after it, in the same shape. That’s strong circumstantial evidence the house was watching the community. It is not a proof that the house shipped it because the community proved it works — nobody has shown the absorbed primitive produces better code than the hand-built version. Absorption is an adoption signal, not a quality verdict. Keep those apart all the way to the closer.

On the second track, the interface and config layers are calcifying the other direction — not into one vendor’s primitives but into neutral, cross-vendor standards. The single hinge is dated: on 2025-12-09 the Linux Foundation formed the Agentic AI Foundation, with eight founding members — AWS, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, OpenAI — and two originating rivals handed their conventions to one shared body: Anthropic donated MCP (the tool-and-context connector, which its own steward blog calls the “de-facto standard,” claiming “more than 10,000 active public MCP servers” and “97M+ monthly SDK downloads” — a steward’s self-report, so weight it as one), and OpenAI donated AGENTS.md (the agent-config file, on a curve from 20,000 projects at its August 2025 launch to “more than 60,000 open source projects” by the December release — one dated press figure, measuring reach, not quality). Two competitors donating their standards to a body they jointly govern is the strongest signal in the segment that this layer is settling. The asterisk rides along in the clause: “neutral” is vendor-governed — three of the eight founders sit on both sides of that table — and the read is six months into a weeks-old arrangement.

So the field is consolidating on two tracks: the judgment-encoding plays get absorbed by a single vendor, and the low-disagreement plumbing floats up into a shared standard. Build on the standards — that race is largely over. The plays are a different matter, and that’s where the forecasts get honest about what they don’t know.


What the house can’t decide for you

Here is the question the dated record genuinely cannot answer: when the house absorbs your play, does your tool survive — reshaped — or get stranded?

The modal outcome in tool history is survival-by-relocation. Linting didn’t die when the platform shipped ESLint; it became infrastructure, and the craft moved up the stack from inventing the mechanism to encoding judgment on top of it. That’s the optimistic precedent, and it’s the likely one. But the counter-case is real and it’s on the record. Bower was Twitter’s front-end package manager — for a stretch it was the blessed default for browser assets, the convention everyone adopted. Then a better-integrated stack displaced it, and its own homepage now waves you off:

While Bower is maintained, we recommend using Yarn and Vite for front-end projects.

A blessed default, pointing you at the rivals that beat it. That’s the outcome the linting-to-ESLint story leaves out: sometimes the house doesn’t relocate the incumbent up the stack, it strands it. Which way absorption breaks for the community tools in this field is a forecast, precedent-only, and I’m tagging it as one — the dated record doesn’t decide it, and the re-stabilization clock (community demand re-settles in days; the primitive itself is still churning its own surface version to version) says we’ll know within a generation or two, not now.

One discipline this whole segment runs on, cheap to state and earned the hard way: date every platform fact and name the generation. The cautionary tale is the 1M window’s launch “price premium” — real for the Sonnet 4 beta in August 2025, and quietly removed at general availability the next spring, when the vendor stated “there’s no multiplier: a 900K-token request is billed at the same per-token rate as a 9K one.” A claim keyed to a superseded generation isn’t wrong; it’s stale. In a field that rebalances monthly, the date is the receipt.


A small, finely machined brass-and-steel gate latch with one clean lever, centered in a narrow shaft of warm light, faint chalk play-marks scattered in the navy dark around it.

The darkhorse you pick on technique, not adoption

Called on the move, against the standings. A precise gate, picked on technique — not adoption.

Every other play this season earned its place by catching on. This one I’m calling the other way — on the move, against the standings — and saying so out loud, because it’s the one forecast where the booth bets ahead of the crowd.

The far end of the spec spectrum, the one the borrowed-play segment pointed at and left for later, is a small Rust tool called agent-spec. Its distinguishing move is the one the centerpiece already admired: it binds each acceptance scenario to a named test function and derives the verdict from a process exit status — output.status.success() becomes a pass, anything else a fail — so the judgment of “done” is a deterministic gate with no model in the loop. That’s the sharpest answer in the league to the honest seam the centerpiece left open: an executed test is only as good as what it asserts, and most agentic tests are written by the same model that wrote the code. Binding completion to a named, executable contract is the cleanest discipline anyone’s shipping against that problem.

Now the standings, called straight against my own bet. As of access, agent-spec carried 269 stars, 21 forks, an active 0.3.0 release cycle, 30 commits in the trailing month — and zero open issues, its repo’s open_issues_count of 1 being a single open pull request, the maintainer’s own. Every bit of that motion is maintainer activity. Third-party adoption — a non-maintainer building on it, a marquee tool binding its completion criteria to named tests — is, flatly, data not found. So I’m not telling you agent-spec wins. The history says it probably doesn’t, and the precedent is specific: property-based testing took roughly two decades to go from one Haskell paper to “57 reimplementations in other languages” — and it spread as an idea re-implemented in each ecosystem, never as the original tool accruing dependents. Mapped onto this play, the likely path to the mainstream is a popular spec tool absorbing test-binding-as-gate, not this particular repo winning. Which is exactly the move a senior makes: steal the technique now, don’t wait for the tool to be blessed. Pick it on the film, not the gate receipts.


Two worn upright metal posts with empty hinges and bare bolt-holes facing each other, nothing hung between them — an empty gate-mount, the opening unfilled, lit by a warm pool falling off to navy.

The gate nobody’s shipped

The most-demanded, least-built thing in the field. The brackets wait; the gate is unshipped.

The densest argument of the first half — the swarm that buys speed, not quality — ended on a seam it couldn’t close: there’s no gate that decides when a review loop is load-bearing and when it’s theater, and the most enthusiastic users in the league are paying for ceremony on work that doesn’t need it. That gate is the single most-demanded, least-built thing in this whole field. It’s worth being precise about what it is and isn’t.

It is not a void nobody noticed. It’s an old idea with loud, dated demand — the same users logging “10–15x” overhead on trivial jobs were asking, explicitly, for ceremony that scales with task size. And the compute layer has already shipped the shape of the answer: Claude’s effort parameter is a five-tier dial — low through max — that auto-routes how hard the model thinks, where “at lower effort levels, it may skip thinking for simpler problems.” That’s the compiler -O pattern arriving at the model layer: a small set of named effort tiers, selected by a coarse signal. The precedent is clean and the demand is documented. What’s missing is the same dial one layer up — a harness-level gate that routes how much review ceremony runs by something like diff size, instead of running the full wave on every three-line change. A skill that does that is data not found. Not unthinkable. Not even hard to picture. Just unbuilt — the sharpest open opportunity the evidence points at, demanded and un-shipped.

The same shape governs the deepest open problem under all of it. The executed-test gate is external in mechanism but circular in ground truth when the model writes its own tests — and the field is dated, visibly, routing around that, not closing it. Execution-grounded oracle synthesis (Nexus, a preprint) reports bug-detection accuracy climbing “46.30% to 57.73%” on one benchmark; property- and example-based testing combined (Tanaka, a preprint) lifts detection from “68.75%” each to “81.25%” together. Real movement, all of it tagged preprint and unreplicated — and none of it closes the problem, because the best of it still grounds its judgment against a model-synthesized reference. Mitigation, not a solution. The oracle problem is decades old; stronger models never solved it, techniques that sidestep needing the right answer did. Same story here, still being written.


The call on the back half, and it’s the one the brand is built to make. The house is catching up — to every team, mine included — and absorption is not obsolescence. The invention layer is commoditizing: the plays you hand-built are becoming native, and you should drop the bespoke scaffolding and let the platform carry the mechanism. But the button ships with the community’s caveat printed on it, never the judgment of when the caveat bites — when a single writer beats a swarm, when a plan should be gated, how much state to externalize before a bigger window earns its cost, when a deterministic test-bind is worth more than its ceremony. That judgment is the durable edge the primitives and the standards don’t supply, and it’s the whole reason there’s a senior in this booth at all. The house caught up to my playbook this season. It did not catch up to the part of the job that reads the field and makes the call — and that part outlives any one tool, including the one I run on Tuesday. This is the upstream answer to the pipeline’s back end: get the prompt sequence and the gates right here, before any output exists, and you’ve at least loaded the dice against shipping legacy code at birth. What none of it is — not the absorbed play, not the blessed standard, not the darkhorse — is proven to win games. That’s the last call of the night, and the scoreboard still won’t settle it.


This is the fifth call in Color Code — the middle of a three-part arc on the agentic pipeline, looking upstream of Object Code at how the prompt sequence gets engineered before any output exists, and before the platform absorbs the moves that work. Previously: The Plays the Box Score Loves, on the swarm that buys speed, not quality. Next, the sign-off — every play real, converged, and platform-blessed, and not one proven to ship better code: Does Any of This Win Games?