
The Productivity Mirage

What AI coding tools actually do to developer output — and the 39-point gap between perception and reality.

Structured from five source documents spanning the AI-impact and market-landscape research lenses. Source-reviewed, fact-reviewed, and gap-reviewed before publication. Contested evidence is presented throughout with source quality and vendor affiliation noted.

The numbers tell two stories at once. Nearly every developer has adopted AI coding tools — 84%, with more than half using them daily. Barely any of them believe what the tools produce: only 3% report high trust in the output. And the gap between feeling and measurement is wider than either number suggests. Developers who use AI assistance believe they are 20% faster. The most rigorous independent study — a randomized controlled trial — found they were 19% slower.

Thirty-nine percentage points between perception and reality. It is the single most important finding in the AI productivity literature, and almost no one in the industry is talking about it.
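The 39-point figure is simple arithmetic, but it is worth making explicit: the gap spans from the perceived effect (+20%) down past zero to the measured effect (-19%). A back-of-envelope check, using only the numbers reported above:

```python
# Perceived vs. measured productivity effect, in percentage points.
# Figures are the ones cited above: developers' self-estimate of +20%
# versus METR's measured -19% (i.e., 19% slower).
perceived_change = +20
measured_change = -19

gap = perceived_change - measured_change
print(gap)  # 39 percentage points between perception and reality
```

The gap is larger than either number alone because perception and measurement sit on opposite sides of zero.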

The Stack Overflow 2025 developer survey — approximately 49,000 respondents — established the adoption numbers: 84% using or planning to use AI tools, 51% using them daily. The trust figure comes from the same survey. The perception gap comes from METR, an independent research organization that ran what remains the most methodologically rigorous study of AI-assisted development to date. Sixteen experienced open-source developers. Two hundred forty-six real issues on real codebases. Randomized assignment. Enough statistical power to reject the null hypothesis of zero effect.

The developers thought they were faster. The data showed the opposite. And the magnitude was not subtle.


What the independent evidence shows

The METR study is not the only research finding null or negative productivity effects. It is the best-designed, and it sits at the center of a pattern.

Microsoft ran a three-week internal study measuring telemetry metrics — time spent coding, pull request activity, output cadence — before and after Copilot adoption. The result: no statistically significant changes in any measured dimension. Developers self-reported feeling more productive. The instruments did not confirm it.

Faros AI conducted the largest study to date — 10,000+ developers across 1,255 teams — and found that teams with heavy AI use completed 21% more tasks and merged 98% more pull requests. Those numbers look like gains until you read the next line: PR review times ballooned 91%. At the organizational level, Faros found “no significant correlation between AI adoption and improvements at the company level” across DORA metrics, overall throughput, or quality KPIs. More code shipped. The same amount of software got delivered.

The pattern across independent studies is consistent: AI tools produce more output — more lines, more commits, more pull requests — without producing more outcomes. The bottleneck does not disappear. It moves. Writing code gets faster. Reviewing, debugging, and maintaining the increased volume of code absorbs the time saved.

[Chart: measured productivity effects by study. Independent studies (left three) show null-to-negative results; vendor-affiliated studies (right two) show gains. Both have methodological limitations, but the divergence is striking. Sources: METR (2025), DX Newsletter, Faros AI (2025), Microsoft Research, Google/arXiv.]

What the vendor evidence shows

The positive productivity findings come almost entirely from studies funded by or affiliated with the companies that sell the tools.

The GitHub/Microsoft randomized controlled trial — published as an academic paper but funded by GitHub — found developers completed an HTTP server task 55.8% faster with Copilot. The gain is real, but the task was a single, well-defined implementation exercise in JavaScript. Less experienced developers and older programmers benefited most. Whether that result extends to multi-file refactoring, legacy codebase navigation, or ambiguous requirements is an open question the study does not address.

A Microsoft/Accenture field experiment found developers completed 12.9%–21.8% more pull requests per week at Microsoft and 7.5%–8.7% at Accenture after adopting Copilot. The authors themselves noted that both experiments were “poorly powered” — a significant qualification for a study claiming to measure productivity gains.

Google’s internal study, published on arXiv, found developers using AI IDE features completed tasks 21% faster. The most notable finding: senior developers saw the largest gains, challenging the assumption that AI primarily benefits juniors.

These studies are not fabricated. The gains they measure are likely real — for the specific tasks, in the specific conditions, over the specific time horizons measured. The problem is extrapolation. A 55.8% speed improvement on a greenfield HTTP server implementation does not mean 55.8% faster software delivery. Every study that measured organizational-level outcomes found the gain evaporated before it reached the product.


The bottleneck shift

Why does more code output not produce more software delivery? The Faros AI study offers the clearest answer: a 91% increase in PR review times.

AI tools are extremely good at generating code. They generate it fast, in volume, and with a surface plausibility that makes it easy to accept. But the code requires review. CodeRabbit, which sells AI-powered code review tools, analyzed 470 open-source pull requests and found a consistent quality deficit in AI-generated code: an average of 10.83 issues per AI-authored PR versus 6.45 for human-written ones. Logic and correctness errors rose 75%. Security vulnerabilities were up to 2.74 times higher. Performance inefficiencies appeared nearly 8 times more often.

GitClear’s analysis found that code refactoring — the practice of improving existing code structure — dropped from 25% of changed lines in 2021 to under 10% by 2024. Code duplication increased approximately fourfold. The codebase is growing faster while being maintained less carefully.

Veracode’s 2025 GenAI Code Security Report found that 45% of AI-generated code introduces security vulnerabilities, including critical OWASP Top 10 flaws.

A caveat on these sources is necessary and important: CodeRabbit, GitClear, and Faros AI all sell tools that benefit commercially from a narrative in which AI-generated code requires more monitoring and review. The directional finding — quality is degrading — is corroborated across multiple independent data points and consistent with what developers report experiencing. But the specific magnitudes should be treated as illustrative rather than precise. When three companies that sell monitoring tools all find that code needs more monitoring, the incentive alignment should be noted.

That said, the quality story has independent corroboration. The Stack Overflow 2025 survey found that 66% of developers cited solutions that are “almost right, but not quite” as their biggest AI frustration. Positive sentiment for AI tools dropped from above 70% in 2023–2024 to 60% in 2025. Only 3% of developers highly trust AI output. These are not vendors with a commercial interest. These are developers describing their own experience.

The picture that emerges: AI has not eliminated the bottleneck in software development. It has relocated it. The constraint used to be writing code. Now it is reviewing code, debugging code, and maintaining the growing volume of code that AI produces faster than humans can verify. A Fastly survey of 791 developers found that senior engineers report burnout from reviewing AI-generated code that does not work — and, counterintuitively, that seniors are 2.5 times more likely than juniors to ship code where over half is AI-generated (33% of seniors versus 13% of juniors). The developers with the most experience are leaning hardest on the tools while simultaneously reporting the most fatigue from cleaning up after them.


The perception gap

The METR finding — developers 19% slower but believing they were 20% faster — deserves more than a data point. It deserves an explanation.

METR’s study design was rigorous: experienced open-source contributors working on their own codebases, randomized into AI-assisted and unassisted conditions, across 246 real issues. The researchers found that developers accepted less than 44% of AI-generated code, spent 9% of their time reviewing AI output, and 4% of their time waiting for AI generations. The overall result was a statistically significant slowdown.

But METR, to their credit, published a February 2026 update acknowledging significant limitations. Approximately 30–50% of developers declined to submit tasks they expected AI to significantly help with, creating a selection bias that excluded the tasks most likely to show an AI benefit. METR explicitly states their original result is “likely a lower-bound on the true productivity effects” and that developers are “more sped up from AI tools now — in early 2026 — compared to our estimates from early 2025.”

This is honest science, and it is worth noting precisely because it complicates the narrative. The METR study probably understates AI’s benefit. It may understate it significantly. But the perception gap — the chasm between what developers believe AI does for them and what the best measurement shows — is real regardless of where the true productivity number lands. Even if the true effect is a modest gain rather than a loss, the 39-point gap between perceived and measured effect still exists. Developers are miscalibrating their own performance, and that miscalibration has consequences.

When engineering managers make hiring decisions based on the assumption that AI makes each developer 20–55% more productive, they are making decisions based on the perceived effect, not the measured one. When companies announce that AI productivity gains justify headcount reductions — as Salesforce’s CEO did when he announced no new software engineer hiring in 2025, citing 30% productivity gains — they are acting on estimates that independent measurement has not confirmed at the organizational level.

The perception gap is not a curiosity. It is a mechanism. It is how the labor market can simultaneously produce mass adoption (84%) and null organizational results: everyone feels more productive, so no one questions the headcount decisions that follow.


What developers actually use — and what they refuse

The adoption numbers obscure a crucial distinction. “Using AI tools” covers an enormous range of behavior, and most developers are using a narrow slice of what is available.

GitHub Copilot crossed 20 million all-time users in July 2025 and holds roughly 42% market share among paid AI coding tools. Cursor — the breakout competitor — reached $2 billion in annualized revenue within 18 months, valued at $29.3 billion. The JetBrains 2025 survey found 85% of developers regularly use AI tools and 62% rely on at least one AI coding assistant.

But Faros AI’s research revealed that most developers use only autocomplete features. Advanced capabilities — agentic coding, multi-file generation, autonomous issue resolution — remain largely untapped. The Stack Overflow survey found 52% of developers do not use AI agents at all, and 38% have no plans to adopt them.

More revealing is what developers actively reject. Seventy-six percent will not use AI for deployment and monitoring. Sixty-nine percent decline AI integration for project planning. These are not marginal features being ignored. They are the categories of work where judgment, context, and consequence matter most — and developers have decided, by wide margins, that AI is not trustworthy there.

Copilot now contributes 46% of all code written by its active users, up from 27% at launch. If the quality-degradation findings hold directionally — and every study suggests they do — the scale of accumulated technical debt is significant. Nearly half the code being written is AI-generated, and the people writing it trust it less than they trust their own output by a wide margin.


The honest synthesis

The evidence on AI developer productivity genuinely conflicts, and resolving the conflict dishonestly — picking a side for convenience — would disrespect both the research and the reader. Here is what can be said with confidence and what cannot.

What the evidence supports: AI tools reliably accelerate bounded, well-defined tasks — boilerplate generation, autocomplete, test scaffolding, documentation, single-file bug fixes. For these tasks, the gains are real, measurable, and not seriously contested. Developers who use autocomplete and code generation for routine work are completing those specific tasks faster.

What the evidence contests: Whether those task-level gains translate into faster software delivery at the team or organizational level. The best independent evidence says they do not — that increased output is absorbed by increased review, debugging, and maintenance. The vendor-affiliated evidence says they do, but from studies with acknowledged methodological limitations, measuring shorter time horizons, and funded by companies with a commercial interest in the result.

What the evidence strongly suggests: Code quality is degrading. The specific magnitudes are uncertain — the primary sources have commercial interests — but the direction is corroborated by developer sentiment (3% high trust, 66% frustrated by “almost right” output, positive sentiment declining). The bottleneck in software development has shifted from code generation to code verification.

What the evidence cannot resolve: Whether AI will eventually deliver the organizational-level productivity gains that current studies fail to find. The tools are improving. The METR researchers themselves say their results are likely a lower bound. The question is whether improving tools will close the gap between task-level speed and organizational-level delivery, or whether the review and maintenance overhead scales proportionally with output — in which case the bottleneck simply moves forever.


The professional implications

The productivity mirage has practical implications that cut differently depending on where you sit.

The perception gap is a professional hazard. If the METR finding generalizes — and 39 percentage points of miscalibration is not a rounding error — then your intuition about how AI affects your own productivity is unreliable. This does not mean AI is useless. It means your sense of how much it helps you is probably inflated, and decisions based on that inflated sense (about how many tools to adopt, how much to invest in learning them, how to structure your work) may be miscalibrated. The corrective is measurement, not intuition. Time your work. Compare outputs. Be honest about what the tools actually changed.
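The self-measurement the paragraph above recommends does not require instrumentation, just logged durations under each condition. A minimal sketch, with made-up numbers purely for illustration (your own logs would replace them):

```python
# Illustrative self-measurement: compare logged task durations with and
# without AI assistance. The timings below are invented for the example;
# in practice you would record real minutes-per-task under each condition.
from statistics import mean

with_ai = [48, 62, 35, 71, 55, 40]     # minutes per comparable task
without_ai = [52, 60, 41, 68, 54, 45]

# Positive = measured speedup; negative = measured slowdown.
speedup_pct = (mean(without_ai) - mean(with_ai)) / mean(without_ai) * 100
print(f"measured change: {speedup_pct:+.1f}%")  # prints "measured change: +2.8%"
```

Even this crude paired comparison is more informative than intuition: with only a handful of tasks the measured effect is often far smaller than the felt one, which is exactly the miscalibration the METR study documented.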

The bottleneck shift is a career signal. If the constraint in software development is moving from writing code to verifying code, the skills that gain value are the ones on the verification side: code review, debugging, security analysis, architectural assessment, and the judgment to know when AI-generated output is subtly wrong. The skills that lose value are the ones AI handles well: boilerplate generation, syntax-level implementation, mechanical translation of specifications into code. This is the same dividing line — knowing why the system behaves a certain way versus knowing how to make it behave — that “Sixteen Funerals” traced across 150 years of historical precedent, operating in the present tense.

The quality debt is accumulating. If AI-generated code consistently carries more defects per pull request, if refactoring has dropped below 10% of changed lines, if code duplication has quadrupled — and if nearly half the code your team produces is AI-generated — then a maintenance reckoning is coming. It may surface as escalating bug rates, security incidents, or architectural brittleness. Developers who can diagnose and remediate that debt will be in demand. The debt itself is an argument for experienced developers, not against them.

The vendor marketing is running ahead of the evidence. When a CEO announces “30% productivity gains” to justify a hiring freeze, ask what the measurement methodology was. When a tool vendor cites a 55% speed improvement, ask on what task, for how long, and at what organizational scale. The evidence does not support the productivity claims being used to make hiring decisions across the industry. It may support them eventually. It does not support them now. The gap between what is claimed and what is measured is wide enough to build bad strategy on — and many companies are doing exactly that.

The mirage is not that AI tools are useless. They demonstrably are not. The mirage is that their effect on individual tasks — real, visible, felt — translates proportionally to organizational outcomes. The best evidence says it does not. That gap between task-level perception and system-level reality is where careers get miscalculated, hiring decisions get distorted, and technical debt accumulates unseen.

The stopwatch does not care what you believe.