DORA Metrics in the AI Era: Why Developer Productivity Frameworks Need a Reboot
DORA metrics assume deployment velocity correlates with value. AI tools broke that correlation. I explain why.
Linus Torvalds was recently asked about lines of code as a productivity metric. His response? “Lines of code are not something you should ever attribute any importance to.” When asked about companies firing developers for not writing enough code, he was blunter: “Anybody who thinks that’s a valid metric is too stupid to work at a tech company.” If lines of code never worked, what makes us think deployment frequency or commit velocity are any better in the AI era?
The Productivity Measurement Landscape
Before I tear these frameworks apart, let me explain what engineering leaders are actually using today. After coordinating 3,000+ technology professionals across major transformation programs, I have seen every flavor of developer productivity measurement.
DORA metrics (from DevOps Research and Assessment) remain the gold standard. Four metrics that supposedly predict elite team performance:
Deployment Frequency: How often you ship to production
Lead Time for Changes: Time from commit to production
Mean Time to Recovery (MTTR): How fast you recover from failures
Change Failure Rate: Percentage of deployments causing incidents
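All four metrics reduce to simple arithmetic over a deployment log. A minimal sketch of how they are typically computed, assuming an invented record format rather than any official DORA tooling:

```python
from datetime import datetime

# Hypothetical deployment records: commit time, deploy time, and outcome.
# The field names here are invented for illustration.
deployments = [
    {"committed": datetime(2025, 1, 6, 9), "deployed": datetime(2025, 1, 6, 11),
     "failed": False, "recovered": None},
    {"committed": datetime(2025, 1, 7, 9), "deployed": datetime(2025, 1, 7, 13),
     "failed": True, "recovered": datetime(2025, 1, 7, 14)},
    {"committed": datetime(2025, 1, 8, 9), "deployed": datetime(2025, 1, 8, 10),
     "failed": False, "recovered": None},
    {"committed": datetime(2025, 1, 9, 9), "deployed": datetime(2025, 1, 9, 12),
     "failed": False, "recovered": None},
]

window_days = 7

# Deployment frequency: deployments per day over the window.
frequency = len(deployments) / window_days

# Lead time for changes: mean hours from commit to production.
lead_hours = sum(
    (d["deployed"] - d["committed"]).total_seconds() / 3600 for d in deployments
) / len(deployments)

# Change failure rate: share of deployments that caused an incident.
failures = [d for d in deployments if d["failed"]]
failure_rate = len(failures) / len(deployments)

# MTTR: mean hours from failed deployment to recovery.
mttr_hours = sum(
    (d["recovered"] - d["deployed"]).total_seconds() / 3600 for d in failures
) / len(failures)
```

Note what the inputs are: timestamps and pass/fail flags. Nothing in this arithmetic knows whether the changes were meaningful, which is the whole argument of this article.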
SPACE Framework (from Microsoft Research) tries to be more holistic:
Satisfaction: Developer happiness and fulfillment
Performance: Outcomes and impact
Activity: Observable actions (commits, PRs, deployments)
Communication: Collaboration quality
Efficiency: Flow state, minimal interruptions
DevEx Framework focuses on three dimensions: feedback loops, cognitive load, and flow state. It acknowledges that productivity is about the experience, not just the output.
DX Core 4 (launched December 2024) attempts to unify all previous frameworks into four counterbalanced dimensions:
Speed: How quickly development work progresses and is delivered
Effectiveness: Productivity and output relative to resources invested
Quality: Reliability and stability of software being produced
Business Impact: Alignment with organizational goals and revenue outcomes
DX Core 4 is the newest framework, developed in collaboration with the original DORA, SPACE, and DevEx authors. It explicitly acknowledges AI coding tools like Copilot and claims to measure “AI impact” through usage analytics and time savings.
This is progress. At least someone is paying attention to AI.
But is it enough?
What Managers Actually See
Here is what these frameworks look like from a manager’s dashboard. I have sat in countless steering committees where these numbers get presented like gospel.
DORA tells managers: “Your team deployed 47 times last week. Lead time is down to 2 hours. You are elite performers.”
SPACE tells managers: “Developer satisfaction is 4.2/5. Activity metrics show 23 PRs merged per developer.”
DevEx tells managers: “Cognitive load scores are acceptable. Flow state interruptions average 3.2 per day.”
DX Core 4 tells managers: “Speed score: 8.2/10. Effectiveness: 7.8/10. Quality: 9.1/10. Business Impact: Strong alignment. AI tools saved 4.2 hours per developer this week.”
The numbers look scientific. They create beautiful charts. They give executives something concrete to discuss in quarterly reviews.
The fundamental assumption behind DORA? Deployment velocity (frequency and speed) correlates with value delivery. Teams that deploy faster and more frequently deliver more value.
That assumption was reasonable when effort and velocity were tightly coupled. AI broke that coupling. You can now have velocity without proportional effort, and effort (prompt engineering) without visible velocity.
AI Coding Behaviors That Break Everything
Let me show you what is actually happening in development teams right now. These are not edge cases. These are daily realities.
Copilot and Cursor Code Generation
A developer using GitHub Copilot can accept 30–40% of suggested code. That is not their code. That is not their thinking. That is AI generating boilerplate, tests, and implementations based on patterns.
Deployment frequency goes up. Lead time goes down. But did the developer actually solve a harder problem? Or did they just accept more suggestions?
When I architected 14 GDPR and ISO 27001 compliant platforms, the hard work was never the code volume. It was understanding the constraints, navigating the trade-offs, and making decisions that would not blow up two years later.
Vibe Coding: The New Development Pattern
This is the one that really breaks DORA. “Vibe coding” is what happens when developers use AI as an experimental partner. They try something. AI generates it. They test it. It fails. AI fixes it. They iterate.
The commit history shows incredible activity. Deployment frequency looks elite. But the underlying work pattern is fundamentally different from traditional software engineering.
It is less “design, implement, verify” and more “prompt, generate, experiment, refine.”
Is that worse? Not necessarily. But it is immeasurable by current frameworks.
ChatGPT Copy-Paste Without Attribution
This is the dark matter of modern development. Developers copy solutions from ChatGPT or Claude into their codebase. There is no commit message saying “solution from AI.” There is no way to attribute the intellectual work.
Your DORA metrics show a developer who ships fast. Your reality is a developer who is a sophisticated copy-paste operator with good judgment about what to paste.
The judgment part is valuable. The measurement misses it entirely.
Have you seen these patterns in your organization? I want to know if I am the only one watching this unfold.
Prompt Engineering as Core Skill
The qualified developers I work with now spend significant time crafting prompts. Not writing code. Crafting prompts.
They understand how to decompose problems for AI consumption. They know how to iterate on prompts to get better outputs. They can spot when AI is hallucinating and course-correct.
None of this shows up in any productivity framework.
Deployment frequency measures what got shipped. It does not measure the prompt engineering skill that made it possible. Lead time measures speed. It does not measure the strategic thinking about how to leverage AI effectively.
The Linux Kernel Paradox
Here is something that should make every DORA advocate uncomfortable.
The Linux kernel has been releasing on a 9-week cycle for over 20 years. That is approximately 5.7 releases per year. By DORA standards, this is nowhere near “elite” performance. Elite teams deploy multiple times per day.
Let me give you the numbers that matter:
The Linux kernel powers over 3 billion Android devices. It runs on 96% of the world’s top 1 million servers. It operates the International Space Station, the Large Hadron Collider, and the majority of cloud infrastructure that runs the internet.
Over 30,000 developers from 500+ companies have contributed to the kernel. Linus Torvalds himself barely writes code anymore. He sends email snippets asking others to implement the final tested version.
The Business Outcomes Tell a Different Story
In 1991, Linus started Linux as a hobby project. In 2005, he created Git because he needed better version control for Linux development. Both projects followed the same “slow by modern standards” development approach.
Fast forward to the business impact:
In 2018, Microsoft acquired GitHub (built on Git) for $7.5 billion. In 2019, IBM acquired Red Hat (built on Linux) for $34 billion. SUSE, Canonical, and countless other companies built billion-dollar businesses on top of Linux.
But the economic multiplier effect is where it gets staggering. Every major cloud provider (AWS, Google Cloud, Azure) runs on Linux infrastructure. The entire Android ecosystem (3 billion+ devices) runs on the Linux kernel. An estimated 90% of public cloud workloads run on Linux. The global cloud infrastructure market exceeded $500 billion in 2024, and Linux powers the foundation.
Git, meanwhile, has become the universal version control system. Every software company, every startup, every open source project uses Git. The productivity gains from distributed version control have enabled the entire modern software development industry.
Two projects. One developer who “barely writes code anymore.” Both rated as “low performers” by DORA standards. Combined economic impact measured in trillions of dollars.
So what would DORA metrics say about Linux kernel development?
Deployment Frequency: 5.7 releases per year. Low performer.
Lead Time for Changes: Features can take 6–18 months from initial patch submission to mainline inclusion. Multiple review cycles. Extensive testing. By DORA standards? Embarrassingly slow.
Change Failure Rate: The stable kernel releases have extremely low failure rates, but the development process is conservative and cautious. Not the “move fast and break things” that DORA velocity metrics reward.
MTTR: When critical bugs are found, the community responds quickly. But the process is distributed, consensus-driven, and involves thorough analysis before fixes. Not the automated rollback that modern DORA dashboards celebrate.
By the metrics we use to judge “elite” engineering teams, Linux kernel development would be rated as a low to medium performer.
Yet it is arguably the most successful, most impactful, most stable software project in human history.
Before you dismiss this as an unfair comparison, let me be clear: DORA metrics claim to be universal. The original DORA research and the “Accelerate” book never said “only apply these to commercial startups with venture funding.” The framework claims to measure software delivery performance regardless of context — commercial, open source, enterprise, or startup.
If DORA cannot distinguish between “deliberate, high-impact slow” (Linux) and “dysfunctional slow” (a struggling team), then DORA is measuring the wrong things. The framework is context-blind, treating all velocity as equal and all slowness as failure.
This is not an AI problem. This is a fundamental problem with activity-based metrics measuring the wrong things.
When Torvalds dismisses lines-of-code metrics, he is speaking from experience building something that DORA would rate as mediocre while it literally powers the modern world.
The Measurement Reality Gap
Picture the two eras side by side: traditional measurement captured the whole workflow. AI-era measurement captures only the final step.
The gap is where the real work happens now.
Are These Frameworks Deprecated or Gaming Targets?
Let me take a strong position: DORA metrics are fundamentally broken for AI-assisted development. Not “need adjustment.” Broken.
Here is why:
Gaming is Now Trivially Easy
Want better deployment frequency? Accept more AI suggestions, deploy smaller changes, automate more. Your metrics look elite. Your actual problem-solving ability? Unmeasured.
Want faster lead time? Let AI write your tests. Auto-approve with AI code review tools. Your pipeline looks incredible. Your code quality? Unknown.
I have seen teams achieve “elite” DORA status while shipping mediocre products. I have seen “low performer” teams doing groundbreaking architectural work that DORA cannot capture.
The Cheating Code Exists
Developers are smart. When you measure deployment frequency, they deploy frequently. When you measure commits, they commit often. When you measure PRs, they create more PRs.
AI makes this gaming even easier. You can generate plausible-looking commits in seconds. You can create PRs that pass automated checks without human thought.
The framework becomes Goodhart’s Law incarnate: “When a measure becomes a target, it ceases to be a good measure.”
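Goodhart’s Law is easy to demonstrate against deployment frequency itself. Slice one meaningful change into ten cosmetic deployments and the metric improves tenfold while delivered value stays flat. A toy sketch, with an invented `value_delivered` field standing in for actual business impact:

```python
# One meaningful change, shipped as a single deployment.
honest = [{"value_delivered": 10}]

# The same change, sliced into ten trivial deployments to game the dashboard.
gamed = [{"value_delivered": 1} for _ in range(10)]

def deployment_frequency(deploys, window_days=7):
    """Deploys per day over the measurement window."""
    return len(deploys) / window_days

def total_value(deploys):
    """What actually reached users (hypothetical field, for illustration)."""
    return sum(d["value_delivered"] for d in deploys)
```

Running this: the gamed team shows 10x the deployment frequency of the honest team, and `total_value` is identical for both. The dashboard cannot tell them apart from the metric alone.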
Managers Cannot Keep Up
Most engineering managers learned their craft before AI coding tools existed. They understand traditional productivity patterns. They do not understand the quality of prompt engineering or AI curation skills.
When a developer ships fast using AI, the manager sees speed. They cannot evaluate whether that speed came from skilled AI usage or mindless acceptance.
This is not an insult to managers. This is a structural problem. We lack the vocabulary and frameworks to discuss AI-assisted productivity.
What About DX Core 4? Is It Close Enough?
DX Core 4 deserves credit. It acknowledges AI exists. It includes “Business Impact” as a core dimension. It attempts to measure AI tool effectiveness through usage analytics and time savings.
This is closer to what we need than DORA’s blind worship of velocity.
But here is the problem: DX Core 4 treats AI as a productivity tool, like a better IDE or a faster compiler. It measures time saved. It measures code acceptance rates. It measures deployment improvements.
It still measures AI as an output accelerator, not a fundamental transformation of what development work means.
The framework launched in December 2024, right as AI coding tools were exploding. The timing suggests they were responding to AI, not anticipating it. And the response is to measure AI’s impact on existing metrics (speed, effectiveness) rather than questioning whether those metrics still matter.
When DX Core 4 measures “AI tools saved 4.2 hours per developer this week,” what is it actually measuring? Time saved accepting Copilot suggestions. Time saved copying from ChatGPT. Time saved on boilerplate generation.
What it is NOT measuring:
The quality of prompt engineering that made those 4.2 hours possible. The strategic decisions about when to use AI and when not to. The curation judgment that separates good AI suggestions from garbage. The invisible iterations before the final commit. The new cognitive work of orchestrating AI rather than writing code.
DX Core 4 is progress. But it is measuring the shadow, not the substance.
What Comes Next
I am not going to pretend I have the perfect replacement framework. I do not. But I can tell you what needs to change.
Outcome-Based Measurement Over Activity
Stop measuring how often you deploy. Start measuring what impact those deployments had. Did user satisfaction improve? Did revenue metrics move? Did you solve the actual problem?
This is harder. It requires connecting development activity to business outcomes. But it is the only thing that matters.
AI Leverage Quotient
We need a way to measure how effectively developers use AI tools. Not how much they use them, but how effectively.
The best developers I know can get AI to solve problems that average developers cannot even prompt correctly. That skill difference is invisible to current metrics.
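There is no standard formula for such a quotient today. Purely as a hypothetical sketch of the shape such a metric might take, one could relate problems actually solved to prompting effort spent, discounted by curation quality. Every name and input below is invented, not a real measurement from any tool:

```python
def ai_leverage_quotient(problems_solved, prompt_iterations,
                         suggestions_accepted, suggestions_reverted):
    """Hypothetical metric: outcomes per unit of AI interaction,
    penalized when accepted suggestions later get reverted."""
    if prompt_iterations == 0:
        return 0.0
    # Curation quality: share of accepted suggestions that survived.
    kept = suggestions_accepted - suggestions_reverted
    curation = kept / suggestions_accepted if suggestions_accepted else 0.0
    # Leverage: problems solved per prompt iteration, scaled by curation.
    return (problems_solved / prompt_iterations) * curation
```

A developer who solves four problems in eight prompt iterations, keeping eight of ten accepted suggestions, scores 0.4; a developer who burns fifty iterations on the same four problems and reverts half of what they accept scores far lower, even though both look identical on a deployment-frequency dashboard. The point is not this particular formula; it is that effectiveness, not usage volume, is the thing worth capturing.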
Curation Quality Over Code Quantity
The job of a developer is increasingly curation: knowing what to accept, what to reject, and what to modify from AI suggestions.
We should measure the quality of those curation decisions. Not the volume of code shipped.
Sustainable Delivery Over Speed
DORA fetishizes speed. Lead time down. Deployment frequency up. Recovery fast.
But sustainable delivery means systems that do not need constant recovery. It means architectures that can evolve without continuous firefighting. It means code that the next developer can actually understand.
The Linus Lesson
Torvalds called lines-of-code measurement “too stupid to work at a tech company.” He explained that Linux has 5 million lines of code for hardware descriptions, but they are generated from hardware specs that AMD provides. The line count means nothing about productivity.
The same logic applies to DORA metrics in the AI era.
Deployment frequency means nothing when AI can generate deployable code in seconds. Lead time means nothing when the actual work is invisible prompt engineering. MTTR means nothing when AI can write recovery scripts faster than humans can diagnose problems.
The metrics were designed for a world where code was handcrafted. That world is disappearing.
After 20+ years building systems across telecommunications, digital health, media, and conversational AI, I have learned one consistent lesson: the metrics that matter are the ones closest to actual business outcomes, not the ones easiest to automate.
Experience matters more than articles. If you are measuring developer productivity, stop looking at dashboards and start looking at outcomes. Talk to your users. Check your revenue. See if the problems actually got solved.
The frameworks will catch up eventually. Until then, trust your judgment over your metrics.
What productivity patterns are you seeing with AI-assisted development in your organization? I am genuinely curious if others are experiencing the same measurement breakdown.
I am a human writer who gets motivated to write more with your support! You don’t need to pay. I just need your clap 👏 if you like my story and comment ✍️ if you want to say something. You can follow me on Medium, LinkedIn, Instagram, and X.

