Is This Really What You Want to Measure?
Measuring Engineering Outcomes
I’m reading C. Thi Nguyen’s new book, The Score, and one idea in it keeps coming back to me. He calls it value capture: the process by which a rich, meaningful goal gets quietly replaced by the metric you were using to track it. You start measuring something because it points toward what matters. Then, gradually, the measurement becomes what matters. The original goal doesn’t disappear. It just stops being the thing that drives decisions.
That’s not a philosophy problem. That’s every day in most engineering organizations.
What we actually measure
Let’s be clear about what passes for engineering measurement in most companies.
Output metrics are the default: story points, tickets closed, pull requests merged, lines of code written. They’re easy to collect, easy to visualize, and they feel like signal. The problem is they measure production, not value. A team can close 200 tickets in a sprint and ship nothing a customer cares about. Story points aren’t a unit of value; they’re a unit of negotiated effort, and that negotiation starts the moment someone decides to track them.
Performance metrics try to go one level deeper: code review turnaround time, sprint commitment hit rate, on-call response time, test coverage percentages. These are more interesting because they reflect process health. But they still measure fidelity to a process, not effectiveness of the work. A team can hit 95% sprint commitment every week by sandbagging estimates. Reviews can be fast because nobody’s actually reviewing.
Efficiency metrics are where most engineering organizations have landed recently, particularly DORA, the four-key-metrics framework: deployment frequency, lead time for changes, change failure rate, and time to restore service. DORA is genuinely useful as a diagnostic. The problem is what happens when it moves from a team-level health check to a leadership dashboard. Deployment frequency gets gamed by trivializing deployments. Lead time gets gamed by where you start the clock. You end up with a team deploying 15 times a day that is still six months from shipping anything meaningful.
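To make the clock-start point concrete, here’s a minimal sketch of the same two deployments measured two ways. The records, field names, and dates are all invented; the only point is that “lead time” is whatever you decide it starts from.

```python
from datetime import datetime
from statistics import median

# Invented change records: when the ticket was opened, when code work
# actually started, and when the change reached production.
changes = [
    {"ticket_created": datetime(2024, 3, 1),
     "first_commit":   datetime(2024, 4, 10),
     "deployed":       datetime(2024, 4, 12)},
    {"ticket_created": datetime(2024, 2, 20),
     "first_commit":   datetime(2024, 4, 7),
     "deployed":       datetime(2024, 4, 9)},
]

def median_lead_time_days(records, start_field):
    """Median elapsed days from `start_field` to deployment."""
    return median((r["deployed"] - r[start_field]).days for r in records)

print(median_lead_time_days(changes, "first_commit"))    # 2: looks elite
print(median_lead_time_days(changes, "ticket_created"))  # 45.5: the wait a customer actually experiences
```

Same deployments, same team; the dashboard number depends entirely on where someone chose to start the clock.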
Every one of these metric categories lives entirely inside the engineering system. None of them has a direct connection to whether the engineering organization is actually doing its job.
The game you didn’t know you were playing
Nguyen’s book makes a distinction that hits differently in an engineering context. He separates striving play from achievement play. In striving play, the goal is the engagement itself: the process, the judgment, the craft. In achievement play, the only thing that matters is the score. Great games are designed so that chasing the score also produces striving. You can’t get good at chess by gaming the scoring system; you actually have to get good at chess.
Institutional metrics are the opposite. They strip out the magic circle, Nguyen’s term for the temporary, voluntary frame that makes game constraints feel meaningful rather than oppressive. In a board game, you accept arbitrary rules because you chose to sit down and play. In a work context, those rules aren’t arbitrary and they aren’t optional. The score follows you. It shows up in your performance review. It gets presented to the board.
What’s left, once you remove the magic circle, is a system that rewards achievement play. And engineers, who are, professionally, some of the best problem-solvers in any room, find the optimal path to the score. This isn’t cynicism. It’s a completely rational response to the incentive structure you built.
The consequences are predictable:
Velocity becomes sandbagging. The moment team velocity appears on a leadership dashboard, estimation inflates. Points expand to protect the team. After a few quarters, the number is politically stable and informationally useless.
Deployment frequency rewards triviality. If deploying frequently is the metric, the rational move is to break work into smaller pieces, not because small batches are better (they often are, but for different reasons), but because each deployment ticks the counter.
Commitment rates reward conservatism. Measure whether a team delivers what they promised and they’ll promise less. You’ll see consistently green dashboards and a team that’s quietly becoming slower.
Code review speed becomes rubber-stamping. If time-in-review is visible, reviewers learn to approve fast. Technical debt accumulates invisibly while the metric looks healthy.
This is Goodhart’s Law in action: once a measure becomes a target, it ceases to be a good measure. But Nguyen’s framing adds something important: it’s not just that the metric gets corrupted. It’s that people’s values get reshaped around it. The engineers optimizing for velocity aren’t lying. They’ve internalized the metric. The metric has become, for them, what good work looks like. That’s value capture. That’s the thing that’s actually hard to fix.
What these metrics are actually telling you
This is worth being precise about, because the answer isn’t “nothing.” These metrics have legitimate uses. The failure is the mismatch between what they measure and what leaders use them to decide.
Output metrics like velocity and tickets closed can tell you whether work is flowing through the system at all. A team whose velocity drops 40% in two sprints has a problem worth investigating. What they can’t tell you is whether the work matters.
Performance metrics are operational diagnostics. Long code review cycles, high defect escape rates, chronic on-call fatigue: these are real signals about process dysfunction. Treat them as process indicators, not as performance scorecards.
DORA is pipeline health. Lead time and deployment frequency tell you something real about delivery capability. Change failure rate and MTTR tell you something about resilience. But all four are downstream of the question that actually determines whether an engineering organization is performing: are you building things that move the business, fast enough to matter?
None of the standard metrics reach that question. The reason is structural. They’re easy to collect because they live inside the engineering toolchain. Anything that requires a connection to product outcomes, customer behavior, or business results is harder, and that difficulty is precisely why it tends not to get measured.
The missing layer
The data we reach for first, the data that’s easy to collect and easy to present, systematically hides what’s actually going on. The measurement layer that’s missing isn’t more engineering metrics. It’s a feedback loop that closes outside engineering.
A few things that would actually tell you something:
Outcome linkage. Can you connect a shipped feature to a measurable change in user behavior or business results? Not “we shipped it” but “after we shipped it, the thing it was designed to move, moved.” This requires instrumentation, a documented hypothesis before work starts, and a willingness to wait. None of those come naturally to sprint planning cadences. But without them, you’re measuring production, not impact.
Flow efficiency. The ratio of value-added time to elapsed time in your delivery process is more interesting than raw velocity. A feature that takes 12 weeks from idea to production, with only 2 of those weeks involving actual engineering work, has a flow efficiency of 17% (there’s a worked version of that arithmetic just after this list). That’s a systems problem — and throughput metrics will never surface it. The bottleneck isn’t in the work; it’s in the waiting.
Technical health as a leading indicator. Complexity trends, dependency staleness, incident frequency — these are imperfect but directionally useful signals about whether your codebase is getting easier or harder to extend. Engineering organizations that ignore these tend to see velocity collapse right when the business needs them to accelerate. It’s not a coincidence.
Team capability over time. Are engineers growing? Are they staying? Are they increasingly autonomous, or increasingly dependent on a few specialists? These signals move slowly and are hard to quantify, but they’re leading indicators of whether the organization will still be functional in two years. No sprint metric captures them.
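Since the flow efficiency number above is just arithmetic, here’s a minimal worked version of the 12-week example. The stage names and durations are hypothetical; the calculation is simply value-added time divided by elapsed time.

```python
# Hypothetical timeline for one feature, from idea to production.
# Durations are in days; "active" marks stages where someone is actually working on it.
stages = [
    ("waiting for prioritization", 30, False),
    ("design and review queue",    14, False),
    ("engineering",                12, True),
    ("waiting for code review",     5, False),
    ("code review",                 2, True),
    ("waiting for release window", 21, False),
]

elapsed = sum(days for _, days, _ in stages)                     # 84 days, roughly 12 weeks
value_added = sum(days for _, days, active in stages if active)  # 14 days, roughly 2 weeks

print(f"{value_added} active days out of {elapsed} elapsed "
      f"= {value_added / elapsed:.0%} flow efficiency")
# 14 active days out of 84 elapsed = 17% flow efficiency
```

Throughput metrics would happily report the 14 active days; the 70 days of queueing never show up anywhere.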
Breaking the cycle
Swapping bad metrics for better ones doesn’t solve the problem. If you replace velocity with DORA and keep the same incentive structure, you’ll get gamed DORA metrics in six months. Nguyen is clear on this in The Score: once value capture takes hold, the answer isn’t a better score. It’s rebuilding the conditions under which people can reclaim their own values.
In an engineering context, that means a few things:
Separate diagnostic metrics from evaluation metrics. Deployment frequency is useful when a team uses it to understand their own pipeline. It becomes corrosive the moment it appears on a leadership report as a proxy for team performance. The same number. Completely different effect depending on who it’s for and what it drives.
Measure outcomes and accept the latency. This requires leaders to resist the urge to instrument everything that can be instrumented. Define what you’re trying to move before work starts. Measure it after you ship. Accept that the feedback loop is slower than a quarterly review cycle, and if your performance management cycle is shorter than your product feedback cycle, that’s the real problem to fix.
Make the system visible, not just the throughput. Flow efficiency, incident trends, and technical debt trajectories give teams and leadership a shared picture of systemic constraints. When the conversation shifts from “why didn’t you close more tickets” to “what’s blocking flow and what would it cost to fix it,” you’re at least asking questions that can produce useful answers.
Let the teams define the metrics for their own work. This one is underrated. The people closest to the work know what signals matter and which ones can be gamed. Metrics designed by a team to understand their own performance are completely different from metrics imposed from above to evaluate them. The former creates accountability. The latter creates the conditions for value capture.
Hold leadership accountable for outcome clarity. A lot of engineering metric gaming exists because the business hasn’t clearly defined what success looks like. If product leadership can’t say what a feature is supposed to change, and how they’ll know it worked, the engineering team will fill that vacuum with whatever proxy feels safest. Measurement quality is a leadership problem as much as a measurement problem.
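One lightweight way to create that outcome clarity, and to close the “define before, measure after” loop described above, is to attach a written hypothesis to significant work before it starts and only score it after the measurement window closes. Here is a minimal sketch of what such a record might look like; the fields, the feature, and the numbers are entirely hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class OutcomeHypothesis:
    """A hypothesis written down before the work starts, scored after the window closes."""
    feature: str
    metric: str                       # the user or business metric this work is supposed to move
    baseline: float                   # value before shipping
    target: float                     # what "it worked" means, agreed up front
    measure_after: date               # don't score it before this date
    observed: Optional[float] = None  # filled in once the measurement window closes

    def verdict(self) -> str:
        # Assumes the metric is one you want to increase.
        if self.observed is None:
            return "not yet measurable"
        return "moved as hypothesized" if self.observed >= self.target else "did not move"

# Hypothetical example: shipping is not the finish line; the metric moving is.
h = OutcomeHypothesis(
    feature="self-serve onboarding flow",
    metric="activation rate within 7 days of signup",
    baseline=0.22,
    target=0.30,
    measure_after=date(2025, 9, 1),
)
print(h.verdict())   # not yet measurable
h.observed = 0.31
print(h.verdict())   # moved as hypothesized
```

The value isn’t the data structure; it’s that someone has to write down the metric, the target, and the date before the work starts, which is exactly the vacuum that proxy metrics otherwise fill.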
The question the dashboard can’t answer
Look at whatever engineering metrics you’re currently tracking. For each one, ask: what decision would I make differently if this number were 20% better or worse? If the answer is “I’d evaluate someone’s performance differently,” the next question is whether you’re actually measuring what you want to optimize for, or just what’s available.
Nguyen ends The Score with the question the title comes from: is this the game you really want to be playing? Most engineering organizations never ask it. The dashboard is there. The numbers are green or red. The sprint review happens. Nobody stops to ask whether the game itself is worth playing.
Measuring the wrong things precisely is worse than not measuring. It gives you the confidence of data without the benefit of insight. The metric becomes the goal. The goal becomes the metric. The original purpose, building something that matters, gets quietly replaced. And everyone in the room is looking at the dashboard, wondering why the product still isn’t getting better.

