Steve Whittle

Counting Tokens

Steve Whittle — Thu, 07 May 2026 20:37:14 GMT

I’ve been building AI-powered features for the past several months, and one thing that became very clear very quickly is that token costs can’t be an afterthought. They’re a design constraint. When you’re running a healthcare platform with RAG pipelines, context-heavy prompts, and multiple LLM calls per user interaction, the bill is a direct reflection of how carefully you thought about what you’re actually sending to the model.

Most of the conversations I’ve seen about LLM cost optimization focus on the easy answer: pick a cheaper model. That’s the wrong starting point. Before you reach for a smaller model, you need to understand where your tokens are actually going. Otherwise, you’re optimizing blind.

First, understand your token budget

The single most useful shift in mindset is moving from “which model is cheaper?” to “how many tokens does this job actually need?”

Every LLM call has a token budget made up of roughly four buckets: system prompt, user/conversation input, retrieved context (if you’re doing RAG), and model output. Most engineering teams can tell you which model they’re on. Not many of them can tell you what percentage of their token spend is coming from each bucket, per endpoint, per feature, per user segment.

That’s the first thing to fix. Instrument your app to log tokens in and out at the endpoint level. You’ll almost always find that 20% of your calls are responsible for 80% of your token spend, and those hotspots are where optimization actually pays off. Spending a week shaving 10% off a low-volume admin workflow is a distraction. Shaving 30% off your highest-volume user-facing is much more meaningful.

The input side

Once you know where the tokens are going, the levers on the input side are fairly well-understood. The question is which ones are worth pulling.

Tighten your prompts. This sounds obvious but most system prompts accumulate cruft over time. Redundant framing, hedging language, politeness conventions the model doesn’t need. Audit yours with fresh eyes. “Summarize in 3 bullets” costs fewer tokens than a paragraph explaining what a good summary looks like. Directive beats descriptive.

Split your system prompt. If you have a large system prompt covering multiple capabilities, stop sending all of it on every call. Break it into a core section and optional sections, and load only what’s relevant for the task at hand. A prompt that’s always 3,000 tokens because it contains instructions for five features, when any given call only needs one of them, is something you can fix.

Tune your RAG retrieval. RAG is great, but undisciplined RAG drags enormous amounts of context into every call. Smaller chunk sizes, fewer retrieved chunks, a reranker to prioritize the most relevant passages, and deduplication to eliminate near-identical chunks all compound. You often don’t need half the document. You need the three paragraphs that actually answer the question.

Prune conversation history. Long chat threads accumulate fast. Rather than passing the full conversation on every turn, summarize earlier turns and keep a running state-of-the-world summary plus the last N interactions. The model doesn’t need a verbatim transcript of everything that happened three turns ago.

Preprocess before tokenizing. Strip content before it hits the model. Repeated disclaimers, HTML chrome, email signatures, standard headers. Anything that’s structurally consistent and informationally irrelevant to the task should be removed in code, not sent to the model and silently ignored.

The output side

Output tokens get less attention but they’re directly controllable.

Explicitly bound your responses. “At most 5 bullet points” or “under 150 words” is more effective than “be concise.” The latter is a suggestion; the former is a constraint. Use structured outputs (JSON, fixed schemas) wherever the output is machine-consumed. If you only need a label, a boolean, or a category, don’t ask for prose.

For user-facing answers, consider a layered approach: get a compact answer first, and only fetch a longer explanation on demand. Most users don’t need the full response most of the time.

The infrastructure layer

Beyond prompt engineering, there are a few architectural levers that can meaningfully change the economics.

Prompt caching. If your provider supports it, turn it on. Repeated system prompts and long shared documents shouldn’t be re-billed on every call. This is one of the highest-leverage, lowest-effort optimizations available. The work is mostly configuration, not engineering.

Semantic caching. For support questions, FAQs, and any domain with high query repetition, a semantic cache at the application layer can eliminate a large percentage of redundant model calls entirely. Vector similarity against past queries, reuse or lightly edit the cached response.

Model routing. Not all calls are equal. Route simple, well-defined tasks to smaller, cheaper models by default. Reserve the premium models for complex reasoning, high-stakes content, or tasks where quality differences are genuinely user-visible. The key is measuring this. A cheaper model that needs retries or follow-up calls can easily cost more than the premium model would have.

What doesn’t work

A few things that seem like good ideas and aren’t:

Adding “be concise” to every prompt and expecting large savings. It helps a little. The variance is high and the model will still expand if the rest of your prompt gives it room to.

Blindly truncating inputs. “Just take the first 4K tokens” sounds pragmatic but often drops critical context and degrades quality in ways that are hard to notice until something goes wrong.

Naive string compression: stripping punctuation, collapsing whitespace. This makes text less legible to the model and can actually increase token count due to how subword tokenization works. The model’s tokenizer and your intuitions about “shorter text” don’t always agree.

Swapping to a cheaper model without measuring tokens-per-task and quality first. A weaker model that requires multiple retries or produces output that needs downstream correction can easily end up costing more than staying on the better one.

A note on TOON and structured formats

There’s a newer format worth knowing about called TOON (Token-Oriented Object Notation). The idea is to strip JSON’s syntactic overhead, quotes, braces, repeated keys, by declaring keys once, like a header row. For flat, table-like data (user lists, product catalogs, RAG reference chunks, uniform agent outputs), benchmarks report roughly 30-60% prompt-token reduction with equal or slightly better accuracy on structured retrieval tasks.

It’s a real technique, but it comes with caveats. For deeply nested or irregular objects, the savings shrink or reverse. JSON is deeply embedded in LLM training data; TOON is a format the model hasn’t seen nearly as much, which introduces some fragility without fine-tuning. You’re also adding a serialization layer with converters, validation, and debugging in a nonstandard format that has real engineering cost.

The right framing for TOON: keep JSON in your APIs and storage. Use TOON only at the LLM boundary, and only when your prompts are heavy on structured data and the token savings are large enough to justify the complexity. It’s a specialized tool, not a universal one.

When not to optimize

For small documents, low-volume internal tools, or safety-critical tasks, it’s often better to overspend on tokens than to risk degraded quality or missed edge cases. Token optimization has a cost in engineering time, added complexity, and potential quality tradeoffs. The ROI only makes sense at the hotspots.

The teams that get this right aren’t the ones that optimize everything. They’re the ones that instrument first, identify the real cost drivers, and apply targeted effort where it makes the biggest difference.

Token spend is a design output. If you’re surprised by your LLM bill, the answer isn’t a cheaper model. It’s a closer look at what you’re asking the model to do, and why.

AI Agent Drift

Steve Whittle — Wed, 06 May 2026 02:35:35 GMT

I run a competitive research workflow on a regular cadence. I use the same prompt, same tools, and the same intent. I want to map the landscape, surface new entrants and track what the competition is up to. The first few runs looked good. Then something started shifting. Not the market, our competitors in this space don’t move that fast. The outputs were different. Different companies surfaced. Different framing of the same players. Different conclusions from largely the same inputs.

The agent hadn’t broken. It was still producing plausible, well-structured research. It just wasn’t producing consistentresearch.

The gap between an agent that performs well and an agent that performs reliably is not something I hear talked about very often.

Why the demo always works

LLMs don’t retrieve. They generate. Every output is a sample from a probability distribution shaped by the prompt, the context window, and the model version. In a single-turn interaction this rarely matters. The variation is small and the output is usually close enough.

Agentic systems change the equation. Outputs become inputs to the next step. Tools get called, state accumulates, and decisions made in step two shape what is possible in step six. Small variations compound. The agent that searched slightly differently in step two is now summarizing different sources in step four. It draws different conclusions in step six. This is not a bug in any conventional sense. It’s the nature of the beast.

A demo is a single run. It proves the agent can do the task. It says nothing about whether the agent will do the task the same way tomorrow, after a prompt change, or after the underlying model gets quietly updated by the provider.

Reliability is not following capability

Recent research makes this concrete. A January 2026 simulation study from arXiv (arxiv.org/abs/2601.04170) defines agent drift as the progressive degradation of behavior, decision quality, and coherence over extended interactions. It identifies three distinct forms: Semantic drift is where outputs deviate from original intent. Coordination drift is where multi-agent coherence breaks down. Behavioral drift is where the agent develops unintended strategies over time. Using theoretical modeling across simulated enterprise workflows, the study projects that unchecked drift could lead to task success rates dropping by over 40% and human intervention requirements tripling. These are projected figures from simulation, not measured production outcomes. But the underlying framework for why the failure mode compounds rather than plateaus is well-constructed.

A large-scale empirical survey from UC Berkeley, Stanford, UIUC, and IBM Research (arXiv:2512.04123) gives the clearest picture of how practitioners are responding. Of 306 practitioners surveyed, 68% keep their deployed agents to at most 10 steps before human intervention. 70% adjust the prompts rather than fine-tuning the model. 74% rely primarily on human evaluation. The researchers frame this as a deliberate paradox: reliability is the top development challenge, yet agents are reaching production. The resolution is that teams are not waiting for the reliability problem to be solved. They ship by limiting what agents can do. Constrained autonomy, sandboxed environments, internal deployment first. It works. But it’s a workaround, not a solution.

Not all agents carry the same risk

Agents can be broken down into two main categories. The risk profile for each is different.

Bounded agents are invoked for a discrete task, produce an output, and hand off to a human or downstream process. Example include: A Cursor session writing a function. A one-shot document summary. The scope is defined. The output is reviewable. Failures are localized. These carry constant risk over time that is largely tied to human review.

Ambient agents run continuously and make ongoing judgment calls without a hard stop. For example, inbox triage or continuous competitive monitoring. Basically any workflow where the agent decides what matters and acts on it repeatedly, without a human checkpoint between decisions.

My competitive research workflow sits between these two. It is repeatable rather than truly continuous, but the expectation of consistency is the same. When I run it on Monday and again in three weeks, I expect the differences in output to reflect differences in the market, not differences in the agent. That’s not what happened.

McKinsey’s 2025 global survey found that 62% of organizations are at least experimenting with AI agents. Only 23% are scaling one in production. The gap between experimentation and scale is not a capability gap. It is a trust, observability, and governance gap.

If you’re running long-horizon agents, here’s what helps

The research and the emerging vendor landscape have converged on a set of mitigation approaches. None of them eliminate the problem but they can reduce risk.

Context management. One of the least visible failure modes in long-running agents is context drift. As conversation history grows, reasoning quality degrades before you ever hit a context limit. The industry has settled on episodic consolidation: periodically compressing older context into structured summaries while preserving recent and relevant state. The Agent Drift paper identifies this as one of three mitigation strategies with the strongest theoretical grounding. Anthropic now ships a native compaction API that automates the loop.

Uncertainty-aware memory. A January 2026 paper from Salesforce AI Research calls the core failure mechanism in long-horizon agents the Spiral of Hallucination. A small grounding error in an early step gets committed to the agent’s context. It then becomes a false premise for every subsequent step. Standard self-reflection does not reliably catch this. The model has already accepted the error as ground truth. The proposed fix flags low-confidence steps before they propagate and triggers correction only when needed. Early results showed meaningful reliability improvements on multi-step benchmarks. This is early research. But it is getting at the cause rather than the symptom.

Checkpointing and interrupt design. Orchestration frameworks like LangGraph have built explicit checkpointing into their execution model. Agents are defined as directed graphs with typed state and hard interrupt points. A human can review, approve, or reset to a known-good checkpoint at any of those points. This converts a brittle autonomous system into a collaborative one. Carnegie Mellon benchmarks published in late 2025 found that leading agents complete only 30-35% of multi-step tasks successfully. This shows that uninterrupted autonomous execution is not the right default for complex workflows.

Golden dataset evaluation. This approach maps most directly to my competitive research problem and our product work. Create a set of representative inputs with human-verified expected outputs. Then run your agent against that dataset on a schedule or before any prompt change goes to production. AWS introduced this at re:Invent 2025 with the general availability of Bedrock AgentCore Evaluations: 13 built-in evaluators, CI/CD pipeline integration for pre-deployment gates, and continuous online evaluation against live production traffic. A demo showed the service detecting tool selection accuracy dropping from 0.91 to 0.3 in production. Without continuous measurement, that degradation is invisible.

Pushpay documented a real production implementation of this pattern. Their golden dataset covers over 300 representative queries with validated responses. It is continuously curated from actual user interactions and fed into an engineering dashboard. The key word is continuously. A golden dataset that does not evolve with your actual workload tests against past state not current state.

Beyond AWS, the commercial tooling has matured fast. Braintrust ties production traces and offline experiments to the same scorer library. A production regression automatically seeds the next test cycle. LangSmith integrates human annotation queues with trace replay, letting engineers convert production failures into evaluation cases. Arize offers always-on drift detection at the session and span level. For teams with HIPAA or data residency constraints, Langfuse is the strongest self-hosted open-source option. It was acquired by Clickhouse in January 2026, but the open-source codebase remains active.

None of this is free. Building and maintaining a golden dataset requires human judgment to define what “correct” looks like for open-ended tasks. That is genuinely hard when correctness is partly subjective. Dataset rot is a real risk. The infrastructure to run evaluations continuously has real cost. The tooling can solve the infrastructure problem but the curation problem is still yours.

For my competitive research workflow, the approach is well-suited. The expected output structure is defined even if the specific content varies. I know what a well-formed competitive analysis looks like. I can score for completeness, source coverage, and structural consistency without specifying exact content in advance. That is an easier evaluation target than most ambient agent tasks.

The durability problem

The industry has gotten very good at demonstrating what agents can do. It has not gotten as good, so far, at ensuring they keep doing it the same way.

Gartner projects that 40% of agentic AI projects will fail by 2027. Poor risk controls are cited as a primary cause. That figure will land as a surprise to anyone who has only ever evaluated their agents at a single point in time.

Narrow, monitored, bounded agents are viable today if you build them with that constraint in mind. Always-on autonomous agents are still waiting on better reliability science, better evaluation tooling, and more organizational honesty about the governance they require.

The question worth asking before deploying any agent is not “can it do the task.” It is whether you can tell when it starts doing the task differently than it did before. And whether you would know before your users do.

Coding interviews in the AI world

Steve Whittle — Sat, 02 May 2026 19:57:59 GMT

Are you testing the wrong thing?

The interview process is supposed to predict job performance. When the conditions of the interview bear no resemblance to the conditions of the job, you’re not predicting anything. You’re running a different experiment and hoping the results transfer.

They don’t.

What we actually test when we ban AI

The stated rationale for AI-free coding interviews is reasonable on its surface. We want to see how candidates think. We want to know if they actually understand the problem. We don’t want someone to paste a prompt and copy the solution.

The concern is legitimate. The conclusion is wrong.

Here’s what an AI-free whiteboard session actually measures: the ability to hold syntax, algorithm structure, and edge cases in working memory simultaneously, under pressure, without tooling, in an artificial environment. That is a real cognitive skill. It just isn’t the one that determines whether someone will perform well on your team in 2026.

The cognitive load of software development has shifted. The job now requires a different kind of thinking: decomposing a problem into chunks small enough to prompt effectively, reading AI output critically before accepting it, knowing when the model is hallucinating and why. A candidate who can recite a BFS implementation from memory but blindly accepts a subtly wrong AI-generated solution is a worse hire than one who forgets the exact syntax but immediately spots the flaw in what the model produced.

We’ve been measuring recall. We should be measuring judgment.

The industry has figured this out

In October 2025, Meta began rolling out an AI-enabled coding interview that replaces one of the two traditional coding rounds at the onsite stage. The internal framing is revealing. The format was designed to be more representative of the actual developer environment and also to make LLM-based cheating less effective.

The problems Meta uses are designed with AI assistance in mind. They’re harder than a traditional coding question. The bar for what a candidate is expected to produce is higher precisely because they have help. Candidates work in a multi-file codebase they didn’t write and have to understand quickly. Prompt-and-paste fails immediately in that environment because understanding the existing architecture is the prerequisite for everything else.

CodeSignal launched AI-assisted coding assessments in May 2025 with a feature that matters more than the AI access itself: a full transcript of every candidate-AI interaction alongside a session replay. You’re not just seeing what the candidate produced. You’re watching how they think.

HackerRank moved in the same direction. Candidates work with AI tools in a controlled environment and interviewers get a detailed view of the problem-solving process, not just the output.

What an AI-free interview cannot show you

When I think about what I actually need to know about a candidate, four things matter that a no-AI interview cannot surface.

Prompt quality as a diagnostic. The way someone frames a problem for an AI is a direct readout of how they think about the problem. Vague prompts reveal vague thinking. A candidate who writes “fix my function” tells you something different from one who writes “this recursive function is hitting a stack overflow on inputs above n=1000, here’s the current implementation, what’s the likely cause.” The second candidate has already diagnosed the problem. They’re using the AI to confirm and implement. That’s engineering judgment.

Verification instinct. One E7 candidate at Meta watched Claude Sonnet repeatedly hallucinate on a maze problem. The question isn’t whether the AI was wrong. The question is whether the candidate caught it. Did they know what correct looked like before the model answered? Did they push back? A candidate who accepts wrong output without question is a risk that an AI-free interview will never expose, because you never gave them an AI to accept.

Task decomposition. Candidates who performed well in Meta’s AI-enabled format guided the AI incrementally rather than asking for wholesale solutions. One successful candidate described her approach: start with the core logic as a single function, review it, then build out from there. That instinct, to keep the scope small enough to verify at each step, is exactly how good engineers approach complex problems. It’s invisible in a no-AI environment because there’s nothing to decompose for.

Communication under ambiguity. Meta’s internal evaluation criteria for this round includes a phrase that functions as an answer key: “Should use AI, but need to show you understand the code. Explain the output. Test before using. Don’t prompt your way out of it.” That is a rubric for thinking, not for tool use. The candidate is being evaluated on whether they can narrate a reasoning process in real time, hold a conversation with an interviewer while working with an AI assistant, and remain the accountable decision-maker throughout. That skill matters on the job. AI-free interviews don’t test it.

The gaming problem is real but solvable

The obvious objection is that AI-assisted interviews are easier to game. A fast prompter with shallow understanding can look strong if the evaluation is just the output.

That’s true. But it’s a design problem, not a fatal flaw.

Interaction transcripts solve most of it. When you can see the full sequence of what a candidate asked for and how they responded to what they got, shallow prompters reveal themselves quickly. They ask for too much at once. They accept the first answer without testing it. They can’t explain the code when asked.

Multi-file codebases with staged checkpoints solve the rest. A problem that requires understanding existing architecture before making any change can’t be solved by pasting a description into a chat window. The AI doesn’t have the context. The candidate has to build it, which means they have to understand it.

The gaming risk in no-AI interviews is equally real and far less visible. A candidate who has memorized LeetCode patterns looks identical to one who genuinely understands algorithms. At least AI transcripts expose the reasoning process. A whiteboard session shows you the answer. A session replay shows you how the candidate thinks.

What to change before your next hire

None of this requires overhauling your entire process. Three things make the difference.

Redesign the problem before you change the rules. A standard LeetCode question with AI access is still a bad question. The problem needs to be complex enough that AI assistance is a navigation tool rather than a solution dispenser. Ambiguous requirements, existing codebases, staged checkpoints. Problems designed so that understanding is the prerequisite for prompting.

Make the AI interaction visible and gradable. If you’re using a platform, use one that captures transcripts. If you’re running your own interviews, ask candidates to narrate their prompts out loud and explain what the AI gave them before they act on it. That narration is the interview.

Keep a short no-AI segment with a clear purpose. Baseline fundamentals still matter. A candidate who can’t read a stack trace or reason about complexity without assistance is a real risk. A focused no-AI segment to test that floor is legitimate. Don’t treat that floor as the whole evaluation.

The prediction model you’re running

Every hiring process is a prediction model. The inputs are interview signals. The output is a forecast of job performance. When the inputs don’t reflect job conditions, the model is broken.

Running AI-free coding interviews for engineers who will spend their careers working alongside AI tools is like running driving tests with no steering wheel because you want to assess balance. The rationale sounds defensible. The instrument is wrong.

The industry has started correcting. The question is whether your process has.

The bill is coming due - AI coding vendor lock-in

Steve Whittle — Tue, 28 Apr 2026 02:15:26 GMT

If you’ve been using AI coding tools over the past two years, you’ve been getting a great deal. Frontier model access embedded in your IDE, powering your agents, running in your CI pipelines — for prices that don’t actually cover what it costs to serve you.

That’s not a bug. It’s a strategy. And strategies change.

This post is about what happens when the economics of AI-assisted coding get repriced, why that repricing is likely in the next 12 to 24 months, and what your engineering organization should be doing right now before the bill arrives.

The Free Ride Won’t Last

According to internal projections reported by the Wall Street Journal, OpenAI does not expect to reach profitability until 2030. Anthropic projects reaching positive free cash flow by 2027 or 2028. Both companies are growing revenue at extraordinary rates. Anthropic recently reported annualized revenue exceeding $30 billion. But revenue and profit are not the same thing, and right now these companies are very far from the same thing.

The structural problem is inference compute. OpenAI spent roughly 50% of its revenue on inference costs alone in recent years, with training costs pushing total expenditure well above what comes in. Every token you generate costs real money in GPU time. The pricing you see at the API does not reflect what it actually costs providers to serve those tokens.

Open-weight models of comparable capability are anywhere from 17x to 18x cheaper than Anthropic’s API pricing and are from providers who are covering their costs and making margins. That isn’t an indictment of Anthropic’s business model. It reflects real differences in model capability, trust, tooling maturity, and enterprise positioning.

The narrative you’ll hear most often is that inference costs will keep falling and everything will work out. That narrative has been repeated for three years. Inference costs for frontier models have not followed the curve that optimists projected, partly because each new model generation is larger and more capable than the last, which resets the compute baseline. Lower prices on last-generation models don’t help you if you need current-generation capability.

The current pricing environment is a competitive land-grab. It’s not sustainable.

The IPO Pressure Cooker

Both Anthropic and OpenAI are moving toward public markets. Anthropic has engaged IPO counsel and is reportedly discussing an offering as early as Q4 2026, targeting a raise exceeding $60 billion. OpenAI is targeting a similar timeline at a valuation approaching $1 trillion.

Private investors fund growth stories and tolerate long paths to profitability. Institutional fund managers running discounted cash flow models do not. The S-1 filing will contain actual unit economics for the first time. Analysts will model gross margins. Price-to-earnings ratios will matter in a way they don’t when you’re raising from VCs.

This creates a specific incentive. Both companies have strong motivation to show margin improvement before listing, not after. The levers available are cost reduction (hard, because compute costs are driven by usage and model scale) and price increases (a decision that can be made in an afternoon).

There’s also the Inference Trap: You build the best model, usage surges, inference compute explodes, and you face a forced choice between throttling users, raising prices, or cannibalizing the training compute you need to stay competitive. Anthropic experienced five major platform outages in a single month in early 2026. Claude Code users reported burning through usage allocations far faster than the pricing implied. That’s the Inference Trap operating in real time.

The combination of IPO pressure and Inference Trap dynamics makes a repricing event not just plausible but structurally likely. The question isn’t whether it happens. It’s whether you’re ready when it does.

What You’ve Actually Built On

Most engineering teams believe they have less AI vendor lock-in than they actually do. The assumption is: swap the API key, update the model name, done. There’s a lot more to it than that. It gets worse the deeper into agentic workflows you go.

The lock-in profile varies significantly by use case:

IDE-embedded tools like Copilot, Cursor, and Claude Code represent the shallowest lock-in. You could switch IDEs or model backends with little effort. But don’t underestimate soft stickiness. Developer muscle memory, .cursorrulescustomizations, team-shared system prompts, and workflow integrations all add switching friction. A price increase here hits developer productivity budgets, which are visible and politically sensitive.

Agentic coding workflows are where real lock-in begins. Agentic systems don’t just call a model — they build scaffolding around it. System prompts are tuned to a specific model’s personality and failure modes. Tool-calling schemas are optimized for how that model interprets them. Retry logic and output parsing are calibrated to observed behavior. When you switch models, that scaffolding doesn’t transfer cleanly. You’re not changing a config parameter. You’re running a re-evaluation campaign against your own codebase. Industry data suggests migration costs when provider lock-in forces a move average over $315,000 per project, and that figure reflects situations where teams already had some abstraction in place.

CI/CD and automated pipelines carry the highest risk. These are production systems with determinism requirements. Prompts optimized for one model may produce subtly different outputs on another. Those outputs can look similar enough to pass manual inspection but break downstream parsers and validation steps. Model version pinning provides a false sense of stability because providers deprecate models with 90 days’ notice, and there is no guarantee of behavioral equivalence between versions. The fundamental problem is that you cannot treat an LLM call in a production pipeline the same way you treat a deterministic function call. When you switch models, you have to prove the pipeline still works. You cannot assume it.

Open Source is a real option, but it has gaps

The obvious response to pricing risk is to use open source models, self-host, and pay for compute instead of markup. That path is more viable than it was 18 months ago but has real gaps that tend to be underestimated.

The capability gap has largely closed on many dimensions. Open models now match or surpass closed models on knowledge benchmarks, mathematical reasoning, and graduate-level science. The gap that remains is concentrated where it matters most for coding: production-level agentic tasks, multi-step software engineering, and complex tool use. On SWE-bench Verified, the most practically meaningful coding benchmark, the best open models are within a few points of frontier closed models. That gap is still an issue at the tail of task complexity.

The price differential is big. DeepSeek V3.2 is available at roughly $0.28 per million input tokens. Claude Opus 4.7 is $5.00 per million input tokens. That’s a 17x difference. For high-volume workloads the economics are compelling even accounting for operational overhead.

But here’s what the open source advocates undersell: switching models is not the same as switching model providers. The scaffold matters as much as the model. Real-world benchmarks show a 22-point swing on the same task with the same model when you change the agent scaffold and tooling. Switching models requires re-validating your entire system, not just verifying the model output looks reasonable.

The operational burden of self-hosting is a real cost transfer. Inference infrastructure, model serving with tools like vLLM or Text Generation Inference, GPU provisioning, update cadence, and security patching all fall on your team. For most organizations without dedicated ML infrastructure experience, this isn’t a savings. It’s a new operational surface area.

There’s also a geopolitical dimension worth naming directly. The strongest open models right now (DeepSeek, Qwen, Kimi) are Chinese-developed. For organizations with data sovereignty requirements, government contracts, or security-sensitive codebases, the lineage of a model matters. This isn’t a reason to dismiss these models outright, but it’s a factor that belongs in your architecture decision.

The Protocol Layer Is Your Best Friend

The most practical near-term lever against lock-in isn’t switching to open source. It’s building an architecture that makes switching possible.

Model Context Protocol (MCP) is the most significant structural development here. Originally developed by Anthropic and then donated to the Agentic AI Foundation (AAIF). This foundation was co-founded by Anthropic, Block, and OpenAI. MCP has achieved something rare: genuine cross-industry adoption. OpenAI abandoned their proprietary Assistants API and adopted MCP. Google DeepMind, Microsoft, and AWS are all on board. When direct competitors converge on a shared infrastructure standard it signals inevitability.

MCP decouples the agent-tool connection layer from the model layer. Your integrations with databases, APIs, filesystems, and external services are built once against the MCP standard and survive a model swap. That’s the right layer to standardize at.

Pair that with an LLM Gateway such as LiteLLM or Portkey, middleware that abstracts provider-specific API differences behind a single interface, and you get a system where the model backend is genuinely swappable without rebuilding your application logic. The marginal complexity cost of adding this abstraction early is low. The switching optionality it creates is high.

Be honest about what MCP doesn’t solve though. The protocol handles tool integration, not model behavior. When you swap models, your prompts still need re-validation. MCP can also consume 40-50% of available context window before any actual work begins, which creates real production tradeoffs. Standards help. They don’t eliminate the work.

What You Should Do Today

The cost of acting on this now is low. The cost of acting after a pricing shock is high.

For IDE tools: Evaluate whether your current tooling is model-agnostic or model-bundled. Prefer tools that let you swap backends. Baseline your developer productivity metrics now. You need a measurement baseline before any changes hit, not after.

For agentic workflows: Add an LLM Gateway from the start of any new project. Keep your agent orchestration layer architecturally separate from your model API calls. This is the single highest-leverage structural decision you can make. Build evaluation suites against your own codebase, not generic benchmarks. Generic benchmarks tell you how a model performs in the abstract. Your eval suite tells you whether you can safely swap models in your specific system.

For CI/CD pipelines: Treat every LLM call as a third-party dependency with explicit versioning, SLA monitoring, and a tested fallback path. Design for graceful degradation. What does the pipeline do when the model endpoint is slow, unavailable, or has been updated? This should be a documented decision, not an untested assumption.

Across all use cases: Audit your current AI spend and its concentration across providers. Most teams have no idea what this number is. Monitor the IPO timelines. The S-1 filings will be the first time the public sees actual unit economics from these companies, and they will move the conversation. Build internal familiarity with at least one open-weight model family. Even if you never deploy it in production, that knowledge reduces the information asymmetry in any future pricing negotiation.

A Strategy Note, Not a Panic Note

The goal here is not to abandon frontier models. They are genuinely better at certain tasks, the tooling ecosystem around them is more mature, and for many use cases the productivity gains justify whatever they end up costing.

The goal is not to be surprised. More specifically, the goal is to avoid being in the position of needing to move urgently with no alternatives evaluated and no time to build them.

Engineering organizations that have done the architecture work to reduce switching costs will have options when prices move. They’ll be able to make a deliberate choice between absorbing the increase, substituting a capable alternative, or negotiating from a position of real leverage. Organizations that haven’t done this work will face a different situation: urgent need, unknown switching cost, and a vendor who knows it.

The bill is coming. The amount is unknown. The only variable you control is how ready you are to pay someone else instead.

Replacing managers with an AI World Model

Steve Whittle — Fri, 24 Apr 2026 15:22:03 GMT

Jack Dorsey and Roelof Botha published “From Hierarchy to Intelligence” on March 31, 2026. It is a serious piece of thinking. The historical framing is sharp, the diagnosis of why hierarchies exist is largely correct, and the argument that AI changes the information-routing constraint is real.

But the essay is about one-third of what managers actually do. It treats that one-third as the whole job, removes the people doing it, and calls the problem solved. The other two-thirds are still there. They just don’t have anyone doing them anymore.

To understand that better, we need to be precise about what managing actually is.

Three Clusters, Not One

In 1973, Henry Mintzberg published The Nature of Managerial Work, based on direct observation of what managers do with their time. His finding was that managerial work falls into three clusters: informational, interpersonal, and decisional.

The informational cluster is what Dorsey and Botha are talking about. Managers monitor what’s happening, disseminate that information across the organization, and represent the team to the outside world. This is the information routing function. It’s the layer that hierarchy was built to support, and it’s the layer that AI can now perform continuously, at scale.

That’s true. But it’s only one of three.

What AI Cannot Route

The interpersonal cluster covers things that depend on trust, relationship, and human accountability. The leader motivates and develops people, figures out who has potential and what they need to grow, and has the difficult conversations.

The liaison role involves building relationships across organizational boundaries, the kind of connective tissue that makes cross-functional work actually work. The figurehead role is about legitimacy and accountability. When something goes wrong, someone needs to be responsible in a way a system cannot be.

A telling data point from the Block restructuring itself: current and former employees told The Guardian that roughly 95% of AI-generated code changes still require human modification. This is in a remote-first, highly digital, machine-readable organization, exactly the environment that Dorsey describes as most amenable to this model. The humans are still in the loop not because the information system failed, but because the work itself requires judgment that isn’t in the data.

The decisional cluster is where this becomes even clearer. Mintzberg’s entrepreneur role involves sensing and acting on opportunities that aren’t visible yet. A world model, by definition, can only reflect what has already happened. It cannot tell you what to build that doesn’t exist yet. The disturbance handler role is about responding to crises and genuinely novel situations, exactly the circumstances where pattern-matching on historical data is most likely to fail. Resource allocation and negotiation involve competing interests, trust between parties, and judgment under uncertainty. In regulated industries, financial services, healthcare, any domain with fiduciary obligations, you can’t delegate these decisions to a system regardless of how good it is.

The Risks Worth Naming

Beyond the functional gaps, there are a few things worth noting:

Post-hoc rationalization?. Block cut 40% of its workforce in February 2026, before the essay was published. The stock jumped roughly 22%. Botha, who co-authored the essay, sits on Block’s board. Morgan Stanley upgraded Block to overweight after the cuts. Goldman Sachs raised its price target. It is fair to ask whether the intellectual framework followed the business decision or preceded it. That doesn’t make the argument wrong, but it does mean the incentive to believe the argument is quite strong for the people making it.

The flat structure graveyard. Zappos tried holacracy. Valve famously ran without managers. The Spotify model has been widely adopted and widely struggled with. These experiments didn’t fail because the idea was wrong in theory. They failed because removing formal structure doesn’t remove the need for coordination. It just moves coordination into informal channels, where it becomes invisible, political, and dependent on whoever has the most social capital. The information routing problem gets solved. The interpersonal and political problems get worse.

Data completeness. A world model built from Slack threads, Jira tickets, pull requests, and performance metrics reflects what was written down. A significant fraction of organizational knowledge is never written down. It lives in the judgment calls that didn’t make it into a doc, the context a senior engineer carries about why a system was built the way it was, the reason a decision was made three years ago that nobody remembers to explain to new people. The model sees the artifact. It doesn’t see the reasoning behind it.

Data quality and drift. This one is distinct from completeness and is arguably more dangerous. Information that was accurate when it entered the system becomes stale. The system continues to present it with the same authority as fresh data. Decisions get made on information that was true six months ago and isn’t anymore. You don’t see the error at the time. It shows up later, in ways that are very hard to trace back to their source.

This is a documented, recurring failure in knowledge management systems generally. It’s not theoretical. Platforms like Guru have built their core product differentiation specifically around the knowledge freshness problem because the industry learned, repeatedly, that drift is the default. Small errors accumulate in decisions that each look reasonable in isolation, until something downstream breaks in a way nobody can explain.

Regulatory reality. For companies operating in financial services or healthcare, the question of whether AI can replace decision-making isn’t just organizational. It’s legal. Explainability requirements, fair lending law, fiduciary duty, these don’t care how good your model is. Decision accountability cannot be delegated to a system in regulated domains. This alone may prevent this type of World Model from being used is certain industries.

Where This Is Actually Going

None of this means the Dorsey/Botha thesis is wrong. I just think it’s incomplete.

The informational cluster is being automated. That is a real and permanent shift. Managers who spent the majority of their time aggregating context, relaying status, and maintaining alignment across teams are already less necessary than they were. That part of the argument makes a lot of sense.

What’s more interesting is what happens to the other two clusters as this plays out. If information routing gets absorbed by AI systems, the interpersonal and decisional work doesn’t disappear. It becomes more visible and it’s value should be more obvious. Managers who survives this transition are the ones who were always doing the work that was hardest to put on a job description.

The open question is whether that work changes in character as organizations become more AI-instrumented, or if it simply becomes more prominent because everything else has been stripped away. Does managing people become fundamentally different when the coordination layer is automated? Or does it turn out that the relational, developmental, and judgment-intensive work was always the real job, and the information routing was just the overhead we confused for the substance?

I don’t think anyone knows yet. Block’s Q1 2026 results will be a first data point. If they hit $12.2 billion in gross profit with 40% fewer people, the thesis gets harder to argue with. If they don’t, their Roman Army comparisons will age badly.

Either way, the question is worth asking more carefully than “can AI replace the org chart.” The org chart was never the point. It was just the structure we built to solve three problems at once. AI can solve one of them. The other two? Let’s see.

Is This Really What You Want to Measure?

Steve Whittle — Wed, 22 Apr 2026 23:19:27 GMT

I’m reading C. Thi Nguyen’s new book, The Score, and one idea in it keeps coming back to me. He calls it value capture,the process by which a rich, meaningful goal gets quietly replaced by the metric you were using to track it. You start measuring something because it points toward what matters. Then, gradually, the measurement becomes what matters. The original goal doesn’t disappear. It just stops being the thing that drives decisions.

That’s not a philosophy problem. That’s every day in most engineering organizations.

What we actually measure

Let’s be clear about what passes for engineering measurement in most companies.

Output metrics are the default: story points, tickets closed, pull requests merged, lines of code written. They’re easy to collect, easy to visualize, and they feel like signal. The problem is they measure production, not value. A team can close 200 tickets in a sprint and ship nothing a customer cares about. Story points aren’t a unit of value, they’re a unit of negotiated effort, and that negotiation starts the moment someone decides to track them.

Performance metrics try to go one level deeper: code review turnaround time, sprint commitment hit rate, on-call response time, test coverage percentages. These are more interesting because they reflect process health. But they still measure fidelity to a process, not effectiveness of the work. A team can hit 95% sprint commitment every week by sandbagging estimates. Reviews can be fast because nobody’s actually reviewing.

Efficiency metrics are where most engineering organizations have landed recently, particularly DORA, the four-key-metrics framework: deployment frequency, lead time for changes, change failure rate, and time to restore service. DORA is genuinely useful as a diagnostic. The problem is what happens when it moves from a team-level health check to a leadership dashboard. Deployment frequency gets gamed by trivializing deployments. Lead time gets gamed by where you start the clock. You end up with a team deploying 15 times a day that is still six months from shipping anything meaningful.

Every one of these metric categories lives entirely inside the engineering system. None of them has a direct connection to whether the engineering organization is actually doing its job.

The game you didn’t know you were playing

Nguyen’s book makes a distinction that hits differently in an engineering context. He separates striving play from achievement play. In striving play, the goal is to engage fully, the process, the judgment, the craft. In achievement play, the only thing that matters is the score. Great games are designed so that chasing the score also produces striving. You can’t get good at chess by gaming the scoring system; you actually have to get good at chess.

Institutional metrics are the opposite. They strip out the magic circle, Nguyen’s term for the temporary, voluntary frame that makes game constraints feel meaningful rather than oppressive. In a board game, you accept arbitrary rules because you chose to sit down and play. In a work context, those rules aren’t arbitrary and they aren’t optional. The score follows you. It shows up in your performance review. It gets presented to the board.

What’s left, once you remove the magic circle, is a system that rewards achievement play. And engineers, who are, professionally, some of the best problem-solvers in any room, find the optimal path to the score. This isn’t cynicism. It’s a completely rational response to the incentive structure you built.

The consequences are predictable:

Velocity becomes sandbagging. The moment team velocity appears on a leadership dashboard, estimation inflates. Points expand to protect the team. After a few quarters, the number is politically stable and informationally useless.

Deployment frequency rewards triviality. If deploying frequently is the metric, the rational move is to break work into smaller pieces, not because small batches are better (they often are, but for different reasons), but because each deployment ticks the counter.

Commitment rates reward conservatism. Measure whether a team delivers what they promised and they’ll promise less. You’ll see consistently green dashboards and a team that’s quietly becoming slower.

Code review speed becomes rubber-stamping. If time-in-review is visible, reviewers learn to approve fast. Technical debt accumulates invisibly while the metric looks healthy.

This is Goodhart’s Law in action, once a measure becomes a target, it ceases to be a good measure. But Nguyen’s framing adds something important: it’s not just that the metric gets corrupted. It’s that people’s values get reshaped around it. The engineers optimizing for velocity aren’t lying. They’ve internalized the metric. The metric has become, for them, what good work looks like. That’s value capture. That’s the thing that’s actually hard to fix.

What these metrics are actually telling you

This is worth being precise about, because the answer isn’t “nothing.” These metrics have legitimate uses. The failure is the mismatch between what they measure and what leaders use them to decide.

Output metrics, velocity and tickets closed can tell you if work is flowing through the system at all. A team whose velocity drops 40% in two sprints has a problem worth investigating. What they can’t tell you is whether the work matters.

Performance metrics are operational diagnostics. Long code review cycles, high defect escape rates, chronic on-call fatigue, these are real signals about process dysfunction. Treat them as process indicators, not as performance scorecards.

DORA is pipeline health. Lead time and deployment frequency tell you something real about delivery capability. Change failure rate and MTTR tell you something about resilience. But all four are downstream of the question that actually determines whether an engineering organization is performing: are you building things that move the business, fast enough to matter?

None of the standard metrics reach that question. The reason is structural. They’re easy to collect because they live inside the engineering toolchain. Anything that requires a connection to product outcomes, customer behavior, or business results is harder, and that difficulty is precisely why it tends not to get measured.

The missing layer

The data we reach for first, the data that’s easy to collect and easy to present, systematically hides what’s actually going on. The measurement layer that’s missing isn’t more engineering metrics. It’s a feedback loop that closes outside engineering.

A few things that would actually tell you something:

Outcome linkage. Can you connect a shipped feature to a measurable change in user behavior or business results? Not “we shipped it” but “after we shipped it, the thing it was designed to move, moved.” This requires instrumentation, a documented hypothesis before work starts, and a willingness to wait. None of those come naturally to sprint planning cadences. But without them, you’re measuring production, not impact.

Flow efficiency. The ratio of value-added time to elapsed time in your delivery process is more interesting than raw velocity. A feature that takes 12 weeks from idea to production, with only 2 of those weeks involving actual engineering work, has a flow efficiency of 17%. That’s a systems problem — and throughput metrics will never surface it. The bottleneck isn’t in the work; it’s in the waiting.

Technical health as a leading indicator. Complexity trends, dependency staleness, incident frequency — these are imperfect but directionally useful signals about whether your codebase is getting easier or harder to extend. Engineering organizations that ignore these tend to see velocity collapse right when the business needs them to accelerate. It’s not a coincidence.

Team capability over time. Are engineers growing? Are they retaining? Are they increasingly autonomous, or increasingly dependent on a few specialists? These lag badly as indicators, but they’re leading indicators of whether the organization will still be functional in two years. No sprint metric captures them.

Breaking the cycle

Swapping bad metrics for better ones doesn’t solve the problem. If you replace velocity with DORA and keep the same incentive structure, you’ll get gamed DORA metrics in six months. Nguyen is clear on this in The Score, once value capture takes hold, the answer isn’t a better score. It’s rebuilding the conditions under which people can reclaim their own values.

In an engineering context, that means a few things:

Separate diagnostic metrics from evaluation metrics. Deployment frequency is useful when a team uses it to understand their own pipeline. It becomes corrosive the moment it appears on a leadership report as a proxy for team performance. The same number. Completely different effect depending on who it’s for and what it drives.

Measure outcomes and accept the latency. This requires leaders to resist the urge to instrument everything that can be instrumented. Define what you’re trying to move before work starts. Measure it after you ship. Accept that the feedback loop is slower than a quarterly review cycle, and if your performance management cycle is shorter than your product feedback cycle, that’s the real problem to fix.

Make the system visible, not just the throughput. Flow efficiency, incident trends, and technical debt trajectories give teams and leadership a shared picture of systemic constraints. When the conversation shifts from “why didn’t you close more tickets” to “what’s blocking flow and what would it cost to fix it,” you’re at least asking questions that can produce useful answers.

Let the teams define the metrics for their own work. This one is underrated. The people closest to the work know what signals matter and which ones can be gamed. Metrics designed by a team to understand their own performance are completely different from metrics imposed from above to evaluate them. The former creates accountability. The latter creates the conditions for value capture.

Hold leadership accountable for outcome clarity. A lot of engineering metric gaming exists because the business hasn’t clearly defined what success looks like. If product leadership can’t say what a feature is supposed to change, and how they’ll know it worked, the engineering team will fill that vacuum with whatever proxy feels safest. Measurement quality is a leadership problem as much as a measurement problem.

The question the dashboard can’t answer

Look at whatever engineering metrics you’re currently tracking. For each one, ask: what decision would I make differently if this number were 20% better or worse? If the answer is “I’d evaluate someone’s performance differently,” the next question is whether you’re actually measuring what you want to optimize for, or just what’s available.

Nguyen ends The Score with the question the title comes from: is this the game you really want to be playing? Most engineering organizations never ask it. The dashboard is there. The numbers are green or red. The sprint review happens. Nobody stops to ask whether the game itself is worth playing.

Measuring the wrong things precisely is worse than not measuring. It gives you the confidence of data without the benefit of insight. The metric becomes the goal. The goal becomes the metric. The original purpose, building something that matters, gets quietly replaced. And everyone in the room is looking at the dashboard, wondering why the product still isn’t getting better.

AI Coding and Context Tax

Steve Whittle — Wed, 22 Apr 2026 21:24:07 GMT

When writing code using AI assistance, it’s easy to generate a lot of code in a very short amount of time. As the development process goes on, even more code will be created. This gets complicated when it comes to maintaining and expanding the code base. Over time the person working on the code has to get back into that code to address bugs or add features. Getting to the point where you understand the code well enough, again, takes time. I have seen that while it’s easy to create a lot of code with AI, it’s not as easy to come back into that code after it’s been created.

One thing we need to keep in mind when using AI to generate code is that now there becomes a difference between writing the code and understanding the code. In pre-AI software development, where you needed to manually write the code, there was a certain understanding of what you were writing while you were writing it. With AI, that process goes away. The code is created, and you then need to understand that code and become familiar with it so that you can then make updates.

When you write the code yourself, your brain is creating an internal model of that system that you’re creating. When you step away from that code and come back, some of that that internal model, will persist. Interacting with something, rather than just reading it, tends to be retained in the brain longer. Since you’ve written the code yourself, it’s easier to come back into that code and continue developing. If you have not gone through that process, if you are reviewing code that was written by AI or by another person, then you’re establishing familiarity with the code each time without having that pre-existing internal model. This makes it more difficult and takes longer to come up to speed with that code so you can continue development.

Protecting Comprehension

The answer isn’t to use AI less. It’s to recognize that AI handles code generation but cannot handle comprehension. Comprehension doesn’t transfer. Each of the techniques below are just a mechanism for forcing you to build a genuine internal model of the code, not just review it from the outside. If you spend 2 hours building something with AI that would have taken 2 days manually and then spend 45 minutes re-establishing context every time you return to it, the break-even on that speed gain comes faster than you think.

1. The spec-first inversion

The conventional flow is: prompt → code → review. Reverse the order. Before asking the AI to generate anything, write a 3–5 sentence description of what the code should do and why. Not a formal spec — just externalizing your intent. That document becomes your re-entry anchor, and it forces the model to work from your understanding rather than producing something you have to reverse-engineer.

2. Treat code review as active encoding, not QA

When you review AI-generated code, the goal shouldn’t just be catching bugs — it should be deliberately building the internal model you didn’t get from writing it. Don’t rubber-stamp diffs. Walk the logic path. Rename things that don’t reflect your mental model. The review is the learning; skip it and you’re adding to the context debt, not paying it.

3. The commit message as cognitive snapshot

Write commit messages as if you are explaining the change to yourself six months from now — not what changed, but why and what you understood about the system at the time. This is cheap, async, and gives future-you a scaffold that’s tied to the exact moment of maximum context. Git log becomes a context recovery tool, not just a change log.

4. AI-assisted re-entry

Use AI to rebuild context, not just create code. Paste the module into a fresh session and ask: “Explain what this does, what it assumes, and what would break if X changed.” That AI-generated explanation — corrected where wrong — becomes your working model. You’re using the same tool that created the debt to help pay it down.

What we’re seeing with AI assisted software development is that it is not a panacea. AI is changing the mental effort required to maintain understanding over time. It’s not just about how quickly we can create the code. It’s about being able to sustain that comprehension. The key takeaway here is to use AI to accelerate coding, but don’t outsource the understanding of that code. It needs to be understandable and comprehensible over time by humans after the code is written.

The software development bookshelf

Steve Whittle — Sun, 19 Apr 2026 23:25:44 GMT

I’ve been doing some thinking about how software development has changed since the introduction of AI for code generation. Having been through a lot of these changes recently with building a new product, I think that it’s become a lot more interesting, but it’s also highlighted some important parts that may not have been given the importance they deserve.

The way I look at software development now is like a bookshelf. You have bookends on either end with books in between. In this metaphor, the LLM or the AI system is the books. This is what generates actual code, whether you’re using Claude Code or OpenAI, Google or anything else. The bookends are the parts that hold up that code.

I see the bookends as follows:

The first bookend is the definition of what you actually want to build. This was always important in traditional software engineering because engineering resources were scarce. The amount of time to develop something was fairly long so there was a lot of analysis and discussion and definition ahead of time, sometimes too much. You would spend months talking about something only to realize that you’ve missed the window and you didn’t need it anymore. That kind of fell a little bit by the wayside as vibe coding came about. People could just type in a one- or two-sentence prompt and get back something that was kind of what they wanted, but not really. The second bookend is the verification and the validation. You’ve given it clear instructions but AI is a probabilistic system so you are going to get something that’s probably very close to, but not exactly what you need. That verification process requires a human to look at the output with human-level context. They needc to look at what’s been built and validate that it did the right thing, that it built what they thought it was going to build. The human must then make changes, maybe small changes, maybe big changes, to get that code or that application really production ready. Without those bookends, if you look at just the way some people have approached coding, you end up with garbage in, garbage out. Without the human aspects on either side, then the AI is going to build something but it may, in fact, probably will not build exactly what you want.

So with all this you might be thinking, well, what are we really getting with AI coding? In a short: speed. We’re also getting a fair amount of accuracy.

If you’ve properly defined the problem, the AI system may generate thousands, tens of thousands, hundreds of thousands of lines of code. Without AI just the physical typing of that can take weeks or months, also taking into account things like typos and people getting tired. This is oversimplifying, but it is the manual work that needs to be done to build an application. What AI is doing is automating that manual piece. Automation allows you to do the right thing really quickly but it also allows you to do the wrong thing very quickly. Even though it can do a lot of that bulk work, we still need that validation piece.

When we talk about the bookend work that humans need to do in this process, this requires a certain amount of domain knowledge, skill, and overall knowledge of how systems work. This generally requires someone who has experience.

The problem that we’re starting to run into is if we look back again at our bookshelf, the book portion, the actual typing of code and debugging typos and logic issues was typically something that was done by entry-level software developers. By doing this, they would gain experience and they would be able to move up to be more senior software developers.

The challenge that we’re going to face is that, with the work that may typically have been done by junior developers now being done more by AI, you end up with a need for fewer junior developers. We’re starting to see now that companies are not hiring as many of those. That is short-sighted because when the current senior software developers are gone, retired, left the company, etc., you don’t have enough of those junior software developers to move up into those senior roles.

This is a problem that has not yet been solved. There needs to be a better way to address this because we still need a pathway for people to enter the industry to do those bookend tasks. It will be interesting to see how the industry evolves around that.

The latest and greatest coding models get better and better at generating code all the time. This is great, and when you look at code generation from this point of view, the frontier models are still helpful. However, a lot of the code that’s going to be generated, assuming you’ve clearly documented the problem and you’re validating the results, may not require as powerful of a model as you might think. This will depend on the code that you need to generate. If you’re doing something like medical diagnoses or tackling physics problems, or dealing with vast amounts of data then yes, you want a very high-power model. I suspect that a simpler model may do just as well in many cases, assuming you have those bookends in place. I don’t have hard data on this but I will be interested to see how things evolve.

The Ticket is Dead, Long live the Spec

Steve Whittle — Fri, 17 Apr 2026 15:10:39 GMT

As AI-assisted coding gains wider adoption, we need to look at the tools we use to manage the development lifecycle. Historically, the software world has lived in ticketing systems like Jira. This made sense when code was the bottleneck; Jira allowed us to define, plan, and track the manual labor of writing code. When the “act” of coding required significant time and resourcing, tracking velocity and story points provided a necessary although imperfect visibility into progress.

However, with AI now drastically reducing the effort required to generate code, our legacy metrics—points, velocity, and story size—are starting to break down. We have an opportunity to rethink the development process entirely.

While bugs and features both require attention, the impact of AI is most profound in the feature development process. Automation is a double-edged sword: it helps you do the right thing quickly, but it also helps you do the wrong thing faster. To avoid the latter, we must shift our focus to three core phases: Specification, Development, and Verification.

Specification: Raising the Bar

In an AI-augmented workflow, the “size” of a ticket changes. We are no longer limited to small, rigidly defined bits of functionality. We can now deliver much larger chunks of work in a single pass. This shift, however, raises the bar for our upfront definitions.

In a Jira context, the ticket should serve primarily as a tracking mechanism that anchors the specification—whether that spec lives in the ticket itself or is linked via Notion or Confluence. The goal is to create a “source of truth” that defines the feature with enough clarity that an AI can execute it and, more importantly, so a human can verify the result. The planning and point-assigning “middle” of the process is becoming less relevant; the real work is now happening in definition and verification.

Development: Capturing the Thought Process

We cannot treat AI development as a “black box” where a spec goes in and code comes out. While AI-assisted coding frees engineers from the drudgery of syntax and manual typing, the engineer’s role as a “guide” is more critical than ever.

We need to capture the thought process behind the implementation. Tools like Claude Code and Cursor already allow us to export session data that details how an engineer navigated a problem, where the AI stumbled, and how it was corrected. By automatically appending these session summaries to Jira tickets, we can maintain a complete audit trail of the engineering logic without adding manual overhead for the developer.

Verification: The New Bottleneck

If AI has removed the bottleneck of writing code, it has moved it to verification. The temptation with AI tools is to move fast and “test it in production” to see if it works. This is backward.

Because of the potential for AI hallucinations and the risk of building on top of incompletely defined requirements, rigorous verification is now more important than it was in the manual era. If we accept poorly defined or unverified code into our codebase today, we are simply compounding technical debt at an accelerated rate.

Closing the Loop

This shift isn’t something we need to wait for Jira or other vendors to solve. You can implement this workflow today by using existing tools as simple tracking mechanisms for robust specifications and automated session logs.

Over time, I expect legacy ticketing features focused on granular implementation steps to be sunsetted or deprecated as the industry moves toward this new model. By focusing on the feedback loop—using AI to improve specifications and ensuring rigorous verification—we can ensure that “faster” also means “better.”