<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Steve Whittle]]></title><description><![CDATA[Steve Whittle]]></description><link>https://www.stevewhittle.net</link><image><url>https://www.stevewhittle.net/img/substack.png</url><title>Steve Whittle</title><link>https://www.stevewhittle.net</link></image><generator>Substack</generator><lastBuildDate>Tue, 23 Jun 2026 12:43:56 GMT</lastBuildDate><atom:link href="https://www.stevewhittle.net/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Steve Whittle]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[stevewhittle@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[stevewhittle@substack.com]]></itunes:email><itunes:name><![CDATA[Steve Whittle]]></itunes:name></itunes:owner><itunes:author><![CDATA[Steve Whittle]]></itunes:author><googleplay:owner><![CDATA[stevewhittle@substack.com]]></googleplay:owner><googleplay:email><![CDATA[stevewhittle@substack.com]]></googleplay:email><googleplay:author><![CDATA[Steve Whittle]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Floor Rises]]></title><description><![CDATA[What happens when AI moves off the Cloud?]]></description><link>https://www.stevewhittle.net/p/the-floor-rises</link><guid isPermaLink="false">https://www.stevewhittle.net/p/the-floor-rises</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Thu, 18 Jun 2026 22:21:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dvVM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c21eed0-59ce-4a63-8b65-364e80d4beba_2816x1536.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dvVM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c21eed0-59ce-4a63-8b65-364e80d4beba_2816x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dvVM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c21eed0-59ce-4a63-8b65-364e80d4beba_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!dvVM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c21eed0-59ce-4a63-8b65-364e80d4beba_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!dvVM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c21eed0-59ce-4a63-8b65-364e80d4beba_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!dvVM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c21eed0-59ce-4a63-8b65-364e80d4beba_2816x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dvVM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c21eed0-59ce-4a63-8b65-364e80d4beba_2816x1536.png" width="1456" height="794" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c21eed0-59ce-4a63-8b65-364e80d4beba_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6831418,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/202648295?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c21eed0-59ce-4a63-8b65-364e80d4beba_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dvVM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c21eed0-59ce-4a63-8b65-364e80d4beba_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!dvVM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c21eed0-59ce-4a63-8b65-364e80d4beba_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!dvVM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c21eed0-59ce-4a63-8b65-364e80d4beba_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!dvVM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c21eed0-59ce-4a63-8b65-364e80d4beba_2816x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>At WWDC this month, Apple introduced a piece of software it calls the system orchestrator. It routes every AI request to the model that should handle it, on the device or in the cloud, based on how much compute the task needs and how sensitive the data is. Craig Federighi called it the key to the system&#8217;s privacy architecture.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.stevewhittle.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>That routing decision is the most consequential architectural shift in consumer AI since the transformer. Not the model. The decision about where the model runs.</p><h3><strong>Cloud by default was never a decision</strong></h3><p>From 2022 through 2024, anything that looked like reasoning went to an API. GPT-4, Claude, and Gemini all lived in central data centers. On-device AI meant autocomplete and spell check. The hardware floor on a phone was too low to run a capable model, so the model lived somewhere else.</p><p>The constraint felt permanent because it had always been true. But &#8220;always has&#8221; does not mean &#8220;always will.&#8221; We made the same error with local storage, with video on mobile, with encryption in the browser. The thing that is too expensive to do locally stays in the data center right up until it doesn&#8217;t.</p><h3><strong>The floor is already higher</strong></h3><p>Look at what already left the cloud. Real-time transcription runs on your phone. Translation runs on your phone. Photo understanding, system-wide dictation, short-form classification, writing suggestions. None of these phones home anymore. Apple&#8217;s second on-device model, the more capable one for current Apple silicon, understands speech, reads text and images, and never leaves the device.</p><p>We should measure the floor by what it can do, not by its parameter count. 
In 2023 the floor was &#8220;can it follow an instruction.&#8221; In 2026 the floor is &#8220;can it handle most single-document knowledge work without a network connection.&#8221;</p><p>The reason this happened so fast is the efficiency curve underneath. Sparse activation, distillation, and aggressive quantization decoupled capability from cost. A sparse model carries a large parameter count but uses only a fraction of those weights on any given token. The headline says twenty billion parameters. The actual work on a prompt might touch a few billion. That gap is the trick that put a real model on a phone.</p><p>The floor doesn&#8217;t rise on a smooth line. It jumps when a new efficiency technique lands and then sits flat until the next one. Mixture-of-experts was a jump. Quantization down to four bits was a jump. Nobody knows the slope of the next five years, but the direction is obvious.</p><h3><strong>Routing is the product, not the model</strong></h3><p>The orchestrator was the most underreported thing Apple showed. The press wrote about Siri&#8217;s new voice. The architecture was the story that was glossed over.</p><p>What does the orchestrator decide? Latency tolerance, privacy sensitivity, task complexity, and the hardware tier of the device in your hand. Send the easy local thing local. Send the hard private thing to Private Cloud Compute. This is not a new problem in infrastructure terms. A CDN decides whether to serve from the edge or the origin. A database decides whether a read hits a replica or the primary. Routing a request to the cheapest box that can satisfy it is the way we&#8217;ve always done it.</p><p>The new part is what the routing decision means. A CDN serving a cached asset gives you the same asset. An orchestrator routing to the small model gives you a different answer than the big model would have. The routing decision now changes the quality of the intelligence, not just the latency. 
That is a category of routing we haven&#8217;t built governance for.</p><p>Whoever owns the routing layer owns the usage signal. The model that gets called is the model that sees the data, collects the feedback, and improves fastest. Apple makes that decision for more than a billion devices.</p><p>This is how Apple played it. The next-generation foundation models are being co-developed with Google&#8217;s Gemini, and the cloud tier runs on Nvidia GPUs inside Google&#8217;s cloud. Apple did not insist on owning the best model. It rented the model and kept the layer that decides which model you reach. The model is rentable. The routing layer is not. This is the key. If, somewhere down the road, Apple decides to switch to its own cloud-based AI, it can. This is what happened with mapping software: Apple partnered with Google until it developed its own. Who knows whether Apple will develop its own models, but if it does, it can turn them on very quickly.</p><p>The open question is whether the orchestrator stays an honest dispatcher or becomes a toll booth.</p><h3><strong>Regulation is fragmenting AI by geography</strong></h3><p>Consumer privacy preference is real and soft. People say they care and then tap accept. Regulation does not depend on what anyone prefers. It is hard, and it is fragmenting AI deployment along national lines.</p><p>Look at the Siri example: At WWDC, Apple announced that Siri AI will not ship in the EU with iOS 27. Apple framed the block as privacy and security. The cause is actually the Digital Markets Act, which is competition law, not privacy law. The European Commission rejected Apple&#8217;s framing in public and without hedging. Nothing in the DMA blocked the launch. The delay was Apple&#8217;s own choice. The real objection is that Siri AI, which is powered by Google&#8217;s Gemini, would lock out competing assistants on a platform that reaches hundreds of millions of users. 
The Commission&#8217;s spokesman named the Gemini detail on purpose.</p><p>This is not &#8220;privacy law is forcing AI onto devices.&#8221; It&#8217;s deeper than that. AI deployment is fragmenting by jurisdiction, and the pressure comes from several different bodies of law at once. Competition law in Brussels. Data localization in India, China, Saudi Arabia, and Brazil. A model that runs locally sidesteps an entire class of these fights because the data never crosses a border. The on-device model is not just faster and more private. It is jurisdictionally invisible.</p><p>This is where sovereignty stops being a slogan. A regulated enterprise or a government does not want a privacy assurance about someone else&#8217;s data center. It wants the weights, the inference stack, and the audit trail inside its own borders. Private Cloud Compute, however well it is engineered, is still someone else&#8217;s cloud in someone else&#8217;s country. The requirement is on-premise open-weight deployment. Models like Llama and Mistral are important here for a reason that has nothing to do with topping a benchmark. You can run them where the law says the data has to stay.</p><p>US export controls accelerate all of this. A country that cannot count on access will build or source its own. The recent retraction of Anthropic&#8217;s Fable 5 model is a real-world example of this. Scarcity is a fast teacher of self-reliance. The open question is whether on-device and on-premise can satisfy the provenance and audit demands that regulators will have.</p><h3><strong>Losing the commodity tier is not the threat</strong></h3><p>The tasks leaving the cloud were never the high-margin tasks. The person summarizing email on a phone was not paying for an API call. Losing that load costs the frontier labs close to nothing.</p><p>The real exposure is the middle. Function-level code completion. Document question-and-answer. Single-session knowledge work. 
These are mid-complexity and high-volume, and they are moving on-device, likely inside two years. The timing will vary based on the use case.</p><p>The floor does something quieter to the rest. When good-enough AI is free, private, and already in your pocket, every upgrade has to clear a higher bar to earn a dollar. Apple&#8217;s orchestrator teaches a billion people that AI is the free thing that already lives on the phone. That reshapes what AI is worth in the median person&#8217;s head before any frontier lab gets to make its pitch.</p><p>Volume and revenue are not the same curve. A frontier provider can lose sixty percent of its query volume and far less than sixty percent of its revenue, as long as it priced the commodity tier near zero in the first place. The query that left was the query they weren&#8217;t making money on anyway.</p><p>So who is actually exposed? Not the top labs but the mid-tier API resellers and the vertical SaaS products whose only moat was access to commodity inference. When the inference is free and local, the wrapper has nothing left to sell. The move for the labs at the top is to go up. Own the reasoning and the agentic layer. Concede the commodity tier without a fight, because fighting for it means competing with free. The open question is whether &#8220;own the reasoning layer&#8221; is a durable position or just the next floor waiting to rise under it.</p><h3><strong>The grid did not die. Its economics did.</strong></h3><p>The useful precedent is electricity, not software. As the electrical grid matured, generation moved closer to the load. Rooftop solar, home batteries, and the central grid now coexist. Nobody decommissioned the grid. What changed permanently was the economics of building another central plant when the marginal load could be served closer to home.</p><p>The analogy holds here too. Local handles the frequent, ambient, low-stakes load. Central handles the peak, the complex, the high-stakes work. 
Neither one kills the other.</p><p>But here is where the analogy breaks: Electricity is fungible. A kilowatt-hour off your roof and a kilowatt-hour off the grid are the same kilowatt-hour, and your refrigerator cannot tell the difference. Intelligence is not fungible. The answer from the local model and the answer from the frontier model are different answers, sometimes meaningfully different, and the user often cannot tell which one they got.</p><p>The analogy gives us the shape and then fails at the load-bearing joint. The floor rising does not make the ceiling matter less. It makes the gap between floor and ceiling the thing you are selling. The product stops being &#8220;AI.&#8221; The product becomes the difference between the AI everyone has for free and the AI that is worth paying for. Pricing that gap, and proving it exists on any given task, is the whole game now.</p><h3><strong>Where do we go from here?</strong></h3><p>We can see the floor rising. We can see the gap becoming the product. What&#8217;s still unknown is who governs the routing layer.</p><p>Today Apple decides for Apple devices. Google decides for Android. Microsoft decides for Windows Copilot. None of these is neutral infrastructure. Each one is a company that also sells models, which means each routing decision is made by a party with a stake in the outcome.</p><p>The flywheel is the important part. The model that gets called gets the usage signal. The usage signal becomes the improvement that makes it the model worth calling next time. Whoever holds the orchestrator holds the flywheel. Right now the platform owners are holding it, and they did not have to buy a single frontier lab to do it.</p><p>If routing gets captured at the platform layer, the frontier labs become wholesale suppliers to a retail counter they do not own. 
They will have built the best engine in the world and handed the dashboard to someone else.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.stevewhittle.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Context Rot]]></title><description><![CDATA[Not all context is created equal]]></description><link>https://www.stevewhittle.net/p/context-rot</link><guid isPermaLink="false">https://www.stevewhittle.net/p/context-rot</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Thu, 04 Jun 2026 23:12:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!m-Ie!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe945a4b2-7de1-4332-b1dd-6847292dc485_2816x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m-Ie!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe945a4b2-7de1-4332-b1dd-6847292dc485_2816x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!m-Ie!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe945a4b2-7de1-4332-b1dd-6847292dc485_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!m-Ie!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe945a4b2-7de1-4332-b1dd-6847292dc485_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!m-Ie!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe945a4b2-7de1-4332-b1dd-6847292dc485_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!m-Ie!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe945a4b2-7de1-4332-b1dd-6847292dc485_2816x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m-Ie!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe945a4b2-7de1-4332-b1dd-6847292dc485_2816x1536.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e945a4b2-7de1-4332-b1dd-6847292dc485_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6836613,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/200690674?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe945a4b2-7de1-4332-b1dd-6847292dc485_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!m-Ie!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe945a4b2-7de1-4332-b1dd-6847292dc485_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!m-Ie!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe945a4b2-7de1-4332-b1dd-6847292dc485_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!m-Ie!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe945a4b2-7de1-4332-b1dd-6847292dc485_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!m-Ie!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe945a4b2-7de1-4332-b1dd-6847292dc485_2816x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Morph LLM published a study in March 2026 with a claim that every frontier model they tested gets worse at reasoning as you give it more to read. The output quality drops in a way you can measure and reproduce, and it happens on the exact models you are paying for. The instinct is to file this under &#8220;models are unpredictable&#8221; and move on. That instinct is the mistake this post is about.</p><p>Let&#8217;s start with the symptom, because you have probably already seen it. A multi-stage agent runs cleanly through step three. It reasons correctly, calls the right tools, returns what you expected. By step twelve the same agent, on the same task type, with the same model and the same prompt pattern, is producing degraded output. Nothing about the task got harder. Something about the <em>context</em> got heavier. That gap, between a system that works early and the same system failing late, is what I want to talk about. It&#8217;s called <strong>context rot</strong>, and it&#8217;s architectural, which is the part that matters. You can&#8217;t upgrade your way out of an architectural property.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.stevewhittle.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The spec sheet measures the wrong thing</h2><p>When a vendor prints 128K on the box, that number is a ceiling on what the model will <em>accept</em>. It says nothing about what the model can reason over <em>reliably</em>. Those are different quantities, and the gap between them is enormous.</p><p>Paulsen and colleagues gave the distinction names in September 2025. The Maximum Context Window is the spec sheet figure. The <strong>Maximum Effective Context Window</strong> is the point where accuracy meaningfully degrades for a given task. Their findings show that some top models showed severe degradation by 1,000 tokens. A few broke down with as little as 100 tokens in context. Across everything they tested, the effective window fell short of the advertised one by as much as 99 percent.</p><p>That&#8217;s a big difference. A 128K window can deliver reliable reasoning over something closer to 1K on a hard task. The other 127K is tokens the model will dutifully process and cannot actually use well.</p><p>The effective limit also moves with the job. A flat retrieval task and a multi-hop reasoning task over the same corpus have different effective windows, because reasoning over scattered evidence stresses attention harder than fetching one fact. So there is no single number to memorize. You need to figure out for yourself how it will impact your workflow.</p><p>The RAM comparison is exact, not loose. A machine with 128GB of memory does not run every workload 128 times faster than a 1GB machine. The number is capacity, not throughput. Context windows are the same. 
We have been reading a capacity spec as a performance spec, and the vendors are happy to let us.</p><h2>Why attention thins out</h2><p>Three mechanisms compound as context grows. You want the mechanisms, not just the result, because the mechanisms tell you what to do.</p><p><strong>Position is not neutral.</strong> Liu and colleagues at Stanford showed transformer attention follows a U-shape across token positions. The model attends strongly to the start and the end of its context and weakly to the middle. In a 20-document QA task, accuracy dropped more than 30 percent when the relevant document sat in positions 5 through 15 versus position 1 or 20. The same fact, the same model, moved a few thousand tokens inward, and the model half-forgets it is there. Models trained explicitly on long context still showed it. So the first thing to internalize: <em>where</em> a token sits changes whether the model can use it most effectively.</p><p><strong>Attention is a fixed budget split more ways.</strong> The attention mechanism distributes a finite amount of probability mass across every token in context. Add tokens and each one gets a thinner slice. This is not a tuning artifact. It falls out of the softmax math. More context does not mean more attention to go around. It means the same attention spread thinner, and past some point the model can no longer cleanly separate the signal token from the noise around it.</p><p><strong>The curve flattens early.</strong> A model reading 100K tokens is not getting ten times the benefit of one reading 10K. Positional undertraining and the constraints of rotary position embeddings mean effective utilization grows far slower than the token count. The useful curve bends down well before the nominal limit. 
You are paying linearly for tokens whose value is already sub-linear.</p><p>Put the three together and the picture is not &#8220;long context is a bit lossy.&#8221; It is that the middle of a large context is a place where information goes to be underweighted, and you are filling it on purpose.</p><h2>Agent loops manufacture their own noise</h2><p>Static document Q&amp;A is forgiving. You load a corpus once and ask questions. Agent loops are much less forgiving because they generate the very noise that rots them.</p><p>Watch what accumulates in a long-running loop. Verbose tool outputs, most of which were only partially relevant. Reasoning paths the model explored and abandoned. Intermediate states that are no longer true. And the one almost everyone forgets: prior turns where the user chose between options.</p><p>That last one deserves a deeper look, because it is the clearest illustration of the whole problem. The model offers two options. You pick the first. The second option is not dropped. It sits in context carrying the same weight as the path you chose. Ten turns later the model is reasoning over a growing pile of decisions you already killed. The dead ends do not decay. They sit alongside the live thread, drawing from the same thinning attention budget, and the model has no way to know they are dead unless you tell it.</p><p>This is why Morph LLM&#8217;s work, which folded in Chroma&#8217;s research on coding agents, landed on context rot as the <em>primary</em> failure mode for agentic coding. Not model capability. Not reasoning quality. The models are capable enough. The context they are handed is too noisy, and most of the noise is self-inflicted.</p><p>The diagnostic is precise once you know where to look. Degradation that scales with session length rather than task complexity is context rot. If your agent clears step three and fails step twelve with no change in task type, stop tuning the prompt. You are not looking at a capability problem. 
You are looking at a hygiene problem.</p><h2>The two curves cross</h2><p>Everything so far is about quality. The cost side is where it stops being an abstraction and starts showing up on the bill.</p><p>The KV cache scales with context length, and at high token volumes that scaling is not gentle. Latency climbs. VRAM consumption grows. Throughput falls. None of these track token count proportionally. They accelerate.</p><p>Think about how that pairs with the quality finding. The cost per useful unit of work is rising at exactly the point where each token is delivering less. This is not ordinary diminishing returns, where you pay the same for a bit less. The cost curve bends <em>up</em> while the quality curve bends <em>down</em>. Those two curves cross. Past the crossing point you are paying more to get worse output, and nothing in the system will let you know you&#8217;ve crossed it.</p><p>Mixture-of-Experts architectures make the diagnosis harder, not easier. They can hide this dynamic underneath infrastructure bottlenecks, so the system looks healthy right up until it is obviously not.</p><h2>Manage it like memory, because that is what it is</h2><p>The bottom line is that context is a resource with a budget, and right now most systems don&#8217;t see it that way. Here are three things you can do:</p><p><em>Budget context against your effective window, not the spec sheet.</em> Estimate the effective window for your task type and set a token budget per loop or session against that number, not the advertised one. When you hit 60 to 70 percent of the budget, trigger a checkpoint. Make the checkpoint a structured state object, not a prose summary: decisions made, options rejected, current state. Then continue from the checkpoint instead of the full transcript. 
You are choosing what survives into the next turn rather than letting accumulation choose for you.</p><p><em>Prune the dead weight.</em> A tool call that returned nothing useful should be stripped or compressed before the next turn. A rejected option should collapse to a single line. The model does not need the full text of a path it is not taking, and every token you remove is a token that can no longer dilute attention on the tokens that matter. Do this as part of the loop.</p><p><em>Measure degradation as a function of length, not just at turn one.</em> Run your evals at turns 5, 10, and 20, not only at the start. Watch accuracy as context grows on your specific task. If you cannot state your effective window as a number, you do not know where your system fails, which means you are flying blind.</p><p>A bigger window is not the solution. The Morph LLM data spans GPT-4.1, Claude Opus 4, and Gemini 2.5 Pro. Every one degrades. The window sizes differ. The architectural property does not. A larger window buys you more room to accumulate noise, not more reasoning.</p><h2>What the vendors are not selling you</h2><p>Context window size has become the spec that vendors compete on, the way hardware vendors once competed on raw clock speed. The number is real. What it implies about useful work is a different claim, and it&#8217;s the claim you don&#8217;t tend to hear about.</p><p>The engineers who build reliable long-context systems will not be the ones with the largest windows. They will be the ones who treat context the way a systems engineer has always treated memory: as a scarce resource with budgets, checkpoints, and active reclamation. The model will not do this for you. It cannot garbage-collect its own context. It has no concept of which tokens are dead. 
That judgment, what to keep and what to get rid of, is the part that does not come in the box, and it&#8217;s up to you to figure out.</p>]]></content:encoded></item><item><title><![CDATA[The Cost and Compliance Trap]]></title><description><![CDATA[Plan for it now or pay for it later]]></description><link>https://www.stevewhittle.net/p/the-cost-and-compliance-trap</link><guid isPermaLink="false">https://www.stevewhittle.net/p/the-cost-and-compliance-trap</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Mon, 01 Jun 2026 19:08:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RXH3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eff52a-9168-4d25-a77a-084f148f81fb_2816x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RXH3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eff52a-9168-4d25-a77a-084f148f81fb_2816x1536.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RXH3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eff52a-9168-4d25-a77a-084f148f81fb_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!RXH3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eff52a-9168-4d25-a77a-084f148f81fb_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!RXH3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eff52a-9168-4d25-a77a-084f148f81fb_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!RXH3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eff52a-9168-4d25-a77a-084f148f81fb_2816x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RXH3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eff52a-9168-4d25-a77a-084f148f81fb_2816x1536.png" width="1456" height="794" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97eff52a-9168-4d25-a77a-084f148f81fb_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6959921,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/200169982?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eff52a-9168-4d25-a77a-084f148f81fb_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RXH3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eff52a-9168-4d25-a77a-084f148f81fb_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!RXH3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eff52a-9168-4d25-a77a-084f148f81fb_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!RXH3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eff52a-9168-4d25-a77a-084f148f81fb_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!RXH3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97eff52a-9168-4d25-a77a-084f148f81fb_2816x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>In December 2025, the Google Cloud CTO office published a post-mortem on what they called the &#8220;reliability gap&#8221; in agentic AI systems. The core finding was not subtle: the move from deterministic software to probabilistic agentic workflows introduces failure modes that most teams only discover after architecture decisions are locked. 
The paper described multi-step agent chains operating without transaction coordination, creating non-atomic failures that can produce irreversible side effects and data corruption mid-operation.</p><p>That is not a model quality problem. It is a structural design problem. And it is expensive in two ways that most AI cost models don&#8217;t account for.</p><h2>The bill arrives after the architect has left</h2><p>LLM inference costs scale with volume and context length. Everyone knows this. Where it gets more complicated is when you chain agents together.</p><p>Every step in a multi-agent workflow triggers at least one model call and, potentially, retries. Context lookups trigger more. Add a reflection layer, an orchestration agent, a validation pass, and your per-transaction cost can explode.</p><p>The problem stays hidden because the costs don&#8217;t start to mount until the system goes into production and the builders have moved on. By the time you see the bill, it&#8217;s too late.</p><p>The vendors selling agent frameworks are not incentivized to surface this. Their pricing models are based on consumption. 
A platform that makes it trivially easy to chain five agents together is probably not going to ship a cost-impact calculator so you know what you are getting into. You discover the problem at the quarterly cloud invoice review.</p><p>Right-sizing is not hard to figure out: deterministic execution for stable, high-volume paths, and probabilistic reasoning reserved for genuinely ambiguous edge cases where it creates real value. Many teams reach for an agent because the demo was impressive, not because the task required reasoning under uncertainty.</p><h2>Regulation does not grade on a curve</h2><p>The EU AI Act imposes documentation, transparency, and auditability obligations on high-risk AI systems, with phased enforcement running through 2027. At its most basic level, a system subject to these rules must be able to reconstruct its decision logic. It must be able to answer, on demand, why a particular output was produced.</p><p>A live LLM agent in the execution loop can&#8217;t answer that question. The reasoning is not stored. The attention weights are not auditable. The path from input to output runs through billions of parameters and produces no human-readable trace of its logic.</p><p>A deterministic workflow that was generated by an AI can answer that question. The code exists. The execution path is traceable. The logic is inspectable.</p><p>This distinction matters beyond Europe. Any organization operating in healthcare, finance, insurance, or legal already faces sector regulation with equivalent requirements. HIPAA audit trails. SEC recordkeeping. Insurance claim adjudication documentation. These obligations don&#8217;t go away because you replaced a rules engine with an agent. They get stronger, because the regulator now wants to understand the AI component specifically.</p><p>The compliance teams asking these questions are not being obstructionist. 
They are pointing at a real architectural gap that the engineering team may not have modeled as a risk.</p><h2>Trust is already declining</h2><p>Augment Code&#8217;s 2025 developer survey found that trust in AI-generated code accuracy dropped from 40% to 29% in a single year. That is not a rounding error. Adoption grew over the same period, reaching 62% of developers, but confidence in the output fell sharply.</p><p>More developers are using AI. Fewer of them trust what it produces. Developers are shipping AI-assisted work while privately applying a larger mental discount to its reliability. That discount has a cost. It shows up as increased review time, more defensive testing, slower iteration on the critical paths where the stakes are highest.</p><p>Enterprise teams that have moved to hybrid architectures report a different pattern. When agents set goals and orchestrate tasks but critical computations run inside deterministic modules, the LLM surface area at runtime shrinks. The auditability problem gets smaller because the probabilistic component is doing less. Data shows teams using this approach achieving over 80% reduction in manual effort while maintaining the traceability their compliance functions require.</p><h2>What the architecture decision actually is</h2><p>Treating atomicity as an infrastructure requirement instead of a prompting challenge changes what you build. It means designing agent chains with transaction boundaries, rollback logic, and explicit failure modes before you write the first prompt. It means asking, for every step in the workflow, whether this step requires reasoning under uncertainty or whether it requires reliable execution of a known procedure.</p><p>Most enterprise AI architectures are not designed this way. They are grown from demos. 
The agent that worked in the proof of concept becomes the agent in production, with more context, higher volume, and a compliance audit scheduled for Q3.</p><p>The Google Cloud paper put it directly: non-atomic failure modes create irreversible side effects. That is an infrastructure problem. Changing the system prompt does not fix it.</p><p>The question I keep coming back to is whether the hybrid architecture pattern will be adopted proactively or reactively. Some teams will design for it from the start. Most will discover the need for it after the first incident that cannot be explained to a regulator, or the first quarter where inference costs exceed the value the system generated.</p>
]]></content:encoded></item><item><title><![CDATA[When all you have is a Hammer]]></title><description><![CDATA[All problems look like nails]]></description><link>https://www.stevewhittle.net/p/when-all-you-have-is-a-hammer</link><guid isPermaLink="false">https://www.stevewhittle.net/p/when-all-you-have-is-a-hammer</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Thu, 28 May 2026 14:58:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0udg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57ed02e8-349a-49be-b895-c8a08223a044_2816x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0udg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57ed02e8-349a-49be-b895-c8a08223a044_2816x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0udg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57ed02e8-349a-49be-b895-c8a08223a044_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!0udg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57ed02e8-349a-49be-b895-c8a08223a044_2816x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!0udg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57ed02e8-349a-49be-b895-c8a08223a044_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!0udg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57ed02e8-349a-49be-b895-c8a08223a044_2816x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0udg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57ed02e8-349a-49be-b895-c8a08223a044_2816x1536.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57ed02e8-349a-49be-b895-c8a08223a044_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6842233,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/199372432?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57ed02e8-349a-49be-b895-c8a08223a044_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0udg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57ed02e8-349a-49be-b895-c8a08223a044_2816x1536.png 424w, 
https://substackcdn.com/image/fetch/$s_!0udg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57ed02e8-349a-49be-b895-c8a08223a044_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!0udg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57ed02e8-349a-49be-b895-c8a08223a044_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!0udg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57ed02e8-349a-49be-b895-c8a08223a044_2816x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Nobody set out to discover a better architecture for AI systems. It happened because someone typed &#8220;write me a function that parses this JSON&#8221; and the result just worked.</p><p>The LLM ran once. It interpreted the intent, produced code, and exited. The code ran in CI, in production, on a schedule. No LLM in the loop. Nobody called it an &#8220;AI agent.&#8221; They called it software.</p><p>This was not a product decision. It was not an architectural principle. It was the path of least resistance. The developer got what they needed and moved on. The LLM was done before the workflow started.</p><p>That accident of framing contains the most important architectural insight in AI systems today. Most teams building agentic workflows have completely missed it.</p><h2>The hammer is real. The grip is the problem.</h2><p>The capabilities are genuine. An LLM can interpret ambiguous human intent, navigate unstructured input, and produce structured output from near-infinite input variations. Those are hard problems. 
LLMs solve them well.</p><p>The mistake is assuming that because an LLM can do something at design time, it should keep doing it at every execution. On every single run.</p><p>Maslow&#8217;s famous observation was that if your only tool is a hammer, every problem looks like a nail. The AI industry&#8217;s current posture is a precise instance of this. We have a remarkable tool. We are applying it to every stage of the pipeline, including stages where it has no business being.</p><p>The problem is not that LLMs are unreliable. The problem is that <strong>probabilistic systems do not belong in deterministic execution paths</strong>.</p><h2>What &#8220;probabilistic in the execution path&#8221; actually costs</h2><p>The math is unforgiving. A 20-step agentic workflow where each step succeeds 95% of the time produces an end-to-end success rate of 36%. This is not a model quality problem. Better models improve the per-step number. They do not change the compounding structure. The architecture is the problem.</p><p><a href="https://arxiv.org/abs/2402.01817">Kambhampati et al.</a> established the theoretical foundation for this in their ICML 2024 paper on LLM-Modulo frameworks. Auto-regressive LLMs cannot, by themselves, do planning or self-verification. The LLM is a probabilistic knowledge source. Treating it as a deterministic executor is a category error.</p><p><a href="https://arxiv.org/abs/2603.25764">Mehta&#8217;s recent consistency research on SWE-bench</a> sharpens this further in a way most people have not fully processed. Across 150 agent trajectories, the finding was not that inconsistent models fail more. It was that consistency amplifies outcomes rather than guaranteeing them. Claude achieved 58% accuracy with 15.2% behavioral variance. When it interpreted a task correctly, it succeeded on 100% of runs. When it got the interpretation wrong, it failed on 100% of runs. 
The same wrong answer, five times in a row, with high confidence.</p><p>A model making the same incorrect interpretation on every run is worse than a randomly failing one. The failure is invisible until it compounds, and reflection-based recovery cannot help if the initial interpretation is wrong.</p><p><a href="https://arxiv.org/abs/2601.06112">ReliabilityBench</a> (Gupta, 2026) adds the production dimension. Agents achieving 96.9% accuracy on clean benchmarks drop to 88.1% under realistic perturbations. The paper&#8217;s broader finding: metrics on clean data overestimate production reliability by 20 to 40 percent. The gap is not a model problem. It is a measurement problem that reveals an architectural one.</p><p>Better models improve the per-step number. They do not change the compounding structure.</p><h2>The pattern developers discovered by accident</h2><p>Code generation is the canonical LLM-as-compiler case. The LLM interprets intent and emits a deterministic artifact. That artifact runs without the LLM. Nobody designed this as an architecture. It emerged because &#8220;write me code&#8221; has a natural exit condition: the code either works or it does not.</p><p>The insight generalizes far beyond code. A policy document can become an OPA rule set. A process description can become a Temporal workflow definition. Decision criteria can become a scoring model. The LLM&#8217;s job in each case is translation from human intent to a deterministic artifact. Once the artifact exists, the LLM is done.</p><p><a href="https://arxiv.org/abs/2310.03714">Khattab et al.&#8217;s DSPy framework</a> from Stanford is the explicit formalization of this idea. Compile the prompt and retrieval logic into an optimized, repeatable program rather than invoking raw LLM calls at runtime. The artifact owns the execution. The LLM&#8217;s work happens once, at design time.</p><p>The reason this pattern is underused outside of engineering contexts is not that it does not apply. 
It is that business users do not yet have the intuition that &#8220;the AI is finished&#8221; is a valid and preferable state. Finished means auditable. Finished means version-controlled. Finished means you can write a test for it.</p><h2>When the LLM does belong in the path</h2><p>There is a class of workflow where LLM-as-compiler fails. The failure mode is unstructured, novel input at execution time.</p><p>Customer support is the clearest case. You can compile decision trees for known issue categories. You cannot compile a response to a customer describing a problem you have never seen, in language you cannot predict, with emotional context that changes the appropriate answer.</p><p>The same logic applies to open-ended document classification where the category space is not closed; to real-time anomaly detection on free text; to any workflow where the output depends on semantic understanding of content that did not exist when the workflow was built.</p><p>The LLM belongs in the execution path when the input space at runtime is genuinely open-ended. That is a much smaller category than most people think.</p><p>The question is not &#8220;can an LLM handle this at runtime?&#8221; The question is &#8220;does this workflow require runtime interpretation, or was the interpretation problem already solved at design time?&#8221;</p><h2>The decision rule</h2><p>Can you fully describe the execution logic as a flowchart, rule set, or decision tree before the workflow runs?</p><p>If yes, the LLM&#8217;s job is to build that artifact. Not to approximate it on every execution.</p><p>If the answer is no, because the logic depends on content that cannot be anticipated or because the decision space is genuinely open, then the LLM belongs in the path.</p><p>The key to reliability is to apply this test to every step of every workflow and minimize LLM surface area at runtime. Not because LLMs are bad. 
Because deterministic systems are auditable, testable, version-controllable, and do not compound errors across steps.</p><p>Most teams are not asking this question. They are asking &#8220;how do we make the agent more reliable?&#8221; That is the wrong starting point.</p><p>If you are running an LLM on every execution of a workflow that could have been compiled into a deterministic artifact, you have not built an AI system. You have built a reliability problem with an AI-shaped interface.</p><p>The first question to ask is not whether your LLM is good enough. It is whether it needs to be there at all.</p>]]></content:encoded></item><item><title><![CDATA[Does AI Fit in the workflow?]]></title><description><![CDATA[Not every workflow needs it]]></description><link>https://www.stevewhittle.net/p/does-ai-fit-in-the-workflow</link><guid isPermaLink="false">https://www.stevewhittle.net/p/does-ai-fit-in-the-workflow</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Tue, 26 May 2026 19:19:49 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!XSQb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0524374-636b-45aa-89b4-9a4abd47f6d2_2816x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XSQb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0524374-636b-45aa-89b4-9a4abd47f6d2_2816x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XSQb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0524374-636b-45aa-89b4-9a4abd47f6d2_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!XSQb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0524374-636b-45aa-89b4-9a4abd47f6d2_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!XSQb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0524374-636b-45aa-89b4-9a4abd47f6d2_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!XSQb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0524374-636b-45aa-89b4-9a4abd47f6d2_2816x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XSQb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0524374-636b-45aa-89b4-9a4abd47f6d2_2816x1536.png" width="1456" height="794" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c0524374-636b-45aa-89b4-9a4abd47f6d2_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5563034,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/199372011?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0524374-636b-45aa-89b4-9a4abd47f6d2_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XSQb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0524374-636b-45aa-89b4-9a4abd47f6d2_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!XSQb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0524374-636b-45aa-89b4-9a4abd47f6d2_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!XSQb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0524374-636b-45aa-89b4-9a4abd47f6d2_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!XSQb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc0524374-636b-45aa-89b4-9a4abd47f6d2_2816x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In October 2023, New York City launched an AI chatbot to help small business owners navigate city regulations. The mayor&#8217;s office called it a frictionless doorway into City Hall. Five months later, investigative journalists at The Markup tested it and found it was confidently telling landlords they did not have to accept Section 8 housing vouchers. That is illegal in New York City. The same bot told employers they could take a cut of their workers&#8217; tips. Also illegal. When ten journalists asked the same housing question independently, all ten got the same wrong answer.</p><p>The city added a disclaimer. The mayor defended it and kept it running. The next administration eventually called it &#8220;functionally unusable&#8221; and moved to shut it down. 
The bill for building it: roughly $600,000.</p><p>Nobody asked the right question before launch. Not &#8220;can we build this&#8221; but &#8220;does a probabilistic text system belong in the execution path of legal and regulatory guidance, where there is exactly one correct answer per question and the person asking has no way to know when they&#8217;ve received the wrong one?&#8221;</p><p>That question, whether AI belongs in a given workflow at all, is the one most organizations skip. They go straight to &#8220;how do we make this work&#8221; and treat the premise as settled. It isn&#8217;t always.</p><h2>The question teams are not asking</h2><p>When an AI deployment struggles in production, the instinct is to reach for better models, more data, or tighter prompt engineering. The conversation becomes technical quickly. That is the wrong reflex.</p><p>The right first question is simpler and more uncomfortable: should AI be in this workflow at all?</p><p>Not &#8220;can it be done&#8221; but &#8220;should it be done here, in this position, with these consequences attached.&#8221;</p><p>Many teams never ask this. Gartner projects that 40% of enterprise applications will feature task-specific AI agents by 2026. 
MindStudio found that only 23% of companies run fully autonomous agent systems despite 88% already applying AI somewhere. The gap between &#8220;we use AI&#8221; and &#8220;we have thought carefully about where AI runs&#8221; is huge. That gap is where the expensive mistakes live.</p><h2>Blast radius is the first variable</h2><p><strong>The first question to ask about any workflow is: what happens when this output is wrong?</strong></p><p>Not if. When. Every probabilistic system produces wrong answers. The question is what those wrong answers cost.</p><p>A wrong answer in marketing copy gets caught in review and fixed in ten minutes. A wrong answer in a financial reconciliation can trigger a downstream payment workflow before any human sees it. A wrong answer in a regulatory guidance tool may prompt a landlord to illegally reject a tenant, with no visibility until a fair housing complaint lands.</p><p>This is blast radius. It is not a measure of how often the system fails. It is a measure of how bad a single failure can get. High blast radius workflows require a different standard of scrutiny before AI gets anywhere near the execution path.</p><p>The NYC chatbot did not fail because the underlying model was bad. It failed because the blast radius of wrong legal guidance is real harm to real people. The city had no way to know how many business owners had already acted on the bad advice before The Markup published its findings.</p><h2>Detectability is the second variable</h2><p><strong>The second question is: how would you know the output was wrong, and how quickly?</strong></p><p>This one is underweighted in almost every AI deployment conversation I have seen. Teams spend enormous effort on accuracy and almost no effort on error visibility.</p><p>Some errors are self-evident. A generated email that addresses the recipient by the wrong name gets caught before it sends. Some errors are invisible. 
A landlord who received incorrect guidance from an official city tool has no reason to doubt it. The tool carried government branding. It answered confidently. The correct answer was on a different page of the same website, and nobody told them to go look there.</p><p>The combination that kills enterprises is low detectability paired with high blast radius. That is the dangerous quadrant. Errors that are hard to see and expensive when they land. That is where autonomous agents are being deployed right now at scale.</p><p>The EU AI Act, phased in between 2024 and 2027, is not primarily about model accuracy. It is about auditability. Regulators are not asking whether your AI is usually right. They are asking whether you can prove what it did and why. That is a detectability requirement disguised as a compliance requirement.</p><h2>Variance is only a feature in the right context</h2><p><strong>The third question is: does the right answer change between runs, or is there one correct answer?</strong></p><p>Large language models are non-deterministic by design. They can produce different outputs given the same input. For some tasks that variance is the point. Generating five options for a campaign headline and picking the best one is a workflow where variance creates value.</p><p>For other tasks variance is a defect. A financial reconciliation has one correct answer. A patient&#8217;s medication dosage has one correct answer. A contract clause either complies with local law or it does not.</p><p>The NYC chatbot illustrated this with unusual clarity. Whether a landlord is legally required to accept a housing voucher is not a matter of interpretation. There is a correct answer. Yet the bot gave different answers at different times and gave the wrong answer to all ten journalists who tested it independently. A system producing variance in a zero-variance workflow is not a system having a bad day. 
It is a system that was never fit for the task.</p><p>Variance only creates value where the space of acceptable outputs is large and the evaluator is a human who can use judgment to select. Anywhere the evaluator is a downstream system, a database constraint, or a person who treats official guidance as authoritative, variance is a bug.</p><h2>Downstream systems change the risk profile entirely</h2><p><strong>The fourth question is: is there a system consuming this output directly, without human review in between?</strong></p><p>This is the difference between a soft consumer and a hard consumer. A human reading an AI-generated summary is a soft consumer. They bring judgment. They catch obvious errors. They can push back.</p><p>An API endpoint that ingests an AI output and triggers an action is a hard consumer. It has no judgment. It cannot catch errors. It will execute faithfully on whatever it receives.</p><p>The NYC chatbot consumer was human but effectively hard. A small business owner consulting an official government tool on a compliance question is not going to fact-check the answer. The government branding collapsed the soft consumer into something that behaved like a hard one. The output fed directly into a decision with no meaningful review layer in between.</p><p>When AI feeds a hard consumer, the error characteristics of the system change completely. Errors no longer surface through human review. They surface through system failures, audit findings, or in this case, investigative journalism.</p><h2>Compliance exposure is not a legal problem</h2><p><strong>The fifth question: is this workflow subject to audit or regulatory review?</strong></p><p>This gets treated as a legal team concern. It is not. It is an engineering and product concern that shows up as a legal problem later.</p><p>Regulated workflows require explainability. They require audit trails. They require the ability to reconstruct exactly what the system did and on what basis. 
Current LLM architectures are not designed to produce that by default. You can build it in, but it requires deliberate design choices made before deployment, not retrofitted after a regulator asks a question.</p><p>The EU AI Act classifies certain AI uses as high-risk. Legal and regulatory guidance is squarely in that scope. High-risk classification does not mean prohibited. It means the compliance burden is significant enough that you need to answer all five questions before deployment, not after the headline.</p><h2>The real question behind the questions</h2><p>Run these five questions against any workflow you are evaluating for AI automation:</p><ol><li><p>What happens if this output is wrong? (blast radius)</p></li><li><p>How would you know it was wrong, and how quickly? (detectability)</p></li><li><p>Does the right answer change between runs or is there one correct answer? (variance tolerance)</p></li><li><p>Is there a downstream system or a trusting human consuming this output directly? (soft vs. hard consumer)</p></li><li><p>Is this workflow subject to audit or regulatory review? (compliance exposure)</p></li></ol><p>A workflow with high blast radius, low detectability, zero variance tolerance, a hard consumer, and regulatory exposure is not a workflow where you deploy an AI agent today.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.stevewhittle.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Same Input, Different Answer ]]></title><description><![CDATA[You Didn't Buy Software. You Hired a Consultant.]]></description><link>https://www.stevewhittle.net/p/same-input-different-answer</link><guid isPermaLink="false">https://www.stevewhittle.net/p/same-input-different-answer</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Sun, 10 May 2026 23:16:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4GGH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce35a83-47a5-40dc-8e21-7b38c1cf4bb7_2816x1536.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4GGH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce35a83-47a5-40dc-8e21-7b38c1cf4bb7_2816x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4GGH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce35a83-47a5-40dc-8e21-7b38c1cf4bb7_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!4GGH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce35a83-47a5-40dc-8e21-7b38c1cf4bb7_2816x1536.heic 
848w, https://substackcdn.com/image/fetch/$s_!4GGH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce35a83-47a5-40dc-8e21-7b38c1cf4bb7_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!4GGH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce35a83-47a5-40dc-8e21-7b38c1cf4bb7_2816x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4GGH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce35a83-47a5-40dc-8e21-7b38c1cf4bb7_2816x1536.heic" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bce35a83-47a5-40dc-8e21-7b38c1cf4bb7_2816x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:309346,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/197151245?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce35a83-47a5-40dc-8e21-7b38c1cf4bb7_2816x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4GGH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce35a83-47a5-40dc-8e21-7b38c1cf4bb7_2816x1536.heic 424w, 
https://substackcdn.com/image/fetch/$s_!4GGH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce35a83-47a5-40dc-8e21-7b38c1cf4bb7_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!4GGH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce35a83-47a5-40dc-8e21-7b38c1cf4bb7_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!4GGH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce35a83-47a5-40dc-8e21-7b38c1cf4bb7_2816x1536.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Most businesses deploying LLMs are treating them as a new category of software. The governance model they&#8217;re importing (procurement checklists, SLAs, acceptance testing, audit trails) assumes a simple contract: same input produces same output, every time, reliably enough to stake compliance on.</p><p>LLMs are probabilistic. This breaks the model.</p><h2>The setting that doesn&#8217;t do what you think it does</h2><p>The standard response to LLM nondeterminism is to set temperature to zero. That&#8217;s supposed to make the model deterministic. It doesn&#8217;t.</p><p><a href="https://aclanthology.org/2025.eval4nlp-1.12/">Researchers at ACL&#8217;s Eval4NLP workshop</a> published empirical results showing accuracy variations of up to 15% across runs with supposedly stable settings. The best-to-worst performance gap on some tasks reached 70%. Thread scheduling, parallel inference, mixture-of-experts routing, and server-side model updates all introduce variance that temperature settings cannot touch. Temperature=0 means you asked for determinism. 
It does not mean you got it.</p><p>Now chain three agents together, each operating at 90% reliability. Your combined system accuracy is around 73% (0.9 &#215; 0.9 &#215; 0.9 &#8776; 0.73, assuming each agent fails independently). Add a fourth agent and you&#8217;re below 66%. Probabilistic error compounds through a pipeline the same way latency does. <a href="https://arxiv.org/abs/2602.16666">Princeton&#8217;s Language and Intelligence Lab</a> studied real-world agent failures against benchmark performance and found this gap is systematic. Not anomalous.</p><p>This is not fixable with better prompting. It&#8217;s architectural.</p><p>Temperature is just the most visible example of a broader pattern. Every instinct your team has about fixing nondeterminism will reach for a deterministic lever: tighter prompts, stricter validation, more retries. These are reasonable engineering responses to a problem that is not an engineering problem. The variance is not a bug you can tune out. It is a property of the system. That distinction matters because it changes what a correct response looks like. You can&#8217;t fix it. You need to govern it.</p><h2>The contract nobody disclosed</h2><p>Enterprise software has always carried an implicit contract: behavior is repeatable, attributable, and auditable. SOX compliance assumes it. SLAs codify it. Acceptance testing verifies it. When your ERP runs payroll, you expect identical outputs for identical inputs. The entire compliance stack is built on that expectation.</p><p>When you deploy an LLM into a business process, you break that contract.</p><p><a href="https://www.gartner.com/en/articles/hype-cycle-for-artificial-intelligence">Gartner&#8217;s AI Hype Cycle 2025</a> explicitly calls nondeterminism a core enterprise AI risk and projects that more than 40% of agentic AI projects will be canceled by 2027. A 
A <a href="https://www.gartner.com/en/newsroom/press-releases/2026-03-30-gartner-predicts-by-2028-explainable-ai-will-drive-llm-observability-investments-to-50-percent-for-secure-genai-deployment">separate Gartner prediction from March 2026</a> goes further: without explainability foundations, GenAI will be restricted to low-risk tasks only.</p><p>The internal consequence is concrete. Your finance agent interprets policy differently on Tuesday than it did on Monday. Your audit trail stops being a record and becomes a probability distribution. <a href="https://arxiv.org/abs/2601.22290">This Six Sigma Agent paper</a> cites an MIT GenAI Divide Report finding that 95% of enterprise GenAI implementations fail to meet production expectations. That failure rate is not an implementation problem. It&#8217;s a mismatch between what the technology is and what organizations expect it to be.</p><h2>Manage it like professional judgment, not like software</h2><p>The right response to this is not to lower expectations. That&#8217;s too passive and too vague to be useful.</p><p>The right move is to adopt a different governance model entirely. One closer to how organizations manage professional judgment than how they manage software.</p><p>When you hire a consultant or engage outside legal counsel, you don&#8217;t write an SLA that promises identical advice on every engagement. You build in review gates, approval layers, and accountability structures. You treat the output as a recommendation, not a transaction. The answer might differ based on context and so you have a human in the loop before that recommendation becomes an action.</p><p>That is the correct mental model for an LLM. <a href="https://arxiv.org/abs/2603.24582">The Stochastic Gap paper from researchers at UMBC and MIT</a> formalizes this. 
Using a Markov framework applied to a real procurement workflow across 251,000 cases, they show that the mismatch between deterministic enterprise workflow assumptions and probabilistic AI behavior is structural, not incidental.</p><p>The consultant analogy is useful but it only gets you so far. The more precise frame is <strong>professional judgment under institutional accountability</strong>. Law firms, audit practices, and medical institutions have spent decades building governance structures for exactly this problem. Expert output varies by practitioner, by day, and by context. It still needs to be defensible, traceable, and bounded by institutional policy.</p><p>Those structures share three characteristics that software governance does not.</p><p><strong>Output is reviewed before it becomes a commitment.</strong> A legal opinion goes through a partner review before it&#8217;s sent. A diagnostic recommendation goes to an attending doctor before it reaches a patient. The expert produces the output; the institution validates it before it has consequences. Most LLM deployments skip this entirely. The model generates; the system acts. There is no partner review layer.</p><p><strong>Variance is bounded by policy, not eliminated by process.</strong> You don&#8217;t solve variance in professional judgment by making every practitioner identical. You solve it by defining the boundaries within which variation is acceptable. Then you build escalation paths for cases that fall outside them. A tax advisor can give different guidance to different clients with different circumstances. That&#8217;s appropriate variance. An LLM giving different legal interpretations to identical queries in the same product is not. 
The governance question is whether you&#8217;ve defined the difference.</p><p><strong>Accountability is personal and institutional simultaneously.</strong> When a lawyer gives bad advice, liability flows in two directions: to the individual practitioner and to the firm. Enterprise AI deployments are constructing a version of this whether they intend to or not. In 2024, a Canadian tribunal held Air Canada liable for incorrect information its chatbot gave a customer. The airline argued the bot was a separate legal entity outside their responsibility. The tribunal rejected that defense. Courts are already resolving institutional liability for AI output, and they are not resolving it in the operator&#8217;s favor. The question is whether your governance structure reflects that reality.</p><p><a href="https://arxiv.org/abs/2601.04170">This Agent Drift paper</a> introduces a useful concept here: the Agent Stability Index. It attempts to quantify behavioral consistency across model invocations. Not just whether the model gets the right answer, but whether it behaves in a predictable and bounded way over time. That&#8217;s the right unit of measurement for governance. Not accuracy but stability.</p><p>This reframes the deployment decision. The question is not &#8220;is this model accurate enough?&#8221; Accuracy is necessary but insufficient. The real question is: is this model stable enough, in a defined scope, with defined escalation paths, for this specific process? That&#8217;s how you evaluate a new hire in a professional services firm. 
It should be how you evaluate an LLM deployment.</p><p>Before deploying an LLM into any business process, three questions need answers:</p><ol><li><p>Does this process require the same answer every time for legal, compliance, or contractual reasons?</p></li><li><p>Is there a human review layer between the LLM output and the consequential action?</p></li><li><p>If the LLM gives two different answers to the same question, which one are you liable for?</p></li></ol><p>If you can&#8217;t answer those, you&#8217;re not ready to deploy. But there&#8217;s also a fourth question: who inside your organization owns the answer to all three? Software procurement has a buyer, a vendor, and a contract. Professional judgment has a supervising principal who is accountable for the output. LLM deployments currently have neither. They have a model, a prompt, and an assumption that someone else is responsible.</p><p>That assumption is what the courts are now testing.</p><h2>The expectation mismatch has standing in court</h2><p>The Air Canada case makes this concrete in a way that no whitepaper can.</p><p>In <a href="https://www.theguardian.com/world/2024/feb/16/air-canada-chatbot-bereavement-fares-ruling">Moffatt v. Air Canada</a> (BC Civil Resolution Tribunal, 2024), Air Canada&#8217;s chatbot gave a customer incorrect information about bereavement fares. The customer relied on it and tried to claim the discount. Air Canada&#8217;s defense was that the chatbot was a &#8220;separate legal entity&#8221; and not their responsibility.</p><p>The tribunal rejected this. Air Canada was held liable for the advice its AI gave.</p><p>The company deployed a probabilistic system into a customer-facing context where both the law and the customer assumed determinism. That expectation mismatch wasn&#8217;t just an internal governance failure. 
It had standing in court.</p><p>This is the trajectory for any business deploying LLMs into consequential interactions without understanding what they&#8217;ve actually built.</p><p>We are now in an era where companies are building legal and contractual liability on systems that cannot guarantee the same answer twice. The technology isn&#8217;t the problem. The category error is.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.stevewhittle.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Counting Tokens]]></title><description><![CDATA[Optimizing token usage in applications]]></description><link>https://www.stevewhittle.net/p/counting-tokens</link><guid isPermaLink="false">https://www.stevewhittle.net/p/counting-tokens</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Thu, 07 May 2026 20:37:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yHoG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F182d84dc-e778-4358-bf9c-675db91a3c99_2816x1536.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!yHoG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F182d84dc-e778-4358-bf9c-675db91a3c99_2816x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yHoG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F182d84dc-e778-4358-bf9c-675db91a3c99_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!yHoG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F182d84dc-e778-4358-bf9c-675db91a3c99_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!yHoG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F182d84dc-e778-4358-bf9c-675db91a3c99_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!yHoG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F182d84dc-e778-4358-bf9c-675db91a3c99_2816x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yHoG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F182d84dc-e778-4358-bf9c-675db91a3c99_2816x1536.heic" width="1456" height="794" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/182d84dc-e778-4358-bf9c-675db91a3c99_2816x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:569568,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/196740501?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F182d84dc-e778-4358-bf9c-675db91a3c99_2816x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yHoG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F182d84dc-e778-4358-bf9c-675db91a3c99_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!yHoG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F182d84dc-e778-4358-bf9c-675db91a3c99_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!yHoG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F182d84dc-e778-4358-bf9c-675db91a3c99_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!yHoG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F182d84dc-e778-4358-bf9c-675db91a3c99_2816x1536.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I&#8217;ve been building AI-powered features for the past several months, and one thing that became very clear very quickly is that token costs can&#8217;t be an afterthought. They&#8217;re a design constraint. When you&#8217;re running a healthcare platform with RAG pipelines, context-heavy prompts, and multiple LLM calls per user interaction, the bill is a direct reflection of how carefully you thought about what you&#8217;re actually sending to the model.</p><p>Most of the conversations I&#8217;ve seen about LLM cost optimization focus on the easy answer: pick a cheaper model. That&#8217;s the wrong starting point. Before you reach for a smaller model, you need to understand where your tokens are actually going. 
Otherwise, you&#8217;re optimizing blind.</p><div><hr></div><h2>First, understand your token budget</h2><p>The single most useful shift in mindset is moving from &#8220;which model is cheaper?&#8221; to <strong>&#8220;how many tokens does this job actually need?&#8221;</strong></p><p>Every LLM call has a token budget made up of roughly four buckets: system prompt, user/conversation input, retrieved context (if you&#8217;re doing RAG), and model output. Most engineering teams can tell you which model they&#8217;re on. Not many of them can tell you what percentage of their token spend is coming from each bucket, per endpoint, per feature, per user segment.</p><p>That&#8217;s the first thing to fix. Instrument your app to log tokens in and out at the endpoint level. You&#8217;ll often find a rough 80/20 pattern, where about 20% of your calls are responsible for about 80% of your token spend, and those hotspots are where optimization actually pays off. Spending a week shaving 10% off a low-volume admin workflow is a distraction. Shaving 30% off your highest-volume user-facing endpoint is much more meaningful.</p><div><hr></div><h2>The input side</h2><p>Once you know where the tokens are going, the levers on the input side are fairly well understood. 
The question is which ones are worth pulling.</p><p><strong>Tighten your prompts.</strong> This sounds obvious but most system prompts accumulate cruft over time. Redundant framing, hedging language, politeness conventions the model doesn&#8217;t need. Audit yours with fresh eyes. &#8220;Summarize in 3 bullets&#8221; costs fewer tokens than a paragraph explaining what a good summary looks like. Directive beats descriptive.</p><p><strong>Split your system prompt.</strong> If you have a large system prompt covering multiple capabilities, stop sending all of it on every call. Break it into a core section and optional sections, and load only what&#8217;s relevant for the task at hand. A prompt that&#8217;s always 3,000 tokens because it contains instructions for five features, when any given call only needs one of them, is something you can fix.</p><p><strong>Tune your RAG retrieval.</strong> RAG is great, but undisciplined RAG drags enormous amounts of context into every call. Smaller chunk sizes, fewer retrieved chunks, a reranker to prioritize the most relevant passages, and deduplication to eliminate near-identical chunks all compound. You often don&#8217;t need half the document. You need the three paragraphs that actually answer the question.</p><p><strong>Prune conversation history.</strong> Long chat threads accumulate fast. Rather than passing the full conversation on every turn, summarize earlier turns and keep a running state-of-the-world summary plus the last N interactions. The model doesn&#8217;t need a verbatim transcript of everything that happened three turns ago.</p><p><strong>Preprocess before tokenizing.</strong> Strip content before it hits the model. Repeated disclaimers, HTML chrome, email signatures, standard headers. 
Anything that&#8217;s structurally consistent and informationally irrelevant to the task should be removed in code, not sent to the model and silently ignored.</p><div><hr></div><h2>The output side</h2><p>Output tokens get less attention, but they&#8217;re directly controllable.</p><p>Explicitly bound your responses. &#8220;At most 5 bullet points&#8221; or &#8220;under 150 words&#8221; is more effective than &#8220;be concise.&#8221; An instruction like &#8220;be concise&#8221; is a suggestion; an explicit limit is a constraint. Use structured outputs (JSON, fixed schemas) wherever the output is machine-consumed. If you only need a label, a boolean, or a category, don&#8217;t ask for prose.</p><p>For user-facing answers, consider a layered approach: get a compact answer first, and only fetch a longer explanation on demand. Most users don&#8217;t need the full response most of the time.</p><div><hr></div><h2>The infrastructure layer</h2><p>Beyond prompt engineering, there are a few architectural levers that can meaningfully change the economics.</p><p><strong>Prompt caching.</strong> If your provider supports it, turn it on. Repeated system prompts and long shared documents shouldn&#8217;t be re-billed at full price on every call. This is one of the highest-leverage, lowest-effort optimizations available. The work is mostly configuration, not engineering.</p><p><strong>Semantic caching.</strong> For support questions, FAQs, and any domain with high query repetition, a semantic cache at the application layer can eliminate a large percentage of redundant model calls entirely. Match each incoming query against past queries by vector similarity, then reuse or lightly edit the cached response when the match is close enough.</p><p><strong>Model routing.</strong> Not all calls are equal. Route simple, well-defined tasks to smaller, cheaper models by default. Reserve the premium models for complex reasoning, high-stakes content, or tasks where quality differences are genuinely user-visible. The key is measuring this. 
A cheaper model that needs retries or follow-up calls can easily cost more than the premium model would have.</p><div><hr></div><h2>What doesn&#8217;t work</h2><p>A few things that seem like good ideas and aren&#8217;t:</p><p>Adding &#8220;be concise&#8221; to every prompt and expecting large savings. It helps a little. The variance is high and the model will still expand if the rest of your prompt gives it room to.</p><p>Blindly truncating inputs. &#8220;Just take the first 4K tokens&#8221; sounds pragmatic but often drops critical context and degrades quality in ways that are hard to notice until something goes wrong.</p><p>Naive string compression: stripping punctuation, collapsing whitespace. This makes text less legible to the model and can actually <em>increase</em> token count due to how subword tokenization works. The model&#8217;s tokenizer and your intuitions about &#8220;shorter text&#8221; don&#8217;t always agree.</p><p>Swapping to a cheaper model without measuring tokens-per-task and quality first. A weaker model that requires multiple retries or produces output that needs downstream correction can easily end up costing more than staying on the better one.</p><div><hr></div><h2>A note on TOON and structured formats</h2><p>There&#8217;s a newer format worth knowing about called TOON (Token-Oriented Object Notation). The idea is to strip JSON&#8217;s syntactic overhead, quotes, braces, repeated keys, by declaring keys once, like a header row. For flat, table-like data (user lists, product catalogs, RAG reference chunks, uniform agent outputs), benchmarks report roughly 30-60% prompt-token reduction with equal or slightly better accuracy on structured retrieval tasks.</p><p>It&#8217;s a real technique, but it comes with caveats. For deeply nested or irregular objects, the savings shrink or reverse. 
JSON is deeply embedded in LLM training data; TOON is a format the model hasn&#8217;t seen nearly as much, which introduces some fragility without fine-tuning. You&#8217;re also adding a serialization layer, and the converters, validation, and debugging in a nonstandard format all carry real engineering cost.</p><p>The right framing for TOON: keep JSON in your APIs and storage. Use TOON only at the LLM boundary, and only when your prompts are heavy on structured data and the token savings are large enough to justify the complexity. It&#8217;s a specialized tool, not a universal one.</p><div><hr></div><h2>When not to optimize</h2><p>For small documents, low-volume internal tools, or safety-critical tasks, it&#8217;s often better to overspend on tokens than to risk degraded quality or missed edge cases. Token optimization has a cost in engineering time, added complexity, and potential quality tradeoffs. The ROI only makes sense at the hotspots.</p><p>The teams that get this right aren&#8217;t the ones that optimize everything. They&#8217;re the ones that instrument first, identify the real cost drivers, and apply targeted effort where it makes the biggest difference.</p><p>Token spend is a design output. If you&#8217;re surprised by your LLM bill, the answer isn&#8217;t a cheaper model. It&#8217;s a closer look at what you&#8217;re asking the model to do, and why.</p>
]]></content:encoded></item><item><title><![CDATA[AI Agent Drift]]></title><description><![CDATA[Do you know when your Agent starts to go wrong?]]></description><link>https://www.stevewhittle.net/p/ai-agent-drift</link><guid isPermaLink="false">https://www.stevewhittle.net/p/ai-agent-drift</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Wed, 06 May 2026 02:35:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HZL8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe59583-c056-43fc-85ab-b05f36c2d9b8_2816x1536.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HZL8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe59583-c056-43fc-85ab-b05f36c2d9b8_2816x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HZL8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe59583-c056-43fc-85ab-b05f36c2d9b8_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!HZL8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe59583-c056-43fc-85ab-b05f36c2d9b8_2816x1536.heic 848w, 
https://substackcdn.com/image/fetch/$s_!HZL8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe59583-c056-43fc-85ab-b05f36c2d9b8_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!HZL8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe59583-c056-43fc-85ab-b05f36c2d9b8_2816x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HZL8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe59583-c056-43fc-85ab-b05f36c2d9b8_2816x1536.heic" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbe59583-c056-43fc-85ab-b05f36c2d9b8_2816x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:448978,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/196613168?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe59583-c056-43fc-85ab-b05f36c2d9b8_2816x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HZL8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe59583-c056-43fc-85ab-b05f36c2d9b8_2816x1536.heic 424w, 
https://substackcdn.com/image/fetch/$s_!HZL8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe59583-c056-43fc-85ab-b05f36c2d9b8_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!HZL8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe59583-c056-43fc-85ab-b05f36c2d9b8_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!HZL8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbe59583-c056-43fc-85ab-b05f36c2d9b8_2816x1536.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I run a competitive research workflow on a regular cadence. I use the same prompt, the same tools, and the same intent. I want to map the landscape, surface new entrants, and track what the competition is up to. The first few runs looked good. Then something started shifting. Not the market; our competitors in this space don&#8217;t move that fast. The outputs were different. Different companies surfaced. Different framing of the same players. Different conclusions from largely the same inputs.</p><p>The agent hadn&#8217;t broken. It was still producing plausible, well-structured research. It just wasn&#8217;t producing <em>consistent</em> research.</p><p>The gap between an agent that performs well and an agent that performs reliably is not something I hear talked about very often.</p><h2>Why the demo always works</h2><p>LLMs don&#8217;t retrieve. They generate. Every output is a sample from a probability distribution shaped by the prompt, the context window, and the model version. In a single-turn interaction this rarely matters. 
The variation is small and the output is usually close enough.</p><p>Agentic systems change the equation. Outputs become inputs to the next step. Tools get called, state accumulates, and decisions made in step two shape what is possible in step six. Small variations compound. The agent that searched slightly differently in step two is now summarizing different sources in step four. It draws different conclusions in step six. This is not a bug in any conventional sense. It&#8217;s the nature of the beast.</p><p>A demo is a single run. It proves the agent can do the task. It says nothing about whether the agent will do the task the same way tomorrow, after a prompt change, or after the underlying model gets quietly updated by the provider.</p><h2>Reliability is not following capability</h2><p>Recent research makes this concrete. A January 2026 simulation study posted to arXiv (<a href="http://arxiv.org/abs/2601.04170">arxiv.org/abs/2601.04170</a>) defines <strong>agent drift</strong> as the progressive degradation of behavior, decision quality, and coherence over extended interactions. It identifies three distinct forms: semantic drift, where outputs deviate from original intent; coordination drift, where multi-agent coherence breaks down; and behavioral drift, where the agent develops unintended strategies over time. Using theoretical modeling across simulated enterprise workflows, the study projects that unchecked drift could lead to task success rates dropping by over 40% and human intervention requirements tripling. These are projected figures from simulation, not measured production outcomes. But the underlying framework for why the failure mode compounds rather than plateaus is well constructed.</p><p>A large-scale empirical survey from UC Berkeley, Stanford, UIUC, and IBM Research (arXiv:2512.04123) gives the clearest picture of how practitioners are responding. 
Of 306 practitioners surveyed, 68% keep their deployed agents to at most 10 steps before human intervention. 70% adjust the prompts rather than fine-tuning the model. 74% rely primarily on human evaluation. The researchers frame this as a deliberate paradox: reliability is the top development challenge, yet agents are reaching production. The resolution is that teams are not waiting for the reliability problem to be solved. They ship by limiting what agents can do: constrained autonomy, sandboxed environments, internal deployment first. It works. But it&#8217;s a workaround, not a solution.</p><h2>Not all agents carry the same risk</h2><p>Agents can be broken down into two main categories. The risk profile for each is different.</p><p><strong>Bounded agents</strong> are invoked for a discrete task, produce an output, and hand off to a human or downstream process. Examples include a Cursor session writing a function or a one-shot document summary. The scope is defined. The output is reviewable. Failures are localized. The risk these agents carry stays constant over time and is largely tied to the quality of human review.</p><p><strong>Ambient agents</strong> run continuously and make ongoing judgment calls without a hard stop. Examples include inbox triage and continuous competitive monitoring: basically any workflow where the agent decides what matters and acts on it repeatedly, without a human checkpoint between decisions.</p><p>My competitive research workflow sits between these two. It is repeatable rather than truly continuous, but the expectation of consistency is the same. When I run it on Monday and again in three weeks, I expect the differences in output to reflect differences in the market, not differences in the agent. That&#8217;s not what happened.</p><p>McKinsey&#8217;s 2025 global survey found that 62% of organizations are at least experimenting with AI agents. Only 23% are scaling one in production. The gap between experimentation and scale is not a capability gap. 
It is a trust, observability, and governance gap.</p><h2>If you&#8217;re running long-horizon agents, here&#8217;s what helps</h2><p>The research and the emerging vendor landscape have converged on a set of mitigation approaches. None of them eliminate the problem but they can reduce risk.</p><p><strong>Context management.</strong> One of the least visible failure modes in long-running agents is context drift. As conversation history grows, reasoning quality degrades before you ever hit a context limit. The industry has settled on episodic consolidation: periodically compressing older context into structured summaries while preserving recent and relevant state. The Agent Drift paper identifies this as one of three mitigation strategies with the strongest theoretical grounding. Anthropic now ships a native compaction API that automates the loop.</p><p><strong>Uncertainty-aware memory.</strong> A January 2026 paper from Salesforce AI Research calls the core failure mechanism in long-horizon agents the Spiral of Hallucination. A small grounding error in an early step gets committed to the agent&#8217;s context. It then becomes a false premise for every subsequent step. Standard self-reflection does not reliably catch this. The model has already accepted the error as ground truth. The proposed fix flags low-confidence steps before they propagate and triggers correction only when needed. Early results showed meaningful reliability improvements on multi-step benchmarks. This is early research. But it is getting at the cause rather than the symptom.</p><p><strong>Checkpointing and interrupt design.</strong> Orchestration frameworks like LangGraph have built explicit checkpointing into their execution model. Agents are defined as directed graphs with typed state and hard interrupt points. A human can review, approve, or reset to a known-good checkpoint at any of those points. This converts a brittle autonomous system into a collaborative one. 
Carnegie Mellon benchmarks published in late 2025 found that leading agents complete only 30-35% of multi-step tasks successfully. This suggests that uninterrupted autonomous execution is not the right default for complex workflows.</p><p><strong>Golden dataset evaluation.</strong> This approach maps most directly to my competitive research problem and our product work. Create a set of representative inputs with human-verified expected outputs. Then run your agent against that dataset on a schedule and before any prompt change goes to production. AWS introduced tooling for this at re:Invent 2025 with the general availability of Bedrock AgentCore Evaluations: 13 built-in evaluators, CI/CD pipeline integration for pre-deployment gates, and continuous online evaluation against live production traffic. A demo showed the service detecting tool selection accuracy dropping from 0.91 to 0.3 in production. Without continuous measurement, that degradation is invisible.</p><p>Pushpay documented a real production implementation of this pattern. Their golden dataset covers over 300 representative queries with validated responses. It is continuously curated from actual user interactions and fed into an engineering dashboard. The key word is <em>continuously</em>. A golden dataset that does not evolve with your actual workload tests against past state, not current state.</p><p>Beyond AWS, the commercial tooling has matured fast. Braintrust ties production traces and offline experiments to the same scorer library. A production regression automatically seeds the next test cycle. LangSmith integrates human annotation queues with trace replay, letting engineers convert production failures into evaluation cases. Arize offers always-on drift detection at the session and span level. For teams with HIPAA or data residency constraints, Langfuse is the strongest self-hosted open-source option. 
It was acquired by ClickHouse in January 2026, but the open-source codebase remains active.</p><p>None of this is free. Building and maintaining a golden dataset requires human judgment to define what &#8220;correct&#8221; looks like for open-ended tasks. That is genuinely hard when correctness is partly subjective. Dataset rot is a real risk. The infrastructure to run evaluations continuously has real cost. The tooling can solve the infrastructure problem, but the curation problem is still yours.</p><p>For my competitive research workflow, the approach is well suited. The expected output structure is defined even if the specific content varies. I know what a well-formed competitive analysis looks like. I can score for completeness, source coverage, and structural consistency without specifying exact content in advance. That is an easier evaluation target than most ambient agent tasks.</p><h2>The durability problem</h2><p>The industry has gotten very good at demonstrating what agents can do. It has not gotten as good, so far, at ensuring they keep doing it the same way.</p><p>Gartner projects that over 40% of agentic AI projects will be canceled by the end of 2027. Inadequate risk controls are cited as one of the main causes. That figure will land as a surprise to anyone who has only ever evaluated their agents at a single point in time.</p><p>Narrow, monitored, bounded agents are viable today if you build them with that constraint in mind. Always-on autonomous agents are still waiting on better reliability science, better evaluation tooling, and more organizational honesty about the governance they require.</p><p>The question worth asking before deploying any agent is not &#8220;can it do the task.&#8221; It is whether you can tell when it starts doing the task differently than it did before. 
And whether you would know before your users do.</p>]]></content:encoded></item><item><title><![CDATA[Coding interviews in the AI world]]></title><description><![CDATA[The interview should test for the job the candidate will actually do]]></description><link>https://www.stevewhittle.net/p/coding-interviews-in-the-ai-world</link><guid isPermaLink="false">https://www.stevewhittle.net/p/coding-interviews-in-the-ai-world</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Sat, 02 May 2026 19:57:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TUIj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe0f4b3-168a-45d9-8a03-50bc538498ba_2816x1536.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TUIj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe0f4b3-168a-45d9-8a03-50bc538498ba_2816x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!TUIj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe0f4b3-168a-45d9-8a03-50bc538498ba_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!TUIj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe0f4b3-168a-45d9-8a03-50bc538498ba_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!TUIj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe0f4b3-168a-45d9-8a03-50bc538498ba_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!TUIj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe0f4b3-168a-45d9-8a03-50bc538498ba_2816x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TUIj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe0f4b3-168a-45d9-8a03-50bc538498ba_2816x1536.heic" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfe0f4b3-168a-45d9-8a03-50bc538498ba_2816x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:725743,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/196254244?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe0f4b3-168a-45d9-8a03-50bc538498ba_2816x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TUIj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe0f4b3-168a-45d9-8a03-50bc538498ba_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!TUIj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe0f4b3-168a-45d9-8a03-50bc538498ba_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!TUIj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe0f4b3-168a-45d9-8a03-50bc538498ba_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!TUIj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfe0f4b3-168a-45d9-8a03-50bc538498ba_2816x1536.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" 
fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Are you testing the wrong thing?</h1><p>The interview process is supposed to predict job performance. When the conditions of the interview bear no resemblance to the conditions of the job, you&#8217;re not predicting anything. You&#8217;re running a different experiment and hoping the results transfer.</p><p>They don&#8217;t.</p><h2>What we actually test when we ban AI</h2><p>The stated rationale for AI-free coding interviews is reasonable on its surface. We want to see how candidates think. We want to know if they actually understand the problem. We don&#8217;t want someone to paste a prompt and copy the solution.</p><p>The concern is legitimate. 
The conclusion is wrong.</p><p>Here&#8217;s what an AI-free whiteboard session actually measures: the ability to hold syntax, algorithm structure, and edge cases in working memory simultaneously, under pressure, without tooling, in an artificial environment. That is a real cognitive skill. It just isn&#8217;t the one that determines whether someone will perform well on your team in 2026.</p><p>The cognitive load of software development has shifted. The job now requires a different kind of thinking: decomposing a problem into chunks small enough to prompt effectively, reading AI output critically before accepting it, knowing when the model is hallucinating and why. A candidate who can recite a BFS implementation from memory but blindly accepts a subtly wrong AI-generated solution is a worse hire than one who forgets the exact syntax but immediately spots the flaw in what the model produced.</p><p>We&#8217;ve been measuring recall. We should be measuring judgment.</p><h2>The industry has figured this out</h2><p>In October 2025, Meta began rolling out an AI-enabled coding interview that replaces one of the two traditional coding rounds at the onsite stage. The internal framing is revealing. The format was designed to be more representative of the actual developer environment and also to make LLM-based cheating less effective.</p><p>The problems Meta uses are designed with AI assistance in mind. They&#8217;re harder than a traditional coding question. The bar for what a candidate is expected to produce is higher precisely because they have help. Candidates work in a multi-file codebase they didn&#8217;t write and have to understand quickly. 
Prompt-and-paste fails immediately in that environment because understanding the existing architecture is the prerequisite for everything else.</p><p>CodeSignal launched AI-assisted coding assessments in May 2025 with a feature that matters more than the AI access itself: a full transcript of every candidate-AI interaction alongside a session replay. You&#8217;re not just seeing what the candidate produced. You&#8217;re watching how they think.</p><p>HackerRank moved in the same direction. Candidates work with AI tools in a controlled environment and interviewers get a detailed view of the problem-solving process, not just the output.</p><h2>What an AI-free interview cannot show you</h2><p>When I think about what I actually need to know about a candidate, four things matter that a no-AI interview cannot surface.</p><p><strong>Prompt quality as a diagnostic.</strong> The way someone frames a problem for an AI is a direct readout of how they think about the problem. Vague prompts reveal vague thinking. A candidate who writes &#8220;fix my function&#8221; tells you something different from one who writes &#8220;this recursive function is hitting a stack overflow on inputs above n=1000, here&#8217;s the current implementation, what&#8217;s the likely cause.&#8221; The second candidate has already diagnosed the problem. They&#8217;re using the AI to confirm and implement. That&#8217;s engineering judgment.</p><p><strong>Verification instinct.</strong> One E7 candidate at Meta watched Claude Sonnet repeatedly hallucinate on a maze problem. The question isn&#8217;t whether the AI was wrong. The question is whether the candidate caught it. Did they know what correct looked like before the model answered? Did they push back? 
A candidate who accepts wrong output without question is a risk that an AI-free interview will never expose, because you never gave them AI output to accept.</p><p><strong>Task decomposition.</strong> Candidates who performed well in Meta&#8217;s AI-enabled format guided the AI incrementally rather than asking for wholesale solutions. One successful candidate described her approach: start with the core logic as a single function, review it, then build out from there. That instinct, to keep the scope small enough to verify at each step, is exactly how good engineers approach complex problems. It&#8217;s invisible in a no-AI environment because there&#8217;s no assistant to decompose the task for.</p><p><strong>Communication under ambiguity.</strong> Meta&#8217;s internal evaluation criteria for this round include a phrase that functions as an answer key: &#8220;Should use AI, but need to show you understand the code. Explain the output. Test before using. Don&#8217;t prompt your way out of it.&#8221; That is a rubric for thinking, not for tool use. The candidate is being evaluated on whether they can narrate a reasoning process in real time, hold a conversation with an interviewer while working with an AI assistant, and remain the accountable decision-maker throughout. That skill matters on the job. AI-free interviews don&#8217;t test it.</p><h2>The gaming problem is real but solvable</h2><p>The obvious objection is that AI-assisted interviews are easier to game. A fast prompter with shallow understanding can look strong if the evaluation is just the output.</p><p>That&#8217;s true. But it&#8217;s a design problem, not a fatal flaw.</p><p>Interaction transcripts solve most of it. When you can see the full sequence of what a candidate asked for and how they responded to what they got, shallow prompters reveal themselves quickly. They ask for too much at once. They accept the first answer without testing it. 
They can&#8217;t explain the code when asked.</p><p>Multi-file codebases with staged checkpoints solve the rest. A problem that requires understanding existing architecture before making any change can&#8217;t be solved by pasting a description into a chat window. The AI doesn&#8217;t have the context. The candidate has to build it, which means they have to understand it.</p><p>The gaming risk in no-AI interviews is equally real and far less visible. A candidate who has memorized LeetCode patterns looks identical to one who genuinely understands algorithms. At least AI transcripts expose the reasoning process. A whiteboard session shows you the answer. A session replay shows you how the candidate thinks.</p><h2>What to change before your next hire</h2><p>None of this requires overhauling your entire process. Three things make the difference.</p><p>Redesign the problem before you change the rules. A standard LeetCode question with AI access is still a bad question. The problem needs to be complex enough that AI assistance is a navigation tool rather than a solution dispenser. Ambiguous requirements, existing codebases, staged checkpoints. Problems designed so that understanding is the prerequisite for prompting.</p><p>Make the AI interaction visible and gradable. If you&#8217;re using a platform, use one that captures transcripts. If you&#8217;re running your own interviews, ask candidates to narrate their prompts out loud and explain what the AI gave them before they act on it. That narration is the interview.</p><p>Keep a short no-AI segment with a clear purpose. Baseline fundamentals still matter. A candidate who can&#8217;t read a stack trace or reason about complexity without assistance is a real risk. A focused no-AI segment to test that floor is legitimate. Don&#8217;t treat that floor as the whole evaluation.</p><h2>The prediction model you&#8217;re running</h2><p>Every hiring process is a prediction model. The inputs are interview signals. 
The output is a forecast of job performance. When the inputs don&#8217;t reflect job conditions, the model is broken.</p><p>Running AI-free coding interviews for engineers who will spend their careers working alongside AI tools is like running driving tests with no steering wheel because you want to assess balance. The rationale sounds defensible. The instrument is wrong.</p><p>The industry has started correcting. The question is whether your process has.</p>]]></content:encoded></item><item><title><![CDATA[The bill is coming due - AI coding vendor lock-in]]></title><description><![CDATA[Relying on subsidized token pricing is risky]]></description><link>https://www.stevewhittle.net/p/the-bill-is-coming-due-ai-coding</link><guid isPermaLink="false">https://www.stevewhittle.net/p/the-bill-is-coming-due-ai-coding</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Tue, 28 Apr 2026 02:15:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DLIk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd760c7-81e0-4110-8de3-c524cf061833_2752x1536.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DLIk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd760c7-81e0-4110-8de3-c524cf061833_2752x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DLIk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd760c7-81e0-4110-8de3-c524cf061833_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!DLIk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd760c7-81e0-4110-8de3-c524cf061833_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!DLIk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd760c7-81e0-4110-8de3-c524cf061833_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!DLIk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd760c7-81e0-4110-8de3-c524cf061833_2752x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DLIk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd760c7-81e0-4110-8de3-c524cf061833_2752x1536.heic" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7fd760c7-81e0-4110-8de3-c524cf061833_2752x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:509373,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/195687167?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd760c7-81e0-4110-8de3-c524cf061833_2752x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DLIk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd760c7-81e0-4110-8de3-c524cf061833_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!DLIk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd760c7-81e0-4110-8de3-c524cf061833_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!DLIk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd760c7-81e0-4110-8de3-c524cf061833_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!DLIk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd760c7-81e0-4110-8de3-c524cf061833_2752x1536.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If you&#8217;ve been using AI coding tools over the past two years, you&#8217;ve been getting a great deal. Frontier model access embedded in your IDE, powering your agents, running in your CI pipelines &#8212; for prices that don&#8217;t actually cover what it costs to serve you.</p><p>That&#8217;s not a bug. It&#8217;s a strategy. And strategies change.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.stevewhittle.net/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>This post is about what happens when the economics of AI-assisted coding get repriced, why that repricing is likely in the next 12 to 24 months, and what your engineering organization should be doing right now before the bill arrives.</p><h2>The Free Ride Won&#8217;t Last</h2><p>According to internal projections reported by the Wall Street Journal, OpenAI does not expect to reach profitability until 2030. Anthropic projects reaching positive free cash flow by 2027 or 2028. Both companies are growing revenue at extraordinary rates. Anthropic recently reported annualized revenue exceeding $30 billion. But revenue and profit are not the same thing, and right now, for these companies, the two are very far apart.</p><p>The structural problem is inference compute. OpenAI spent roughly 50% of its revenue on inference costs alone in recent years, with training costs pushing total expenditure well above what comes in. Every token you generate costs real money in GPU time. The pricing you see at the API does not reflect what it actually costs providers to serve those tokens.</p><p>Open-weight models of comparable capability cost roughly 17x to 18x less than Anthropic&#8217;s API pricing, and they come from providers who are covering their costs and making a margin. That isn&#8217;t an indictment of Anthropic&#8217;s business model. It reflects real differences in model capability, trust, tooling maturity, and enterprise positioning.</p><p>The narrative you&#8217;ll hear most often is that inference costs will keep falling and everything will work out. 
That narrative has been repeated for three years. Inference costs for frontier models have not followed the curve that optimists projected, partly because each new model generation is larger and more capable than the last, which resets the compute baseline. Lower prices on last-generation models don&#8217;t help you if you need current-generation capability.</p><p>The current pricing environment is a competitive land-grab. It&#8217;s not sustainable.</p><h2>The IPO Pressure Cooker</h2><p>Both Anthropic and OpenAI are moving toward public markets. Anthropic has engaged IPO counsel and is reportedly discussing an offering as early as Q4 2026, targeting a raise exceeding $60 billion. OpenAI is targeting a similar timeline at a valuation approaching $1 trillion.</p><p>Private investors fund growth stories and tolerate long paths to profitability. Institutional fund managers running discounted cash flow models do not. The S-1 filing will contain actual unit economics for the first time. Analysts will model gross margins. Price-to-earnings ratios will matter in a way they don&#8217;t when you&#8217;re raising from VCs.</p><p>This creates a specific incentive. Both companies have strong motivation to show margin improvement before listing, not after. The levers available are cost reduction (hard, because compute costs are driven by usage and model scale) and price increases (a decision that can be made in an afternoon).</p><p>There&#8217;s also the Inference Trap: You build the best model, usage surges, inference compute explodes, and you face a forced choice between throttling users, raising prices, or cannibalizing the training compute you need to stay competitive. Anthropic experienced five major platform outages in a single month in early 2026. Claude Code users reported burning through usage allocations far faster than the pricing implied. 
That&#8217;s the Inference Trap operating in real time.</p><p>The combination of IPO pressure and Inference Trap dynamics makes a repricing event not just plausible but structurally likely. The question isn&#8217;t whether it happens. It&#8217;s whether you&#8217;re ready when it does.</p><h2>What You&#8217;ve Actually Built On</h2><p>Most engineering teams believe they have less AI vendor lock-in than they actually do. The assumption is: swap the API key, update the model name, done. There&#8217;s a lot more to it than that. It gets worse the deeper into agentic workflows you go.</p><p>The lock-in profile varies significantly by use case:</p><p><strong>IDE-embedded tools</strong> like Copilot, Cursor, and Claude Code represent the shallowest lock-in. You could switch IDEs or model backends with little effort. But don&#8217;t underestimate soft stickiness. Developer muscle memory, <code>.cursorrules</code> customizations, team-shared system prompts, and workflow integrations all add switching friction. A price increase here hits developer productivity budgets, which are visible and politically sensitive.</p><p><strong>Agentic coding workflows</strong> are where real lock-in begins. Agentic systems don&#8217;t just call a model; they build scaffolding around it. System prompts are tuned to a specific model&#8217;s personality and failure modes. Tool-calling schemas are optimized for how that model interprets them. Retry logic and output parsing are calibrated to observed behavior. When you switch models, that scaffolding doesn&#8217;t transfer cleanly. You&#8217;re not changing a config parameter. You&#8217;re running a re-evaluation campaign against your own codebase. Industry data suggests that migration costs average over $315,000 per project when provider lock-in forces a move, and that figure reflects situations where teams already had some abstraction in place.</p><p><strong>CI/CD and automated pipelines</strong> carry the highest risk. 
These are production systems with determinism requirements. Prompts optimized for one model may produce subtly different outputs on another. Those outputs can look similar enough to pass manual inspection but break downstream parsers and validation steps. Model version pinning provides a false sense of stability because providers deprecate models with 90 days&#8217; notice, and there is no guarantee of behavioral equivalence between versions. The fundamental problem is that you cannot treat an LLM call in a production pipeline the same way you treat a deterministic function call. When you switch models, you have to prove the pipeline still works. You cannot assume it.</p><h2>Open Source is a real option, but it has gaps</h2><p>The obvious response to pricing risk is to use open source models, self-host, and pay for compute instead of markup. That path is more viable than it was 18 months ago but has real gaps that tend to be underestimated.</p><p>The capability gap has largely closed on many dimensions. Open models now match or surpass closed models on knowledge benchmarks, mathematical reasoning, and graduate-level science. The gap that remains is concentrated where it matters most for coding: production-level agentic tasks, multi-step software engineering, and complex tool use. On SWE-bench Verified, the most practically meaningful coding benchmark, the best open models are within a few points of frontier closed models. That gap is still an issue at the tail of task complexity.</p><p>The price differential is big. DeepSeek V3.2 is available at roughly $0.28 per million input tokens. Claude Opus 4.7 is $5.00 per million input tokens. That&#8217;s roughly an 18x difference. For high-volume workloads, the economics are compelling even accounting for operational overhead.</p><p>But here&#8217;s what the open source advocates undersell: switching models is not the same as switching model providers. The scaffold matters as much as the model. 
Real-world benchmarks show a 22-point swing on the same task with the same model when you change the agent scaffold and tooling. Switching models requires re-validating your entire system, not just verifying the model output looks reasonable.</p><p>The operational burden of self-hosting is a real cost transfer. Inference infrastructure, model serving with tools like vLLM or Text Generation Inference, GPU provisioning, update cadence, and security patching all fall on your team. For most organizations without dedicated ML infrastructure experience, this isn&#8217;t a savings. It&#8217;s a new operational surface area.</p><p>There&#8217;s also a geopolitical dimension worth naming directly. The strongest open models right now (DeepSeek, Qwen, Kimi) are Chinese-developed. For organizations with data sovereignty requirements, government contracts, or security-sensitive codebases, the lineage of a model matters. This isn&#8217;t a reason to dismiss these models outright, but it&#8217;s a factor that belongs in your architecture decision.</p><h2>The Protocol Layer Is Your Best Friend</h2><p>The most practical near-term lever against lock-in isn&#8217;t switching to open source. It&#8217;s building an architecture that makes switching possible.</p><p>Model Context Protocol (MCP) is the most significant structural development here. It was originally developed by Anthropic and then donated to the <strong>Agentic AI Foundation (AAIF)</strong>, which was co-founded by Anthropic, Block, and OpenAI. MCP has achieved something rare: genuine cross-industry adoption. OpenAI abandoned their proprietary Assistants API and adopted MCP. Google DeepMind, Microsoft, and AWS are all on board. When direct competitors converge on a shared infrastructure standard, it signals inevitability.</p><p>MCP decouples the agent-tool connection layer from the model layer. 
Your integrations with databases, APIs, filesystems, and external services are built once against the MCP standard and survive a model swap. That&#8217;s the right layer to standardize at.</p><p>Pair that with an LLM Gateway such as LiteLLM or Portkey, middleware that abstracts provider-specific API differences behind a single interface, and you get a system where the model backend is genuinely swappable without rebuilding your application logic. The marginal complexity cost of adding this abstraction early is low. The switching optionality it creates is high.</p><p>Be honest about what MCP doesn&#8217;t solve, though. The protocol handles tool integration, not model behavior. When you swap models, your prompts still need re-validation. MCP can also consume 40-50% of the available context window before any actual work begins, which creates real production tradeoffs. Standards help. They don&#8217;t eliminate the work.</p><h2>What You Should Do Today</h2><p>The cost of acting on this now is low. The cost of acting after a pricing shock is high.</p><p><strong>For IDE tools:</strong> Evaluate whether your current tooling is model-agnostic or model-bundled. Prefer tools that let you swap backends. Baseline your developer productivity metrics now. You need a measurement baseline before any changes hit, not after.</p><p><strong>For agentic workflows:</strong> Add an LLM Gateway from the start of any new project. Keep your agent orchestration layer architecturally separate from your model API calls. This is the single highest-leverage structural decision you can make. Build evaluation suites against your own codebase, not generic benchmarks. Generic benchmarks tell you how a model performs in the abstract. Your eval suite tells you whether you can safely swap models in your specific system.</p><p><strong>For CI/CD pipelines:</strong> Treat every LLM call as a third-party dependency with explicit versioning, SLA monitoring, and a tested fallback path. 
Design for graceful degradation. What does the pipeline do when the model endpoint is slow, unavailable, or has been updated? This should be a documented decision, not an untested assumption.</p><p><strong>Across all use cases:</strong> Audit your current AI spend and its concentration across providers. Most teams have no idea what this number is. Monitor the IPO timelines. The S-1 filings will be the first time the public sees actual unit economics from these companies, and they will move the conversation. Build internal familiarity with at least one open-weight model family. Even if you never deploy it in production, that knowledge reduces the information asymmetry in any future pricing negotiation.</p><h2>A Strategy Note, Not a Panic Note</h2><p>The goal here is not to abandon frontier models. They are genuinely better at certain tasks, the tooling ecosystem around them is more mature, and for many use cases the productivity gains justify whatever they end up costing.</p><p>The goal is not to be surprised. More specifically, the goal is to avoid being in the position of needing to move urgently with no alternatives evaluated and no time to build them.</p><p>Engineering organizations that have done the architecture work to reduce switching costs will have options when prices move. They&#8217;ll be able to make a deliberate choice between absorbing the increase, substituting a capable alternative, or negotiating from a position of real leverage. Organizations that haven&#8217;t done this work will face a different situation: urgent need, unknown switching cost, and a vendor who knows it.</p><p>The bill is coming. The amount is unknown. 
The only variable you control is how ready you are to pay someone else instead.</p>]]></content:encoded></item><item><title><![CDATA[Replacing managers with an AI World Model]]></title><description><![CDATA[Can AI completely replace managers?]]></description><link>https://www.stevewhittle.net/p/replacing-managers-with-an-ai-world</link><guid isPermaLink="false">https://www.stevewhittle.net/p/replacing-managers-with-an-ai-world</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Fri, 24 Apr 2026 15:22:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DK_d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf24efcc-5b78-4598-bbb9-257cec277b14_2816x1536.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DK_d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf24efcc-5b78-4598-bbb9-257cec277b14_2816x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!DK_d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf24efcc-5b78-4598-bbb9-257cec277b14_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!DK_d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf24efcc-5b78-4598-bbb9-257cec277b14_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!DK_d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf24efcc-5b78-4598-bbb9-257cec277b14_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!DK_d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf24efcc-5b78-4598-bbb9-257cec277b14_2816x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DK_d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf24efcc-5b78-4598-bbb9-257cec277b14_2816x1536.heic" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af24efcc-5b78-4598-bbb9-257cec277b14_2816x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:569657,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/195359985?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf24efcc-5b78-4598-bbb9-257cec277b14_2816x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DK_d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf24efcc-5b78-4598-bbb9-257cec277b14_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!DK_d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf24efcc-5b78-4598-bbb9-257cec277b14_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!DK_d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf24efcc-5b78-4598-bbb9-257cec277b14_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!DK_d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf24efcc-5b78-4598-bbb9-257cec277b14_2816x1536.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" 
fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Jack Dorsey and Roelof Botha published &#8220;From Hierarchy to Intelligence&#8221; on March 31, 2026. It is a serious piece of thinking. The historical framing is sharp, the diagnosis of why hierarchies exist is largely correct, and the argument that AI changes the information-routing constraint is real.</p><p>But the essay is about one-third of what managers actually do. It treats that one-third as the whole job, removes the people doing it, and calls the problem solved. The other two-thirds are still there. They just don&#8217;t have anyone doing them anymore.</p><p>To understand that better, we need to be precise about what managing actually is.</p><h2>Three Clusters, Not One</h2><p>In 1973, Henry Mintzberg published <em>The Nature of Managerial Work</em>, based on direct observation of what managers do with their time. 
His finding was that managerial work falls into three clusters: informational, interpersonal, and decisional.</p><p>The informational cluster is what Dorsey and Botha are talking about. Managers monitor what&#8217;s happening, disseminate that information across the organization, and represent the team to the outside world. This is the information routing function. It&#8217;s the layer that hierarchy was built to support, and it&#8217;s the layer that AI can now perform continuously, at scale.</p><p>That&#8217;s true. But it&#8217;s only one of three.</p><h2>What AI Cannot Route</h2><p>The interpersonal cluster covers things that depend on trust, relationship, and human accountability. The leader motivates and develops people, figures out who has potential and what they need to grow, and has the difficult conversations.</p><p>The liaison role involves building relationships across organizational boundaries, the kind of connective tissue that makes cross-functional work actually work. The figurehead role is about legitimacy and accountability. When something goes wrong, someone needs to be responsible in a way a system cannot be.</p><p>A telling data point from the Block restructuring itself: current and former employees told The Guardian that roughly 95% of AI-generated code changes still require human modification. This is in a remote-first, highly digital, machine-readable organization, exactly the environment that Dorsey describes as most amenable to this model. The humans are still in the loop not because the information system failed, but because the work itself requires judgment that isn&#8217;t in the data.</p><p>The decisional cluster is where this becomes even clearer. Mintzberg&#8217;s entrepreneur role involves sensing and acting on opportunities that aren&#8217;t visible yet. A world model, by definition, can only reflect what has already happened. It cannot tell you what to build that doesn&#8217;t exist yet. 
The disturbance handler role is about responding to crises and genuinely novel situations, exactly the circumstances where pattern-matching on historical data is most likely to fail. Resource allocation and negotiation involve competing interests, trust between parties, and judgment under uncertainty. In regulated industries (financial services, healthcare, any domain with fiduciary obligations), you can&#8217;t delegate these decisions to a system regardless of how good it is.</p><h2>The Risks Worth Naming</h2><p>Beyond the functional gaps, there are a few things worth noting:</p><p><strong>Post-hoc rationalization?</strong> Block cut 40% of its workforce in February 2026, before the essay was published. The stock jumped roughly 22%. Botha, who co-authored the essay, sits on Block&#8217;s board. Morgan Stanley upgraded Block to overweight after the cuts. Goldman Sachs raised its price target. It is fair to ask whether the intellectual framework followed the business decision or preceded it. That doesn&#8217;t make the argument wrong, but it does mean the incentive to believe the argument is quite strong for the people making it.</p><p><strong>The flat structure graveyard.</strong> Zappos tried holacracy. Valve famously ran without managers. The Spotify model has been widely adopted and widely struggled with. These experiments didn&#8217;t fail because the idea was wrong in theory. They failed because removing formal structure doesn&#8217;t remove the need for coordination. It just moves coordination into informal channels, where it becomes invisible, political, and dependent on whoever has the most social capital. The information routing problem gets solved. The interpersonal and political problems get worse.</p><p><strong>Data completeness.</strong> A world model built from Slack threads, Jira tickets, pull requests, and performance metrics reflects what was written down. A significant fraction of organizational knowledge is never written down. 
It lives in the judgment calls that didn&#8217;t make it into a doc, the context a senior engineer carries about why a system was built the way it was, the reason a decision was made three years ago that nobody remembers to explain to new people. The model sees the artifact. It doesn&#8217;t see the reasoning behind it.</p><p><strong>Data quality and drift.</strong> This one is distinct from completeness and is arguably more dangerous. Information that was accurate when it entered the system becomes stale. The system continues to present it with the same authority as fresh data. Decisions get made on information that was true six months ago and isn&#8217;t anymore. You don&#8217;t see the error at the time. It shows up later, in ways that are very hard to trace back to their source.</p><p>This is a documented, recurring failure in knowledge management systems generally. It&#8217;s not theoretical. Platforms like Guru have built their core product differentiation specifically around the knowledge freshness problem because the industry learned, repeatedly, that drift is the default. Small errors accumulate in decisions that each look reasonable in isolation, until something downstream breaks in a way nobody can explain.</p><p><strong>Regulatory reality.</strong> For companies operating in financial services or healthcare, the question of whether AI can replace decision-making isn&#8217;t just organizational. It&#8217;s legal. Explainability requirements, fair lending law, and fiduciary duty don&#8217;t care how good your model is. Decision accountability cannot be delegated to a system in regulated domains. This alone may prevent this type of world model from being used in certain industries.</p><h2>Where This Is Actually Going</h2><p>None of this means the Dorsey/Botha thesis is wrong. I just think it&#8217;s incomplete.</p><p>The informational cluster is being automated. That is a real and permanent shift. 
Managers who spent the majority of their time aggregating context, relaying status, and maintaining alignment across teams are already less necessary than they were. That part of the argument makes a lot of sense.</p><p>What&#8217;s more interesting is what happens to the other two clusters as this plays out. If information routing gets absorbed by AI systems, the interpersonal and decisional work doesn&#8217;t disappear. It becomes more visible, and its value should be more obvious. The managers who survive this transition are the ones who were always doing the work that was hardest to put on a job description.</p><p>The open question is whether that work changes in character as organizations become more AI-instrumented, or whether it simply becomes more prominent because everything else has been stripped away. Does managing people become fundamentally different when the coordination layer is automated? Or does it turn out that the relational, developmental, and judgment-intensive work was always the real job, and the information routing was just the overhead we confused for the substance?</p><p>I don&#8217;t think anyone knows yet. Block&#8217;s Q1 2026 results will be a first data point. If they hit $12.2 billion in gross profit with 40% fewer people, the thesis gets harder to argue with. If they don&#8217;t, their Roman Army comparisons will age badly.</p><p>Either way, the question is worth asking more carefully than &#8220;can AI replace the org chart.&#8221; The org chart was never the point. It was just the structure we built to solve three problems at once. AI can solve one of them. The other two? 
Let&#8217;s see.</p>]]></content:encoded></item><item><title><![CDATA[Is This Really What You Want to Measure?]]></title><description><![CDATA[Measuring Engineering Outcomes]]></description><link>https://www.stevewhittle.net/p/is-this-really-what-you-want-to-measure</link><guid isPermaLink="false">https://www.stevewhittle.net/p/is-this-really-what-you-want-to-measure</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Wed, 22 Apr 2026 23:19:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pwWd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4f1dad2-a99e-4408-8460-234b09061e3e_2816x1536.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pwWd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4f1dad2-a99e-4408-8460-234b09061e3e_2816x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!pwWd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4f1dad2-a99e-4408-8460-234b09061e3e_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!pwWd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4f1dad2-a99e-4408-8460-234b09061e3e_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!pwWd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4f1dad2-a99e-4408-8460-234b09061e3e_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!pwWd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4f1dad2-a99e-4408-8460-234b09061e3e_2816x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pwWd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4f1dad2-a99e-4408-8460-234b09061e3e_2816x1536.heic" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4f1dad2-a99e-4408-8460-234b09061e3e_2816x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:530430,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/195183353?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4f1dad2-a99e-4408-8460-234b09061e3e_2816x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pwWd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4f1dad2-a99e-4408-8460-234b09061e3e_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!pwWd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4f1dad2-a99e-4408-8460-234b09061e3e_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!pwWd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4f1dad2-a99e-4408-8460-234b09061e3e_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!pwWd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4f1dad2-a99e-4408-8460-234b09061e3e_2816x1536.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" 
fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I&#8217;m reading C. Thi Nguyen&#8217;s new book, <em>The Score</em>, and one idea in it keeps coming back to me. He calls it <strong>value capture</strong>: the process by which a rich, meaningful goal gets quietly replaced by the metric you were using to track it. You start measuring something because it points toward what matters. Then, gradually, the measurement <em>becomes</em> what matters. The original goal doesn&#8217;t disappear. It just stops being the thing that drives decisions.</p><p>That&#8217;s not a philosophy problem. That&#8217;s every day in most engineering organizations.</p><h2>What we actually measure</h2><p>Let&#8217;s be clear about what passes for engineering measurement in most companies.</p><p><strong>Output metrics</strong> are the default: story points, tickets closed, pull requests merged, lines of code written. They&#8217;re easy to collect, easy to visualize, and they feel like signal. The problem is they measure production, not value. A team can close 200 tickets in a sprint and ship nothing a customer cares about. Story points aren&#8217;t a unit of value; they&#8217;re a unit of negotiated effort, and that negotiation starts the moment someone decides to track them.</p><p><strong>Performance metrics</strong> try to go one level deeper: code review turnaround time, sprint commitment hit rate, on-call response time, test coverage percentages. These are more interesting because they reflect process health. But they still measure fidelity to a process, not effectiveness of the work. A team can hit 95% sprint commitment every week by sandbagging estimates. 
Reviews can be fast because nobody&#8217;s actually reviewing.</p><p><strong>Efficiency metrics</strong> are where most engineering organizations have landed recently, particularly DORA, the four-key-metrics framework: deployment frequency, lead time for changes, change failure rate, and time to restore service. DORA is genuinely useful as a diagnostic. The problem is what happens when it moves from a team-level health check to a leadership dashboard. Deployment frequency gets gamed by trivializing deployments. Lead time gets gamed by where you start the clock. You end up with a team deploying 15 times a day that is still six months from shipping anything meaningful.</p><p>Every one of these metric categories lives entirely inside the engineering system. None of them has a direct connection to whether the engineering organization is actually doing its job.</p><h2>The game you didn&#8217;t know you were playing</h2><p>Nguyen&#8217;s book makes a distinction that hits differently in an engineering context. He separates <strong>striving play</strong> from <strong>achievement play</strong>. In striving play, the goal is to engage fully, the process, the judgment, the craft. In achievement play, the only thing that matters is the score. Great games are designed so that chasing the score also produces striving. You can&#8217;t get good at chess by gaming the scoring system; you actually have to get good at chess.</p><p>Institutional metrics are the opposite. They strip out the magic circle, Nguyen&#8217;s term for the temporary, voluntary frame that makes game constraints feel meaningful rather than oppressive. In a board game, you accept arbitrary rules because you chose to sit down and play. In a work context, those rules aren&#8217;t arbitrary and they aren&#8217;t optional. The score follows you. It shows up in your performance review. It gets presented to the board.</p><p>What&#8217;s left, once you remove the magic circle, is a system that rewards achievement play. 
And engineers, who are, professionally, some of the best problem-solvers in any room, find the optimal path to the score. This isn&#8217;t cynicism. It&#8217;s a completely rational response to the incentive structure you built.</p><p>The consequences are predictable:</p><p><strong>Velocity becomes sandbagging.</strong> The moment team velocity appears on a leadership dashboard, estimation inflates. Points expand to protect the team. After a few quarters, the number is politically stable and informationally useless.</p><p><strong>Deployment frequency rewards triviality.</strong> If deploying frequently is the metric, the rational move is to break work into smaller pieces, not because small batches are better (they often are, but for different reasons), but because each deployment ticks the counter.</p><p><strong>Commitment rates reward conservatism.</strong> Measure whether a team delivers what they promised and they&#8217;ll promise less. You&#8217;ll see consistently green dashboards and a team that&#8217;s quietly becoming slower.</p><p><strong>Code review speed becomes rubber-stamping.</strong> If time-in-review is visible, reviewers learn to approve fast. Technical debt accumulates invisibly while the metric looks healthy.</p><p>This is Goodhart&#8217;s Law in action: once a measure becomes a target, it ceases to be a good measure. But Nguyen&#8217;s framing adds something important: it&#8217;s not just that the metric gets corrupted. It&#8217;s that people&#8217;s <em>values</em> get reshaped around it. The engineers optimizing for velocity aren&#8217;t lying. They&#8217;ve internalized the metric. The metric has become, for them, what good work looks like. That&#8217;s value capture. That&#8217;s the thing that&#8217;s actually hard to fix.</p><h2>What these metrics are actually telling you</h2><p>This is worth being precise about, because the answer isn&#8217;t &#8220;nothing.&#8221; These metrics have legitimate uses. 
The failure is the mismatch between what they measure and what leaders use them to decide.</p><p>Output metrics such as velocity and tickets closed can tell you whether work is flowing through the system at all. A team whose velocity drops 40% in two sprints has a problem worth investigating. What they can&#8217;t tell you is whether the work matters.</p><p>Performance metrics are operational diagnostics. Long code review cycles, high defect escape rates, and chronic on-call fatigue are real signals about process dysfunction. Treat them as process indicators, not as performance scorecards.</p><p>DORA is pipeline health. Lead time and deployment frequency tell you something real about delivery capability. Change failure rate and MTTR tell you something about resilience. But all four are downstream of the question that actually determines whether an engineering organization is performing: are you building things that move the business, fast enough to matter?</p><p>None of the standard metrics reach that question. The reason is structural. They&#8217;re easy to collect because they live inside the engineering toolchain. Anything that requires a connection to product outcomes, customer behavior, or business results is harder, and that difficulty is precisely why it tends not to get measured.</p><h2>The missing layer</h2><p>The data we reach for first, the data that&#8217;s easy to collect and easy to present, systematically hides what&#8217;s actually going on. The measurement layer that&#8217;s missing isn&#8217;t more engineering metrics. It&#8217;s a feedback loop that closes outside engineering.</p><p>A few things that would actually tell you something:</p><p><strong>Outcome linkage.</strong> Can you connect a shipped feature to a measurable change in user behavior or business results? 
Not &#8220;we shipped it&#8221; but &#8220;after we shipped it, the thing it was designed to move, moved.&#8221; This requires instrumentation, a documented hypothesis before work starts, and a willingness to wait. None of those come naturally to sprint planning cadences. But without them, you&#8217;re measuring production, not impact.</p><p><strong>Flow efficiency.</strong> The ratio of value-added time to elapsed time in your delivery process is more interesting than raw velocity. A feature that takes 12 weeks from idea to production, with only 2 of those weeks involving actual engineering work, has a flow efficiency of about 17% (2 &#247; 12). That&#8217;s a systems problem, and throughput metrics will never surface it. The bottleneck isn&#8217;t in the work; it&#8217;s in the waiting.</p><p><strong>Technical health as a leading indicator.</strong> Complexity trends, dependency staleness, and incident frequency are imperfect but directionally useful signals about whether your codebase is getting easier or harder to extend. Engineering organizations that ignore these tend to see velocity collapse right when the business needs them to accelerate. It&#8217;s not a coincidence.</p><p><strong>Team capability over time.</strong> Are engineers growing? Are they staying? Are they increasingly autonomous, or increasingly dependent on a few specialists? These signals are slow to show up, but they&#8217;re leading indicators of whether the organization will still be functional in two years. No sprint metric captures them.</p><h2>Breaking the cycle</h2><p>Swapping bad metrics for better ones doesn&#8217;t solve the problem. If you replace velocity with DORA and keep the same incentive structure, you&#8217;ll get gamed DORA metrics in six months. Nguyen is clear on this in <em>The Score</em>: once value capture takes hold, the answer isn&#8217;t a better score. 
It&#8217;s rebuilding the conditions under which people can reclaim their own values.</p><p>In an engineering context, that means a few things:</p><p><strong>Separate diagnostic metrics from evaluation metrics.</strong> Deployment frequency is useful when a team uses it to understand their own pipeline. It becomes corrosive the moment it appears on a leadership report as a proxy for team performance. The same number. Completely different effect depending on who it&#8217;s for and what it drives.</p><p><strong>Measure outcomes and accept the latency.</strong> This requires leaders to resist the urge to instrument everything that can be instrumented. Define what you&#8217;re trying to move before work starts. Measure it after you ship. Accept that the feedback loop is slower than a quarterly review cycle, and if your performance management cycle is shorter than your product feedback cycle, that&#8217;s the real problem to fix.</p><p><strong>Make the system visible, not just the throughput.</strong> Flow efficiency, incident trends, and technical debt trajectories give teams and leadership a shared picture of systemic constraints. When the conversation shifts from &#8220;why didn&#8217;t you close more tickets&#8221; to &#8220;what&#8217;s blocking flow and what would it cost to fix it,&#8221; you&#8217;re at least asking questions that can produce useful answers.</p><p><strong>Let the teams define the metrics for their own work.</strong> This one is underrated. The people closest to the work know what signals matter and which ones can be gamed. Metrics designed by a team to understand their own performance are completely different from metrics imposed from above to evaluate them. The former creates accountability. The latter creates the conditions for value capture.</p><p><strong>Hold leadership accountable for outcome clarity.</strong> A lot of engineering metric gaming exists because the business hasn&#8217;t clearly defined what success looks like. 
If product leadership can&#8217;t say what a feature is supposed to change, and how they&#8217;ll know it worked, the engineering team will fill that vacuum with whatever proxy feels safest. Measurement quality is a leadership problem as much as a measurement problem.</p><h2>The question the dashboard can&#8217;t answer</h2><p>Look at whatever engineering metrics you&#8217;re currently tracking. For each one, ask: what decision would I make differently if this number were 20% better or worse? If the answer is &#8220;I&#8217;d evaluate someone&#8217;s performance differently,&#8221; the next question is whether you&#8217;re actually measuring what you want to optimize for, or just what&#8217;s available.</p><p>Nguyen ends <em>The Score</em> with the question the title comes from: is this the game you really want to be playing? Most engineering organizations never ask it. The dashboard is there. The numbers are green or red. The sprint review happens. Nobody stops to ask whether the game itself is worth playing.</p><p>Measuring the wrong things precisely is worse than not measuring. It gives you the confidence of data without the benefit of insight. The metric becomes the goal. The goal becomes the metric. The original purpose, building something that matters, gets quietly replaced. 
And everyone in the room is looking at the dashboard, wondering why the product still isn&#8217;t getting better.</p>]]></content:encoded></item><item><title><![CDATA[AI Coding and Context Tax]]></title><description><![CDATA[Generation is not Comprehension]]></description><link>https://www.stevewhittle.net/p/ai-coding-and-context-tax</link><guid isPermaLink="false">https://www.stevewhittle.net/p/ai-coding-and-context-tax</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Wed, 22 Apr 2026 21:24:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1Fgg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F933f2d08-d32c-4256-ad29-a72118e3154e_2816x1536.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Fgg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F933f2d08-d32c-4256-ad29-a72118e3154e_2816x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Fgg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F933f2d08-d32c-4256-ad29-a72118e3154e_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!1Fgg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F933f2d08-d32c-4256-ad29-a72118e3154e_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!1Fgg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F933f2d08-d32c-4256-ad29-a72118e3154e_2816x1536.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!1Fgg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F933f2d08-d32c-4256-ad29-a72118e3154e_2816x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Fgg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F933f2d08-d32c-4256-ad29-a72118e3154e_2816x1536.heic" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/933f2d08-d32c-4256-ad29-a72118e3154e_2816x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:496826,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/195174123?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F933f2d08-d32c-4256-ad29-a72118e3154e_2816x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1Fgg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F933f2d08-d32c-4256-ad29-a72118e3154e_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!1Fgg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F933f2d08-d32c-4256-ad29-a72118e3154e_2816x1536.heic 848w, 
https://substackcdn.com/image/fetch/$s_!1Fgg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F933f2d08-d32c-4256-ad29-a72118e3154e_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!1Fgg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F933f2d08-d32c-4256-ad29-a72118e3154e_2816x1536.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>When writing code using AI assistance, it&#8217;s easy to generate a lot of code in a very short amount of time. 
As the development process goes on, even more code will be created. This gets complicated when it comes to maintaining and expanding the code base. Over time, the person working on the code has to get back into it to address bugs or add features, and getting back to the point where you understand the code well enough takes time. I have seen that while it&#8217;s easy to create a lot of code with AI, it&#8217;s not as easy to come back into that code after it&#8217;s been created.</p><p>One thing we need to keep in mind when using AI to generate code is that there is now a difference between writing the code and understanding the code. In pre-AI software development, where you needed to write the code manually, you built up a certain understanding of what you were writing while you were writing it. With AI, that process goes away. The code is created, and you then need to understand it and become familiar with it before you can make updates.</p><p>When you write the code yourself, your brain creates an internal model of the system you&#8217;re building. When you step away from that code and come back, some of that internal model will persist. Something you have interacted with, rather than just read, tends to be retained longer. Since you&#8217;ve written the code yourself, it&#8217;s easier to come back into that code and continue developing. If you have not gone through that process, because you are reviewing code that was written by AI or by another person, then you&#8217;re establishing familiarity with the code each time without that pre-existing internal model. Coming up to speed is harder and takes longer before you can continue development.</p><h2>Protecting Comprehension</h2><p>The answer isn&#8217;t to use AI less. It&#8217;s to recognize that AI handles code generation but cannot handle comprehension. Comprehension doesn&#8217;t transfer. 
Each of the techniques below is a mechanism for forcing you to build a genuine internal model of the code, not just review it from the outside. If you spend 2 hours building something with AI that would have taken 2 days manually and then spend 45 minutes re-establishing context every time you return to it, the time you saved is spent sooner than you think.</p><p><strong>1. The spec-first inversion</strong></p><p>The conventional flow is: prompt &#8594; code &#8594; review. Start from intent instead. Before asking the AI to generate anything, write a 3&#8211;5 sentence description of <em>what the code should do and why</em>. Not a formal spec &#8212; just externalizing your intent. That document becomes your re-entry anchor, and it forces the model to work from your understanding rather than producing something you have to reverse-engineer.</p><p><strong>2. Treat code review as active encoding, not QA</strong></p><p>When you review AI-generated code, the goal shouldn&#8217;t just be catching bugs &#8212; it should be deliberately building the internal model you didn&#8217;t get from writing it. Don&#8217;t rubber-stamp diffs. Walk the logic path. Rename things that don&#8217;t reflect <em>your</em> mental model. The review is the learning; skip it and you&#8217;re adding to the context debt, not paying it down.</p><p><strong>3. The commit message as cognitive snapshot</strong></p><p>Write commit messages as if you are explaining the change to yourself six months from now &#8212; not what changed, but <em>why</em> and <em>what you understood about the system at the time</em>. This is cheap, async, and gives future-you a scaffold that&#8217;s tied to the exact moment of maximum context. Git log becomes a context recovery tool, not just a change log.</p><p><strong>4. AI-assisted re-entry</strong></p><p>Use AI to rebuild context, not just create code. 
Paste the module into a fresh session and ask: &#8220;Explain what this does, what it assumes, and what would break if X changed.&#8221; That AI-generated explanation &#8212; corrected where wrong &#8212; becomes your working model. You&#8217;re using the same tool that created the debt to help pay it down.</p><p>What we&#8217;re seeing with AI-assisted software development is that it is not a panacea. AI is changing the mental effort required to maintain understanding over time. It&#8217;s not just about how quickly we can create the code. It&#8217;s about being able to sustain that comprehension. The key takeaway is to use AI to accelerate coding without outsourcing the understanding of that code. Humans still need to be able to comprehend it long after it is written.</p>]]></content:encoded></item><item><title><![CDATA[The software development bookshelf]]></title><description><![CDATA[I&#8217;ve been doing some thinking about how software development has changed since the introduction of AI for code generation.]]></description><link>https://www.stevewhittle.net/p/the-software-development-bookshelf</link><guid isPermaLink="false">https://www.stevewhittle.net/p/the-software-development-bookshelf</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Sun, 19 Apr 2026 23:25:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jTcq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7119e4c1-b6ce-46e1-b43d-f638c11144bc_2816x1536.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jTcq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7119e4c1-b6ce-46e1-b43d-f638c11144bc_2816x1536.heic" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jTcq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7119e4c1-b6ce-46e1-b43d-f638c11144bc_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!jTcq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7119e4c1-b6ce-46e1-b43d-f638c11144bc_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!jTcq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7119e4c1-b6ce-46e1-b43d-f638c11144bc_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!jTcq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7119e4c1-b6ce-46e1-b43d-f638c11144bc_2816x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jTcq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7119e4c1-b6ce-46e1-b43d-f638c11144bc_2816x1536.heic" width="1456" height="794" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7119e4c1-b6ce-46e1-b43d-f638c11144bc_2816x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:751391,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/194741594?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7119e4c1-b6ce-46e1-b43d-f638c11144bc_2816x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jTcq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7119e4c1-b6ce-46e1-b43d-f638c11144bc_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!jTcq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7119e4c1-b6ce-46e1-b43d-f638c11144bc_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!jTcq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7119e4c1-b6ce-46e1-b43d-f638c11144bc_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!jTcq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7119e4c1-b6ce-46e1-b43d-f638c11144bc_2816x1536.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>I&#8217;ve been doing some thinking about how software development has changed since the introduction of AI for code generation. Having been through a lot of these changes recently with building a new product, I think that it&#8217;s become a lot more interesting, but it&#8217;s also highlighted some important parts that may not have been given the importance they deserve.</p><p>The way I look at software development now is like a bookshelf. You have bookends on either end with books in between. In this metaphor, the LLM or the AI system is the books. This is what generates actual code, whether you&#8217;re using Claude Code or OpenAI, Google or anything else. 
The bookends are the parts that hold up that code.</p><p>I see the bookends as follows:</p><p>The first bookend is the definition of what you actually want to build. This was always important in traditional software engineering because engineering resources were scarce. Development took a long time, so there was a lot of analysis, discussion, and definition ahead of time, sometimes too much. You would spend months talking about something only to realize that you had missed the window and didn&#8217;t need it anymore. That discipline fell by the wayside as vibe coding came about. People could just type in a one- or two-sentence prompt and get back something that was kind of what they wanted, but not really.</p><p>The second bookend is verification and validation. Even if you&#8217;ve given it clear instructions, AI is a probabilistic system, so you are going to get something that&#8217;s probably very close to, but not exactly, what you need. That verification process requires a human to look at the output with human-level context. They need to look at what&#8217;s been built and validate that it did the right thing, that it built what they thought it was going to build. The human must then make changes, maybe small, maybe big, to get that code or that application really production ready.</p><p>Without those bookends, as you can see in the way some people have approached coding, you end up with garbage in, garbage out. Without the human work on either side, the AI is going to build something, but it probably will not build exactly what you want.</p><p>So with all this you might be thinking, well, what are we really getting with AI coding? In short: speed. We&#8217;re also getting a fair amount of accuracy.</p><p>If you&#8217;ve properly defined the problem, the AI system may generate thousands, tens of thousands, hundreds of thousands of lines of code. 
Without AI, just the physical typing of that much code can take weeks or months, especially once you account for typos and fatigue. This is oversimplifying, but it is the manual work that needs to be done to build an application. What AI is doing is automating that manual piece. Automation allows you to do the right thing really quickly but it also allows you to do the wrong thing very quickly. Even though it can do a lot of that bulk work, we still need that validation piece.</p><p>The bookend work that humans need to do in this process requires a certain amount of domain knowledge, skill, and overall knowledge of how systems work. This generally requires someone who has experience.</p><p>The problem we&#8217;re starting to run into becomes clear if we look back at our bookshelf. The book portion, the actual typing of code and debugging of typos and logic issues, was typically done by entry-level software developers. By doing this, they would gain experience and be able to move up to become more senior software developers.</p><p>The challenge that we&#8217;re going to face is that, with the work that may typically have been done by junior developers now being done more by AI, you end up with a need for fewer junior developers. We&#8217;re starting to see now that companies are not hiring as many of them. That is short-sighted because when the current senior software developers have retired or left the company, you won&#8217;t have enough junior software developers to move up into those senior roles.</p><p>This is a problem that has not yet been solved. There needs to be a better way to address this because we still need a pathway for people to enter the industry to do those bookend tasks. It will be interesting to see how the industry evolves around that.</p><p>The latest and greatest coding models get better and better at generating code all the time. 
This is great, and when you look at code generation from this point of view, the frontier models are still helpful. However, a lot of the code that&#8217;s going to be generated, assuming you&#8217;ve clearly documented the problem and you&#8217;re validating the results, may not require as powerful of a model as you might think. This will depend on the code that you need to generate. If you&#8217;re doing something like medical diagnoses or tackling physics problems, or dealing with vast amounts of data then yes, you want a very high-power model. I suspect that a simpler model may do just as well in many cases, assuming you have those bookends in place. I don&#8217;t have hard data on this but I will be interested to see how things evolve.</p>]]></content:encoded></item><item><title><![CDATA[The Ticket is Dead, Long live the Spec]]></title><description><![CDATA[The Core Shift: AI Collapses the Cost of Implementation, But Not the Cost of Deciding What to Build]]></description><link>https://www.stevewhittle.net/p/the-ticket-is-dead-long-live-the</link><guid isPermaLink="false">https://www.stevewhittle.net/p/the-ticket-is-dead-long-live-the</guid><dc:creator><![CDATA[Steve Whittle]]></dc:creator><pubDate>Fri, 17 Apr 2026 15:10:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IZsI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384f4be1-3cae-435c-8281-473ac00a4a4a_2816x1536.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IZsI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384f4be1-3cae-435c-8281-473ac00a4a4a_2816x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!IZsI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384f4be1-3cae-435c-8281-473ac00a4a4a_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!IZsI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384f4be1-3cae-435c-8281-473ac00a4a4a_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!IZsI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384f4be1-3cae-435c-8281-473ac00a4a4a_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!IZsI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384f4be1-3cae-435c-8281-473ac00a4a4a_2816x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IZsI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384f4be1-3cae-435c-8281-473ac00a4a4a_2816x1536.heic" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/384f4be1-3cae-435c-8281-473ac00a4a4a_2816x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:354769,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.stevewhittle.net/i/194528028?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384f4be1-3cae-435c-8281-473ac00a4a4a_2816x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IZsI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384f4be1-3cae-435c-8281-473ac00a4a4a_2816x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!IZsI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384f4be1-3cae-435c-8281-473ac00a4a4a_2816x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!IZsI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384f4be1-3cae-435c-8281-473ac00a4a4a_2816x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!IZsI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384f4be1-3cae-435c-8281-473ac00a4a4a_2816x1536.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>As AI-assisted coding gains wider adoption, we need to look at the tools we use to manage the development lifecycle. Historically, the software world has lived in ticketing systems like Jira. This made sense when code was the bottleneck; Jira allowed us to define, plan, and track the manual labor of writing code. When the &#8220;act&#8221; of coding required significant time and resourcing, tracking velocity and story points provided a necessary although imperfect visibility into progress.</p><p>However, with AI now drastically reducing the effort required to generate code, our legacy metrics&#8212;points, velocity, and story size&#8212;are starting to break down. We have an opportunity to rethink the development process entirely.</p><p>While bugs and features both require attention, the impact of AI is most profound in the feature development process. Automation is a double-edged sword: it helps you do the right thing quickly, but it also helps you do the wrong thing faster. To avoid the latter, we must shift our focus to three core phases: Specification, Development, and Verification.</p><h3><strong>Specification: Raising the Bar</strong></h3><p>In an AI-augmented workflow, the &#8220;size&#8221; of a ticket changes. We are no longer limited to small, rigidly defined bits of functionality. We can now deliver much larger chunks of work in a single pass. 
This shift, however, raises the bar for our upfront definitions.</p><p>In a Jira context, the ticket should serve primarily as a tracking mechanism that anchors the specification&#8212;whether that spec lives in the ticket itself or is linked via Notion or Confluence. The goal is to create a &#8220;source of truth&#8221; that defines the feature with enough clarity that an AI can execute it and, more importantly, so a human can verify the result. The planning and point-assigning &#8220;middle&#8221; of the process is becoming less relevant; the real work is now happening in definition and verification.</p><h3><strong>Development: Capturing the Thought Process</strong></h3><p>We cannot treat AI development as a &#8220;black box&#8221; where a spec goes in and code comes out. While AI-assisted coding frees engineers from the drudgery of syntax and manual typing, the engineer&#8217;s role as a &#8220;guide&#8221; is more critical than ever.</p><p>We need to capture the thought process behind the implementation. Tools like Claude Code and Cursor already allow us to export session data that details how an engineer navigated a problem, where the AI stumbled, and how it was corrected. By automatically appending these session summaries to Jira tickets, we can maintain a complete audit trail of the engineering logic without adding manual overhead for the developer.</p><h3><strong>Verification: The New Bottleneck</strong></h3><p>If AI has removed the bottleneck of writing code, it has moved it to verification. The temptation with AI tools is to move fast and &#8220;test it in production&#8221; to see if it works. This is backward.</p><p>Because of the potential for AI hallucinations and the risk of building on top of incompletely defined requirements, rigorous verification is now more important than it was in the manual era. 
If we accept poorly defined or unverified code into our codebase today, we are simply compounding technical debt at an accelerated rate.</p><h3><strong>Closing the Loop</strong></h3><p>This shift isn&#8217;t something we need to wait for Jira or other vendors to solve. You can implement this workflow today by using existing tools as simple tracking mechanisms for robust specifications and automated session logs.</p><p>Over time, I expect legacy ticketing features focused on granular implementation steps to be sunset or deprecated as the industry moves toward this new model. By focusing on the feedback loop&#8212;using AI to improve specifications and insisting on rigorous verification&#8212;we can make sure that &#8220;faster&#8221; also means &#8220;better.&#8221;</p>]]></content:encoded></item></channel></rss>