The Floor Rises

What happens when AI moves off the cloud?

Jun 18, 2026

At WWDC this month, Apple introduced a piece of software it calls the system orchestrator. It routes every AI request to the model that should handle it, on the device or in the cloud, based on how much compute the task needs and how sensitive the data is. Craig Federighi called it the key to the system’s privacy architecture.

That routing decision is the most consequential architectural shift in consumer AI since the transformer. Not the model. The decision about where the model runs.

Cloud by default was never a decision

From 2022 through 2024, anything that looked like reasoning went to an API. GPT-4, Claude, Gemini all lived in central data centers. On-device AI meant autocomplete and spell check. The hardware floor on a phone was too low to run a capable model, so the model lived somewhere else.

The constraint felt permanent because it had always been true. But “always has” does not mean “always will.” We made the same error with local storage, with video on mobile, with encryption in the browser. The thing that is too expensive to do locally stays in the data center right up until it doesn’t.

The floor is already higher

Look at what already left the cloud. Real-time transcription runs on your phone. Translation runs on your phone. So do photo understanding, system-wide dictation, short-form classification, and writing suggestions. None of these phone home anymore. Apple’s second on-device model, the more capable one for current Apple silicon, understands speech, reads text and images, and never leaves the device.

We should measure the floor by what it can do, not by its parameter count. In 2023 the floor was “can it follow an instruction.” In 2026 the floor is “can it handle most single-document knowledge work without a network connection.”

The reason this happened so fast is the efficiency curve underneath. Sparse activation, distillation, and aggressive quantization decoupled capability from cost. A sparse model carries a large parameter count but uses only a fraction of those weights on any given token. The headline says twenty billion parameters. The actual work on a prompt might touch a few billion. That gap is the trick that put a real model on a phone.
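The arithmetic behind that gap is simple enough to sketch. The numbers below are hypothetical, chosen only so the total lands on twenty billion; they do not describe any shipping model. The sketch assumes a mixture-of-experts layout in which some weights are shared by every token and the rest sit in experts, of which only a few are consulted per token.

```python
def moe_param_counts(shared_b, per_expert_b, n_experts, top_k):
    """Return (total, active) parameter counts in billions for a toy MoE model.

    shared_b:     weights every token uses (attention, embeddings), in billions
    per_expert_b: weights in one expert, in billions
    n_experts:    experts stored in the model
    top_k:        experts actually consulted per token
    """
    total = shared_b + per_expert_b * n_experts
    active = shared_b + per_expert_b * top_k
    return total, active


def weight_memory_gb(params_b, bits_per_weight):
    """Storage for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    return params_b * bits_per_weight / 8


# Hypothetical configuration, picked to match the "twenty billion" headline.
total_b, active_b = moe_param_counts(
    shared_b=2.0, per_expert_b=0.5625, n_experts=32, top_k=2
)
fp16_gb = weight_memory_gb(total_b, 16)
int4_gb = weight_memory_gb(total_b, 4)

print(f"total: {total_b}B, active per token: {active_b}B")
print(f"weights at 16-bit: {fp16_gb} GB, at 4-bit: {int4_gb} GB")
```

With these made-up numbers the model stores 20B parameters but touches about 3.1B per token. Every expert still has to be stored, which is why quantization matters alongside sparsity: the same weights shrink from 40 GB at sixteen bits to 10 GB at four.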

The floor doesn’t rise on a smooth line. It jumps when a new efficiency technique lands and then sits flat until the next one. Mixture-of-experts was a jump. Quantization down to four bits was a jump. Nobody knows the slope of the next five years, but the direction is obvious.

Routing is the product, not the model

The orchestrator was the most underreported thing Apple showed. The press wrote about Siri’s new voice. The architecture was the real story, and it got glossed over.

What does the orchestrator weigh? Latency tolerance, privacy sensitivity, task complexity, and the hardware tier of the device in your hand. Send the easy local thing local. Send the hard private thing to Private Cloud Compute. This is not a new problem in infrastructure terms. A CDN decides whether to serve from the edge or the origin. A database decides whether a read hits a replica or the primary. Routing a request to the cheapest box that can satisfy it is the way we’ve always done it.

The new part is what the routing decision means. A CDN serving a cached asset gives you the same asset. An orchestrator routing to the small model gives you a different answer than the big model would have. The routing decision now changes the quality of the intelligence, not just the latency. That is a category of routing we haven’t built governance for.
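To make the shape of that decision concrete, here is a toy routing policy over the four inputs named above. It is a sketch under loud assumptions, not Apple’s orchestrator: the complexity score, the per-tier ceilings, the round-trip figure, and the `GENERAL_CLOUD` tier for hard but non-sensitive requests are all invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ON_DEVICE = "on_device"
    PRIVATE_CLOUD = "private_cloud"
    GENERAL_CLOUD = "general_cloud"  # hypothetical tier, not named in the essay


@dataclass(frozen=True)
class Request:
    complexity: float    # hypothetical score: 0.0 trivial, 1.0 frontier-hard
    sensitive: bool      # does the request carry private data?
    max_latency_ms: int  # how long the user will tolerate waiting


# Assumed complexity each hardware tier can handle locally.
LOCAL_CEILING = {"older": 0.3, "current": 0.6}
# Assumed network and queueing overhead for any cloud call.
CLOUD_ROUND_TRIP_MS = 400


def route(req: Request, device_tier: str) -> Route:
    """Pick the cheapest tier that can satisfy the request."""
    if req.complexity <= LOCAL_CEILING[device_tier]:
        return Route.ON_DEVICE
    if req.max_latency_ms < CLOUD_ROUND_TRIP_MS:
        # Too hard for the device, but the user cannot wait for the cloud:
        # degrade to the local model and accept a weaker answer.
        return Route.ON_DEVICE
    return Route.PRIVATE_CLOUD if req.sensitive else Route.GENERAL_CLOUD


hard_private = Request(complexity=0.9, sensitive=True, max_latency_ms=2000)
hard_urgent = Request(complexity=0.9, sensitive=True, max_latency_ms=100)
print(route(hard_private, "current"))  # goes to the private cloud tier
print(route(hard_urgent, "current"))   # stays on device, at lower quality
```

The second branch is the one a CDN never has. When the latency budget forces a hard request back onto the device, the user gets a different answer, not a slower copy of the same one.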

Whoever owns the routing layer owns the usage signal. The model that gets called is the model that sees the data, collects the feedback, and improves fastest. Apple makes that decision for more than a billion devices.

This is how Apple played it. The next-generation foundation models are being co-developed with Google’s Gemini, and the cloud tier runs on Nvidia GPUs inside Google’s cloud. Apple did not insist on owning the best model. It rented the model and kept the layer that decides which model you reach. The model is rentable. The routing layer is not. This is the key. If Apple decides somewhere down the road to switch to its own cloud-based AI, it can. The same thing happened with mapping software: Apple partnered with Google until it had developed its own. Who knows whether Apple will develop its own models, but if it does, it can switch them on very quickly.

The open question is whether the orchestrator stays an honest dispatcher or becomes a toll booth.

Regulation is fragmenting AI by geography

Consumer privacy preference is real and soft. People say they care and then tap accept. Regulation does not depend on what anyone prefers. It is hard, and it is fragmenting AI deployment along national lines.

Look at the Siri example. At WWDC, Apple announced that Siri AI will not ship in the EU with iOS 27. Apple framed the block as privacy and security. The cause is actually the Digital Markets Act, which is competition law, not privacy law. The European Commission rejected Apple’s framing in public and without hedging: in its account, nothing in the DMA blocked the launch, and the delay was Apple’s own choice. The real objection is that Siri AI, which is powered by Google’s Gemini, would lock out competing assistants on a platform that reaches hundreds of millions of users. The Commission’s spokesman named the Gemini detail on purpose.

This is not “privacy law is forcing AI onto devices.” It’s deeper than that. AI deployment is fragmenting by jurisdiction, and the pressure comes from several different bodies of law at once. Competition law in Brussels. Data localization in India, China, Saudi Arabia, and Brazil. A model that runs locally sidesteps an entire class of these fights because the data never crosses a border. The on-device model is not just faster and more private. It is jurisdictionally invisible.

This is where sovereignty stops being a slogan. A regulated enterprise or a government does not want a privacy assurance about someone else’s data center. It wants the weights, the inference stack, and the audit trail inside its own borders. Private Cloud Compute, however well it is engineered, is still someone else’s cloud in someone else’s country. The requirement is on-premise open-weight deployment. Models like Llama and Mistral are important here for a reason that has nothing to do with topping a benchmark. You can run them where the law says the data has to stay.

US export controls accelerate all of this. A country that cannot count on access will build or source its own. The recent retraction of Anthropic’s Fable 5 model is a real-world example of this. Scarcity is a fast teacher of self-reliance. The open question is whether on-device and on-premise can satisfy the provenance and audit demands that regulators will have.

Losing the commodity tier is not the threat

The tasks leaving the cloud were never the high-margin tasks. The person summarizing email on a phone was not paying for an API call. Losing that load costs the frontier labs close to nothing.

The real exposure is the middle. Function-level code completion. Document question-and-answer. Single-session knowledge work. These are mid-complexity and high-volume, and they are moving on-device, likely inside two years. The timing will vary based on the use case.

The floor does something quieter to the rest. When good-enough AI is free, private, and already in your pocket, every upgrade has to clear a higher bar to earn a dollar. Apple’s orchestrator teaches a billion people that AI is the free thing that already lives on the phone. That reshapes what AI is worth in the median person’s head before any frontier lab gets to make its pitch.

Volume and revenue are not the same curve. A frontier provider can lose sixty percent of its query volume and far less than sixty percent of its revenue, as long as it priced the commodity tier near zero in the first place. The query that left was the query they weren’t making money on anyway.
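A back-of-the-envelope calculation shows how far apart the two curves can sit. Every figure here is invented for illustration: a provider with two tiers, where the commodity tier carries sixty percent of the queries at a price one hundredth of the premium tier’s.

```python
def share_lost(tiers, lost):
    """Return (fraction of query volume lost, fraction of revenue lost).

    tiers: {tier_name: (queries, price_per_query)}
    lost:  {tier_name: fraction of that tier's queries that moved on-device}
    """
    total_queries = sum(q for q, _ in tiers.values())
    total_revenue = sum(q * p for q, p in tiers.values())
    lost_queries = sum(tiers[name][0] * frac for name, frac in lost.items())
    lost_revenue = sum(
        tiers[name][0] * tiers[name][1] * frac for name, frac in lost.items()
    )
    return lost_queries / total_queries, lost_revenue / total_revenue


# Hypothetical provider: queries in millions, prices in dollars per query.
tiers = {
    "commodity": (600, 0.0001),
    "premium": (400, 0.01),
}

# Suppose the entire commodity tier moves on-device.
volume_lost, revenue_lost = share_lost(tiers, {"commodity": 1.0})
print(f"volume lost: {volume_lost:.0%}, revenue lost: {revenue_lost:.1%}")
```

Under these assumptions the provider loses sixty percent of its queries and about one and a half percent of its revenue. The result depends entirely on the price ratio between tiers, which is the point: the cheaper the commodity tier already was, the less its departure costs.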

So who is actually exposed? Not the top labs but the mid-tier API resellers and the vertical SaaS products whose only moat was access to commodity inference. When the inference is free and local, the wrapper has nothing left to sell. The move for the labs at the top is to go up. Own the reasoning and the agentic layer. Concede the commodity tier without a fight, because fighting for it means competing with free. The open question is whether “own the reasoning layer” is a durable position or just the next floor waiting to rise under it.

The grid did not die. Its economics did.

The useful precedent is electricity, not software. As the electrical grid matured, generation moved closer to the load. Rooftop solar, home batteries, and the central grid now coexist. Nobody decommissioned the grid. What changed permanently was the economics of building another central plant when the marginal load could be served closer to home.

The analogy holds here too. Local handles the frequent, ambient, low-stakes load. Central handles the peak, the complex, the high-stakes work. Neither one kills the other.

But here is where the analogy breaks: Electricity is fungible. A kilowatt-hour off your roof and a kilowatt-hour off the grid are the same kilowatt-hour, and your refrigerator cannot tell the difference. Intelligence is not fungible. The answer from the local model and the answer from the frontier model are different answers, sometimes meaningfully different, and the user often cannot tell which one they got.

The analogy gives us the shape and then fails at the load-bearing joint. The floor rising does not make the ceiling matter less. It makes the gap between floor and ceiling the thing you are selling. The product stops being “AI.” The product becomes the difference between the AI everyone has for free and the AI that is worth paying for. Pricing that gap, and proving it exists on any given task, is the whole game now.

Where do we go from here?

We can see the floor rising. We can see the gap becoming the product. What’s still unknown is who governs the routing layer.

Today Apple decides for Apple devices. Google decides for Android. Microsoft decides for Windows Copilot. None of these is neutral infrastructure. Each one is a company that also sells models, which means each routing decision is made by a party with a stake in the outcome.

The flywheel is the important part. The model that gets called gets the usage signal. The usage signal becomes the improvement that makes it the model worth calling next time. Whoever holds the orchestrator holds the flywheel. Right now the platform owners are holding it, and they did not have to buy a single frontier lab to do it.

If routing gets captured at the platform layer, the frontier labs become wholesale suppliers to a retail counter they do not own. They will have built the best engine in the world and handed the dashboard to someone else.

Steve Whittle
