When all you have is a Hammer
All problems look like nails
Nobody set out to discover a better architecture for AI systems. It happened because someone typed “write me a function that parses this JSON” and the result just worked.
The LLM ran once. It interpreted the intent, produced code, and exited. The code ran in CI, in production, on a schedule. No LLM in the loop. Nobody called it an “AI agent.” They called it software.
This was not a product decision. It was not an architectural principle. It was the path of least resistance. The developer got what they needed and moved on. The LLM was done before the workflow started.
That accident of framing contains the most important architectural insight in AI systems today. Most teams building agentic workflows have completely missed it.
The hammer is real. The grip is the problem.
The capabilities are genuine. An LLM can interpret ambiguous human intent, navigate unstructured input, and produce structured output from near-infinite input variations. Those are hard problems. LLMs solve them well.
The mistake is assuming that because an LLM can do something at design time, it should keep doing it at every execution. On every single run.
Maslow’s famous observation was that if your only tool is a hammer, every problem looks like a nail. The AI industry’s current posture is a precise instance of this. We have a remarkable tool. We are applying it to every stage of the pipeline, including stages where it has no business being.
The problem is not that LLMs are unreliable. The problem is that probabilistic systems do not belong in deterministic execution paths.
What “probabilistic in the execution path” actually costs
The math is unforgiving. A 20-step agentic workflow where each step succeeds 95% of the time produces an end-to-end success rate of 36%. This is not a model quality problem. Better models improve the per-step number. They do not change the compounding structure. The architecture is the problem.
Kambhampati et al. established the theoretical foundation for this in their ICML 2024 paper on LLM-Modulo frameworks. Auto-regressive LLMs cannot, by themselves, do planning or self-verification. The LLM is a probabilistic knowledge source. Treating it as a deterministic executor is a category error.
Mehta’s recent consistency research on SWE-bench sharpens this further in a way most people have not fully processed. Across 150 agent trajectories, the finding was not that inconsistent models fail more. It was that consistency amplifies outcomes rather than guaranteeing them. Claude achieved 58% accuracy with 15.2% behavioral variance. When it interpreted a task correctly, it succeeded on 100% of runs. When it got the interpretation wrong, it failed on 100% of runs. The same wrong answer, five times in a row, with high confidence.
A model making the same incorrect interpretation on every run is worse than a randomly failing one. The failure is invisible until it compounds, and reflection-based recovery cannot help if the initial interpretation is wrong.
ReliabilityBench (Gupta, 2026) adds the production dimension. Agents achieving 96.9% accuracy on clean benchmarks drop to 88.1% under realistic perturbations. The paper’s broader finding: metrics on clean data overestimate production reliability by 20 to 40 percent. The gap is not a model problem. It is a measurement problem that reveals an architectural one.
Better models improve the per-step number. They do not change the compounding structure.
The pattern developers discovered by accident
Code generation is the canonical LLM-as-compiler case. The LLM interprets intent and emits a deterministic artifact. That artifact runs without the LLM. Nobody designed this as an architecture. It emerged because “write me code” has a natural exit condition: the code either works or it does not.
The insight generalizes far beyond code. A policy document can become an OPA rule set. A process description can become a Temporal workflow definition. Decision criteria can become a scoring model. The LLM’s job in each case is translation from human intent to a deterministic artifact. Once the artifact exists, the LLM is done.
Khattab et al.’s DSPy framework from Stanford is the explicit formalization of this idea. Compile the prompt and retrieval logic into an optimized, repeatable program rather than invoking raw LLM calls at runtime. The artifact owns the execution. The LLM’s work happens once, at design time.
The reason this pattern is underused outside of engineering contexts is not that it does not apply. It is that business users do not yet have the intuition that “the AI is finished” is a valid and preferable state. Finished means auditable. Finished means version-controlled. Finished means you can write a test for it.
When the LLM does belong in the path
There is a class of workflow where LLM-as-compiler fails. The failure mode is unstructured with novel input at execution time.
Customer support is the clearest case. You can compile decision trees for known issue categories. You cannot compile a response to a customer describing a problem you have never seen, in language you cannot predict, with emotional context that changes the appropriate answer.
The same logic applies to open-ended document classification where the category space is not closed; to real-time anomaly detection on free text; to any workflow where the output depends on semantic understanding of content that did not exist when the workflow was built.
The LLM belongs in the execution path when the input space at runtime is genuinely open-ended. That is a much smaller category than most people think.
The question is not “can an LLM handle this at runtime?” The question is “does this workflow require runtime interpretation, or was the interpretation problem already solved at design time?”
The decision rule
Can you fully describe the execution logic as a flowchart, rule set, or decision tree before the workflow runs?
If yes, the LLM’s job is to build that artifact. Not to approximate it on every execution.
If the answer is no, because the logic depends on content that cannot be anticipated or because the decision space is genuinely open, then the LLM belongs in the path.
The key to reliability is to apply this test at all times and minimize LLM surface area at runtime. Not because LLMs are bad. Because deterministic systems are auditable, testable, version-controllable, and do not compound errors across steps.
Most teams are not asking this question. They are asking “how do we make the agent more reliable?” That is the wrong starting point.
If you are running an LLM on every execution of a workflow that could have been compiled into a deterministic artifact, you have not built an AI system. You have built a reliability problem with an AI-shaped interface.
The first question to ask is not whether your LLM is good enough. It is whether it needs to be there at all.

