Same Input, Different Answer
You Didn't Buy Software. You Hired a Consultant.
Most businesses deploying LLMs are treating them as a new category of software. The governance model they’re importing assumes a simple contract: same input produces same output, every time, reliably enough to stake compliance on. That assumption is baked into procurement checklists, SLAs, acceptance testing, and audit trails.
LLMs are probabilistic. This breaks the model.
The setting that doesn’t do what you think it does
The standard response to LLM nondeterminism is to set temperature to zero. That’s supposed to make the model deterministic. It doesn’t.
Researchers at ACL’s Eval4NLP workshop published empirical results showing accuracy variations of up to 15% across runs with supposedly stable settings. The best-to-worst performance gap on some tasks reached 70%. Thread scheduling, parallel inference, mixture-of-experts routing, and server-side model updates all introduce variance that temperature settings cannot touch. Temperature=0 means you asked for determinism. It does not mean you got it.
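You can check this yourself. The sketch below assumes the OpenAI Python client; the model name and prompt are placeholders. It sends the same request twenty times at temperature=0 and counts the distinct completions that come back. On a busy inference backend, don’t be surprised when that count is greater than one.

```python
# Minimal determinism check: same prompt, N times, temperature=0.
# Model name and prompt are placeholders; any chat-completion endpoint works.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

N = 20
prompt = "Summarize our refund policy for a customer who missed the 30-day window."

outputs = []
for _ in range(N):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,         # "deterministic" in name only
    )
    outputs.append(resp.choices[0].message.content)

distinct = Counter(outputs)
print(f"{len(distinct)} distinct outputs across {N} identical requests")
```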
Now chain three agents together, each operating at 90% reliability. Your combined system accuracy is around 73%. Add a fourth agent and you’re down to roughly 66%. Probabilistic error compounds through a pipeline the same way latency does. Princeton’s Language and Intelligence Lab studied real-world agent failures against benchmark performance and found this gap is systematic. Not anomalous.
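The arithmetic behind that claim is nothing exotic. Assuming each agent fails independently, per-step reliability compounds multiplicatively:

```python
# Independent per-agent reliability compounds multiplicatively through a chain.
per_agent = 0.90
for n in range(1, 6):
    print(f"{n} agents: end-to-end reliability = {per_agent ** n:.1%}")
# 3 agents -> 72.9%, 4 agents -> 65.6%, 5 agents -> 59.0%
```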
This is not fixable with better prompting. It’s architectural.
Temperature is just the most visible example of a broader pattern. Every instinct your team has about fixing nondeterminism will reach for a deterministic lever: tighter prompts, stricter validation, more retries, and so on. These are reasonable engineering responses to a problem that is not an engineering problem. The variance is not a bug you can tune out. It is a property of the system. That distinction matters because it changes what a correct response looks like. You can’t fix it. You need to govern it.
The contract nobody disclosed
Enterprise software has always carried an implicit contract: behavior is repeatable, attributable, and auditable. SOX compliance assumes it. SLAs codify it. Acceptance testing verifies it. When your ERP runs payroll, you expect identical outputs for identical inputs. The entire compliance stack is built on that expectation.
When you deploy an LLM into a business process, you break that contract.
Gartner’s AI Hype Cycle 2025 explicitly calls nondeterminism a core enterprise AI risk and projects that more than 40% of agentic AI projects will be canceled by 2027. A separate Gartner prediction from March 2026 goes further: without explainability foundations, GenAI will be restricted to low-risk tasks only.
The internal consequence is concrete. Your finance agent interprets policy differently on Tuesday than it did on Monday. Your audit trail stops being a record and becomes a probability distribution. This Six Sigma Agent paper cites an MIT GenAI Divide Report finding that 95% of enterprise GenAI implementations fail to meet production expectations. That failure rate is not an implementation problem. It’s a mismatch between what the technology is and what organizations expect it to be.
Manage it like professional judgment, not like software
The right response to this is not to lower expectations. That’s too passive and too vague to be useful.
The right move is to adopt a different governance model entirely. One closer to how organizations manage professional judgment than how they manage software.
When you hire a consultant or engage outside legal counsel, you don’t write an SLA that promises identical advice on every engagement. You build in review gates, approval layers, and accountability structures. You treat the output as a recommendation, not a transaction. The answer might differ based on context and so you have a human in the loop before that recommendation becomes an action.
That is the correct mental model for an LLM. The Stochastic Gap paper from researchers at UMBC and MIT formalizes this. Using a Markov framework applied to a real procurement workflow across 251,000 cases, they show that the mismatch between deterministic enterprise workflow assumptions and probabilistic AI behavior is structural, not incidental.
The consultant analogy is useful but it only gets you so far. The more precise frame is professional judgment under institutional accountability. Law firms, audit practices, and medical institutions have spent decades building governance structures for exactly this problem. Expert output varies by practitioner, by day, and by context. It still needs to be defensible, traceable, and bounded by institutional policy.
Those structures share three characteristics that software governance does not.
Output is reviewed before it becomes a commitment. A legal opinion goes through a partner review before it’s sent. A diagnostic recommendation goes to an attending physician before it reaches a patient. The expert produces the output; the institution validates it before it has consequences. Most LLM deployments skip this entirely. The model generates; the system acts. There is no partner review layer.
Variance is bounded by policy, not eliminated by process. You don’t solve variance in professional judgment by making every practitioner identical. You solve it by defining the boundaries within which variation is acceptable. Then you build escalation paths for cases that fall outside them. A tax advisor can give different guidance to different clients with different circumstances. That’s appropriate variance. An LLM giving different legal interpretations to identical queries in the same product is not. The governance question is whether you’ve defined the difference.
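What those first two characteristics look like in code is unglamorous. The sketch below is illustrative only: the policy limits, fields, and thresholds are hypothetical, but the shape is the point. The model’s output stays a draft until it clears defined bounds, and anything outside them routes to a person.

```python
# Illustrative review gate: model output is checked against policy bounds
# before it can trigger a consequential action. All limits are hypothetical.
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    refund_amount: float   # whatever consequential value the agent proposed
    confidence: float      # model- or verifier-reported confidence

POLICY_MAX_REFUND = 500.00   # variance is acceptable below this line
CONFIDENCE_FLOOR = 0.85      # below this, a person decides

def route(draft: Draft) -> str:
    """Auto-approve only inside policy bounds; everything else escalates
    to a human reviewer before it becomes a commitment."""
    if draft.refund_amount > POLICY_MAX_REFUND:
        return "escalate: outside policy bounds"
    if draft.confidence < CONFIDENCE_FLOOR:
        return "escalate: low confidence"
    return "auto-approve"

print(route(Draft("Refund approved per section 4.2", 120.00, 0.93)))  # auto-approve
print(route(Draft("Refund approved per section 4.2", 980.00, 0.97)))  # escalate
```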
Accountability is personal and institutional simultaneously. When a lawyer gives bad advice, liability flows in two directions: to the individual practitioner and to the firm. Enterprise AI deployments are constructing a version of this whether they intend to or not. In 2024, a Canadian tribunal held Air Canada liable for incorrect information its chatbot gave a customer. The airline argued the bot was a separate legal entity outside their responsibility. The tribunal rejected that defense. Courts are already resolving institutional liability for AI output, and they are not resolving it in the operator’s favor. The question is whether your governance structure reflects that reality.
This Agent Drift paper introduces a useful concept here: the Agent Stability Index. It attempts to quantify behavioral consistency across model invocations. Not just whether the model gets the right answer, but whether it behaves in a predictable and bounded way over time. That’s the right unit of measurement for governance. Not accuracy but stability.
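The paper’s exact formulation isn’t reproduced here, but even a crude proxy makes the idea concrete: re-run the same input several times and measure how often the agent commits to the same action.

```python
# Crude stability proxy (not the paper's Agent Stability Index): pairwise
# agreement between the decisions an agent reaches on repeated identical runs.
from itertools import combinations

def stability(decisions: list[str]) -> float:
    """Fraction of run pairs that agree; 1.0 means every run behaved the same."""
    pairs = list(combinations(decisions, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Ten runs of the same query through the same agent:
runs = ["approve", "approve", "deny", "approve", "approve",
        "approve", "deny", "approve", "approve", "approve"]
print(f"stability = {stability(runs):.2f}")  # 0.64: often accurate, not stable
```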
This reframes the deployment decision. The question is not “is this model accurate enough?” Accuracy is necessary but insufficient. The real question is: is this model stable enough, in a defined scope, with defined escalation paths, for this specific process? That’s how you evaluate a new hire in a professional services firm. It should be how you evaluate an LLM deployment.
Before deploying an LLM into any business process, three questions need answers:
Does this process require the same answer every time for legal, compliance, or contractual reasons?
Is there a human review layer between the LLM output and the consequential action?
If the LLM gives two different answers to the same question, which one are you liable for?
If you can’t answer those, you’re not ready to deploy. But there’s also a fourth question: who inside your organization owns the answer to all three? Software procurement has a buyer, a vendor, and a contract. Professional judgment has a supervising principal who is accountable for the output. LLM deployments currently have neither. They have a model, a prompt, and an assumption that someone else is responsible.
That assumption is what the courts are now testing.
The expectation mismatch has standing in court
The Air Canada case makes the point concrete in a way that no whitepaper can.
In Moffatt v. Air Canada (BC Civil Resolution Tribunal, 2024), Air Canada’s chatbot gave a customer incorrect information about bereavement fares. The customer relied on it and tried to claim the discount. Air Canada’s defense was that the chatbot was a “separate legal entity” and not their responsibility.
The tribunal rejected this. Air Canada was held liable for the advice its AI gave.
The company deployed a probabilistic system into a customer-facing context where both the law and the customer assumed determinism. That expectation mismatch wasn’t just an internal governance failure. It had standing in court.
This is the trajectory for any business deploying LLMs into consequential interactions without understanding what they’ve actually built.
We are now in an era where companies are building legal and contractual liability on systems that cannot guarantee the same answer twice. The technology isn’t the problem. The category error is.

