Does AI fit in the workflow?

Not every workflow needs it

May 26, 2026

In October 2023, New York City launched an AI chatbot to help small business owners navigate city regulations. The mayor’s office called it a frictionless doorway into City Hall. Five months later, investigative journalists at The Markup tested it and found it was confidently telling landlords they did not have to accept Section 8 housing vouchers. That is illegal in New York City. The same bot told employers they could take a cut of their workers’ tips. Also illegal. When ten journalists asked the same housing question independently, all ten got the same wrong answer.

The city added a disclaimer. The mayor defended the chatbot and kept it running. The next administration eventually called it “functionally unusable” and moved to shut it down. The bill for building it: roughly $600,000.

Nobody asked the right question before launch. Not “can we build this” but “does a probabilistic text system belong in the execution path of legal regulatory guidance, where there is exactly one correct answer per question and the person asking has no way to know when they’ve received the wrong one.”

That question, whether AI belongs in a given workflow at all, is the one most organizations skip. They go straight to “how do we make this work” and treat the premise as settled. It isn’t always.

The question teams are not asking

When an AI deployment struggles in production, the instinct is to reach for better models, more data, or tighter prompt engineering. The conversation becomes technical quickly. That is the wrong reflex.

The right first question is simpler and more uncomfortable: should AI be in this workflow at all?

Not “can it be done” but “should it be done here, in this position, with these consequences attached.”

Many teams never ask this. Gartner projects that 40% of enterprise applications will feature task-specific AI agents by 2026. MindStudio found that only 23% of companies run fully autonomous agent systems despite 88% already applying AI somewhere. The gap between “we use AI” and “we have thought carefully about where AI runs” is huge. That gap is where the expensive mistakes live.

Blast radius is the first variable

The first question to ask about any workflow is: what happens when this output is wrong?

Not if. When. Every probabilistic system produces wrong answers. The question is what those wrong answers cost.

A wrong answer in marketing copy gets caught in review and fixed in ten minutes. A wrong answer in a financial reconciliation can trigger a downstream payment workflow before any human sees it. A wrong answer in a regulatory guidance tool may prompt a landlord to illegally reject a tenant, with no visibility until a fair housing complaint lands.

This is blast radius. It is not a measure of how often the system fails. It is a measure of how bad a single failure can get. High blast radius workflows require a different standard of scrutiny before AI gets anywhere near the execution path.

The NYC chatbot did not fail because the underlying model was bad. It failed because the blast radius of wrong legal guidance is real harm to real people. The city had no way to know how many business owners had already acted on the bad advice before The Markup published its findings.

Detectability is the second variable

The second question is: how would you know the output was wrong, and how quickly?

This one is underweighted in almost every AI deployment conversation I have seen. Teams spend enormous effort on accuracy and almost no effort on error visibility.

Some errors are self-evident. A generated email that addresses the recipient by the wrong name gets caught before it sends. Some errors are invisible. A landlord who received incorrect guidance from an official city tool has no reason to doubt it. The tool carried government branding. It answered confidently. The correct answer was on a different page of the same website, and nobody told them to go look there.

The combination that kills enterprises is low detectability paired with high blast radius. That is the dangerous quadrant. Errors that are hard to see and expensive when they land. That is where autonomous agents are being deployed right now at scale.

The EU AI Act, phased in between 2024 and 2027, is not primarily about model accuracy. It is about auditability. Regulators are not asking whether your AI is usually right. They are asking whether you can prove what it did and why. That is a detectability requirement disguised as a compliance requirement.

Variance is only a feature in the right context

The third question is: does the right answer change between runs, or is there one correct answer?

Large language models are non-deterministic by design. They produce different outputs given the same input. For some tasks that variance is the point. Generating five options for a campaign headline and picking the best one is a workflow where variance creates value.

For other tasks variance is a defect. A financial reconciliation has one correct answer. A patient’s medication dosage has one correct answer. A contract clause either complies with local law or it does not.

The NYC chatbot illustrated this with unusual clarity. Whether a landlord is legally required to accept a housing voucher is not a matter of interpretation. There is a correct answer. Yet the bot’s answers varied from one session to the next, and on the voucher question it gave the wrong answer to all ten journalists who asked it independently. A system producing variance in a zero-variance workflow is not a system having a bad day. It is a system that was never fit for the task.

Variance only creates value where the space of acceptable outputs is large and the evaluator is a human who can use judgment to select. Anywhere the evaluator is a downstream system, a database constraint, or a person who treats official guidance as authoritative, variance is a bug.
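One way to make variance visible is to ask the same question several times and count the distinct answers, much as the ten journalists did by hand. The sketch below is illustrative only: `ask` is a hypothetical stand-in for whatever function calls your model, and the lowercase-and-strip normalization is an assumption that would be too crude for free-form text.

```python
from collections import Counter
from typing import Callable


def answer_spread(ask: Callable[[str], str], question: str, runs: int = 10) -> Counter:
    """Ask the same question `runs` times and count each distinct answer.

    `ask` is a hypothetical callable wrapping the model under test.
    Answers are normalized by stripping whitespace and lowercasing.
    """
    return Counter(ask(question).strip().lower() for _ in range(runs))


def is_stable(spread: Counter) -> bool:
    """True only when every run produced the same normalized answer."""
    return len(spread) == 1
```

A stable result is not a correct one. Ten identical wrong answers pass this check, so in a zero-variance workflow it can only rule a system out, never in.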

Downstream systems change the risk profile entirely

The fourth question is: is there a system consuming this output directly, without human review in between?

This is the difference between a soft consumer and a hard consumer. A human reading an AI-generated summary is a soft consumer. They bring judgment. They catch obvious errors. They can push back.

An API endpoint that ingests an AI output and triggers an action is a hard consumer. It has no judgment. It cannot catch errors. It will execute faithfully on whatever it receives.

The NYC chatbot’s consumer was human but effectively a hard consumer. A small business owner consulting an official government tool on a compliance question is not going to fact-check the answer. The government branding collapsed the soft consumer into something that behaved like a hard one. The output fed directly into a decision with no meaningful review layer in between.

When AI feeds a hard consumer, the error characteristics of the system change completely. Errors no longer surface through human review. They surface through system failures, audit findings, or in this case, investigative journalism.

Compliance exposure is not a legal problem

The fifth question: is this workflow subject to audit or regulatory review?

This gets treated as a legal team concern. It is not. It is an engineering and product concern that shows up as a legal problem later.

Regulated workflows require explainability. They require audit trails. They require the ability to reconstruct exactly what the system did and on what basis. Current LLM architectures are not designed to produce that by default. You can build it in, but it requires deliberate design choices made before deployment, not retrofitted after a regulator asks a question.
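As a rough illustration of what building it in can look like, here is a minimal sketch of an audit entry written for every model call. The field names, the `model_version` string, and the `sources` list are all assumptions chosen for the example, not a regulatory schema; a real deployment would take its required fields from the applicable rules.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(prompt: str, output: str, model_version: str, sources: list[str]) -> dict:
    """Build one audit entry describing a single model call.

    All field names are illustrative. `sources` is assumed to hold
    identifiers of the documents the answer was based on.
    """
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "prompt": prompt,
        "output": output,
        "sources": sources,
    }
    # A digest of the canonical JSON gives each entry a stable fingerprint
    # that can be recomputed later and compared.
    canonical = json.dumps(body, sort_keys=True).encode("utf-8")
    body["sha256"] = hashlib.sha256(canonical).hexdigest()
    return body
```

The point of the sketch is timing: the record is created at the moment of the call, because the exact prompt, output, and model version cannot be reconstructed reliably afterward.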

The EU AI Act classifies certain AI uses as high-risk. Legal and regulatory guidance is squarely in that scope. High-risk classification does not mean prohibited. It means the compliance burden is significant enough that you need to answer all five questions before deployment, not after the headline.

The real question behind the questions

Run these five questions against any workflow you are evaluating for AI automation:

What happens if this output is wrong? (blast radius)
How would you know it was wrong, and how quickly? (detectability)
Does the right answer change between runs or is there one correct answer? (variance tolerance)
Is there a downstream system or a trusting human consuming this output directly? (soft vs. hard consumer)
Is this workflow subject to audit or regulatory review? (compliance exposure)

A workflow with high blast radius, low detectability, zero variance tolerance, a hard consumer, and regulatory exposure is not a workflow where you deploy an AI agent today.
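The five questions can be written down as a small checklist so the answers are recorded rather than assumed. This is a sketch under stated assumptions: the class name, field names, and the wording of each recommendation are hypothetical, and only two rules come from the text above (all five flags means do not deploy, and high blast radius with low detectability is the dangerous quadrant). The remaining branches are placeholders.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WorkflowAssessment:
    """Yes/no answers to the five questions for one workflow (names are illustrative)."""
    high_blast_radius: bool        # a single wrong output can cause serious harm
    low_detectability: bool        # errors are unlikely to be noticed quickly
    zero_variance_tolerance: bool  # there is exactly one correct answer
    hard_consumer: bool            # a system or trusting human acts on output unreviewed
    regulatory_exposure: bool      # subject to audit or regulatory review

    def risk_flags(self) -> list[str]:
        """Return the names of the questions answered as a risk, in field order."""
        return [f.name for f in fields(self) if getattr(self, f.name)]


def recommend(a: WorkflowAssessment) -> str:
    """Map an assessment to a coarse recommendation (thresholds are assumptions)."""
    flags = a.risk_flags()
    if len(flags) == len(fields(a)):
        # All five flags set: the case the text rules out.
        return "do not deploy an AI agent today"
    if a.high_blast_radius and a.low_detectability:
        return "dangerous quadrant: needs scrutiny before AI enters the execution path"
    if flags:
        return "review before deploying: " + ", ".join(flags)
    return "no risk flags raised"
```

Scored this way, the NYC chatbot would be `WorkflowAssessment(True, True, True, True, True)`, while a campaign-headline generator reviewed by a person would raise no flags.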

Steve Whittle
