AI Agent Drift
Do you know when your agent starts to go wrong?
I run a competitive research workflow on a regular cadence: same prompt, same tools, same intent. I want to map the landscape, surface new entrants, and track what the competition is up to. The first few runs looked good. Then something started shifting. Not the market; our competitors in this space don’t move that fast. The outputs were different. Different companies surfaced. Different framing of the same players. Different conclusions from largely the same inputs.
The agent hadn’t broken. It was still producing plausible, well-structured research. It just wasn’t producing consistent research.
The gap between an agent that performs well and an agent that performs reliably is not something I hear discussed often.
Why the demo always works
LLMs don’t retrieve. They generate. Every output is a sample from a probability distribution shaped by the prompt, the context window, and the model version. In a single-turn interaction this rarely matters. The variation is small and the output is usually close enough.
Agentic systems change the equation. Outputs become inputs to the next step. Tools get called, state accumulates, and decisions made in step two shape what is possible in step six. Small variations compound. The agent that searched slightly differently in step two is now summarizing different sources in step four. It draws different conclusions in step six. This is not a bug in any conventional sense. It’s the nature of the beast.
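A back-of-envelope model shows why the variation compounds instead of averaging out. The per-step divergence probability below is invented for illustration; the point is the shape of the curve, not the specific numbers.

```python
# Back-of-envelope: if each step independently diverges from the
# previous run with probability p, the chance an n-step agent run
# stays identical to the last one is (1 - p)^n.
def p_identical(p_diverge_per_step: float, n_steps: int) -> float:
    return (1 - p_diverge_per_step) ** n_steps

for steps in (1, 5, 10, 20):
    print(steps, round(p_identical(0.05, steps), 3))
# 1 0.95 | 5 0.774 | 10 0.599 | 20 0.358
```

At a modest 5% per-step divergence, a 20-step agent reproduces its previous run barely a third of the time.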
A demo is a single run. It proves the agent can do the task. It says nothing about whether the agent will do the task the same way tomorrow, after a prompt change, or after the underlying model gets quietly updated by the provider.
Reliability is not keeping pace with capability
Recent research makes this concrete. A January 2026 simulation study on arXiv (arxiv.org/abs/2601.04170) defines agent drift as the progressive degradation of behavior, decision quality, and coherence over extended interactions. It identifies three distinct forms: semantic drift, where outputs deviate from the original intent; coordination drift, where multi-agent coherence breaks down; and behavioral drift, where the agent develops unintended strategies over time. Using theoretical modeling across simulated enterprise workflows, the study projects that unchecked drift could drop task success rates by over 40% and triple human intervention requirements. These are projected figures from simulation, not measured production outcomes. But the underlying framework for why the failure mode compounds rather than plateaus is well-constructed.
A large-scale empirical survey from UC Berkeley, Stanford, UIUC, and IBM Research (arXiv:2512.04123) gives the clearest picture of how practitioners are responding. Of 306 practitioners surveyed, 68% keep their deployed agents to at most 10 steps before human intervention, 70% adjust prompts rather than fine-tune the model, and 74% rely primarily on human evaluation. The researchers frame this as an apparent paradox: reliability is the top development challenge, yet agents are reaching production anyway. The resolution is that teams are not waiting for the reliability problem to be solved. They ship by limiting what agents can do: constrained autonomy, sandboxed environments, internal deployment first. It works. But it’s a workaround, not a solution.
Not all agents carry the same risk
Agents fall into two broad categories, and the risk profile of each is different.
Bounded agents are invoked for a discrete task, produce an output, and hand off to a human or downstream process. Examples include a Cursor session writing a function, or a one-shot document summary. The scope is defined. The output is reviewable. Failures are localized. The risk these agents carry stays roughly constant over time and is bounded largely by human review.
Ambient agents run continuously and make ongoing judgment calls without a hard stop. For example, inbox triage or continuous competitive monitoring. Basically any workflow where the agent decides what matters and acts on it repeatedly, without a human checkpoint between decisions.
My competitive research workflow sits between these two. It is repeatable rather than truly continuous, but the expectation of consistency is the same. When I run it on Monday and again in three weeks, I expect the differences in output to reflect differences in the market, not differences in the agent. That’s not what happened.
McKinsey’s 2025 global survey found that 62% of organizations are at least experimenting with AI agents. Only 23% are scaling one in production. The gap between experimentation and scale is not a capability gap. It is a trust, observability, and governance gap.
If you’re running long-horizon agents, here’s what helps
The research and the emerging vendor landscape have converged on a set of mitigation approaches. None of them eliminates the problem, but each can reduce risk.
Context management. One of the least visible failure modes in long-running agents is context drift. As conversation history grows, reasoning quality degrades before you ever hit a context limit. The industry has settled on episodic consolidation: periodically compressing older context into structured summaries while preserving recent and relevant state. The Agent Drift paper identifies this as one of three mitigation strategies with the strongest theoretical grounding. Anthropic now ships a native compaction API that automates the loop.
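A minimal sketch of that consolidation loop, assuming a hypothetical summarize(messages) LLM call and illustrative thresholds; Anthropic’s compaction API automates a cycle like this for you:

```python
# Episodic consolidation sketch: compress older turns into one
# structured summary message, keep recent turns verbatim.
# `summarize` is a hypothetical LLM call, not a real library API.
KEEP_RECENT = 10       # turns preserved verbatim (tunable)
CONSOLIDATE_AT = 40    # history length that triggers compression

def consolidate(history: list[dict], summarize) -> list[dict]:
    if len(history) < CONSOLIDATE_AT:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary = summarize(old)  # structured summary of the older episode
    return [{"role": "system",
             "content": f"Summary of earlier steps: {summary}"}] + recent
```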
Uncertainty-aware memory. A January 2026 paper from Salesforce AI Research calls the core failure mechanism in long-horizon agents the Spiral of Hallucination. A small grounding error in an early step gets committed to the agent’s context. It then becomes a false premise for every subsequent step. Standard self-reflection does not reliably catch this. The model has already accepted the error as ground truth. The proposed fix flags low-confidence steps before they propagate and triggers correction only when needed. Early results showed meaningful reliability improvements on multi-step benchmarks. This is early research. But it is getting at the cause rather than the symptom.
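The paper’s actual method is more involved, but the general shape looks something like this; the confidence floor and the verify step are assumptions for illustration:

```python
# Uncertainty-aware memory sketch: every step is committed with a
# confidence score, and low-confidence claims are re-verified before
# later steps can treat them as ground truth.
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.7  # illustrative threshold

@dataclass
class MemoryEntry:
    claim: str
    confidence: float   # e.g. derived from token logprobs or a self-rating
    verified: bool = False

def commit(memory: list[MemoryEntry], entry: MemoryEntry, verify) -> None:
    if entry.confidence < CONFIDENCE_FLOOR and not entry.verified:
        entry = verify(entry)  # re-check the claim against sources first
    memory.append(entry)
```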
Checkpointing and interrupt design. Orchestration frameworks like LangGraph have built explicit checkpointing into their execution model. Agents are defined as directed graphs with typed state and hard interrupt points. A human can review, approve, or reset to a known-good checkpoint at any of those points. This converts a brittle autonomous system into a collaborative one. Carnegie Mellon benchmarks published in late 2025 found that leading agents complete only 30-35% of multi-step tasks successfully, which is a strong argument that uninterrupted autonomous execution is not the right default for complex workflows.
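A minimal sketch of the pattern using LangGraph’s documented checkpointing and interrupt APIs (details vary across versions; the research node is a placeholder):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    query: str
    findings: str

def research(state: State) -> dict:
    return {"findings": f"findings for {state['query']}"}  # placeholder step

builder = StateGraph(State)
builder.add_node("research", research)
builder.add_edge(START, "research")
builder.add_edge("research", END)

# Every step is checkpointed; execution pauses before "research"
# until a human resumes, and state can be rewound to any checkpoint.
graph = builder.compile(checkpointer=MemorySaver(), interrupt_before=["research"])
config = {"configurable": {"thread_id": "run-1"}}
graph.invoke({"query": "competitor landscape", "findings": ""}, config)
graph.invoke(None, config)  # human approved: resume from the checkpoint
```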
Golden dataset evaluation. This approach maps most directly to my competitive research problem and our product work. Create a set of representative inputs with human-verified expected outputs. Then run your agent against that dataset on a schedule or before any prompt change goes to production. AWS introduced this at re:Invent 2025 with the general availability of Bedrock AgentCore Evaluations: 13 built-in evaluators, CI/CD pipeline integration for pre-deployment gates, and continuous online evaluation against live production traffic. A demo showed the service detecting tool selection accuracy dropping from 0.91 to 0.3 in production. Without continuous measurement, that degradation is invisible.
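Stripped of any vendor, the gate itself is simple. In this sketch, run_agent, score, the JSONL format, and the threshold are all placeholders for your own agent, evaluator, and tolerance:

```python
# Golden-dataset regression gate: run on a schedule, or in CI before
# any prompt change ships. Blocks the deploy if the mean score drops.
import json

PASS_THRESHOLD = 0.85  # illustrative gate

def regression_gate(golden_path: str, run_agent, score) -> bool:
    with open(golden_path) as f:
        cases = [json.loads(line) for line in f]  # {"input": ..., "expected": ...}
    scores = [score(run_agent(c["input"]), c["expected"]) for c in cases]
    mean = sum(scores) / len(scores)
    print(f"{len(cases)} cases, mean score {mean:.2f}")
    return mean >= PASS_THRESHOLD
```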
Pushpay documented a real production implementation of this pattern. Their golden dataset covers over 300 representative queries with validated responses. It is continuously curated from actual user interactions and fed into an engineering dashboard. The key word is continuously. A golden dataset that does not evolve with your actual workload tests against past state, not current state.
Beyond AWS, the commercial tooling has matured fast. Braintrust ties production traces and offline experiments to the same scorer library. A production regression automatically seeds the next test cycle. LangSmith integrates human annotation queues with trace replay, letting engineers convert production failures into evaluation cases. Arize offers always-on drift detection at the session and span level. For teams with HIPAA or data residency constraints, Langfuse is the strongest self-hosted open-source option. It was acquired by ClickHouse in January 2026, but the open-source codebase remains active.
None of this is free. Building and maintaining a golden dataset requires human judgment to define what “correct” looks like for open-ended tasks. That is genuinely hard when correctness is partly subjective. Dataset rot is a real risk. The infrastructure to run evaluations continuously has real cost. The tooling can solve the infrastructure problem but the curation problem is still yours.
For my competitive research workflow, the approach is well-suited. The expected output structure is defined even if the specific content varies. I know what a well-formed competitive analysis looks like. I can score for completeness, source coverage, and structural consistency without specifying exact content in advance. That is an easier evaluation target than most ambient agent tasks.
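As a sketch, a structural scorer for this kind of report might look like the following. The section names, source minimum, and weights are assumptions for illustration, not my production rubric:

```python
# Structural scorer: checks completeness and source coverage without
# pinning exact content. Section names and MIN_SOURCES are assumed.
REQUIRED_SECTIONS = ["Landscape", "New entrants", "Key players", "Conclusions"]
MIN_SOURCES = 5

def structural_score(report: str, cited_sources: list[str]) -> float:
    hits = sum(s.lower() in report.lower() for s in REQUIRED_SECTIONS)
    completeness = hits / len(REQUIRED_SECTIONS)
    coverage = min(len(set(cited_sources)) / MIN_SOURCES, 1.0)
    return 0.5 * completeness + 0.5 * coverage  # equal weights, tunable
```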
The durability problem
The industry has gotten very good at demonstrating what agents can do. It has not gotten as good, so far, at ensuring they keep doing it the same way.
Gartner projects that 40% of agentic AI projects will fail by 2027. Poor risk controls are cited as a primary cause. That figure will land as a surprise to anyone who has only ever evaluated their agents at a single point in time.
Narrow, monitored, bounded agents are viable today if you build them with that constraint in mind. Always-on autonomous agents are still waiting on better reliability science, better evaluation tooling, and more organizational honesty about the governance they require.
The question worth asking before deploying any agent is not “can it do the task?” It is whether you can tell when it starts doing the task differently than it did before. And whether you would know before your users do.

