Context Rot

Not all context is created equal

Jun 04, 2026

Morph LLM published a study in March 2026 with a claim that every frontier model they tested gets worse at reasoning as you give it more to read. The output quality drops in a way you can measure and reproduce, and it happens on the exact models you are paying for. The instinct is to file this under “models are unpredictable” and move on. That instinct is the mistake this post is about.

Let’s start with the symptom, because you have probably already seen it. A multi-stage agent runs cleanly through step three. It reasons correctly, calls the right tools, returns what you expected. By step twelve the same agent, on the same task type, with the same model and the same prompt pattern, is producing degraded output. Nothing about the task got harder. Something about the context got heavier. That gap, between a system that works early and the same system failing late, is what I want to talk about. It’s called context rot, and it’s architectural, which is the part that matters. You can’t upgrade your way out of an architectural property.

The spec sheet measures the wrong thing

When a vendor prints 128K on the box, that number is a ceiling on what the model will accept. It says nothing about what the model can reason over reliably. Those are different quantities, and the gap between them is enormous.

Paulsen and colleagues gave the distinction names in September 2025. The Maximum Context Window is the spec sheet figure. The Maximum Effective Context Window is the point where accuracy meaningfully degrades for a given task. They found that some top models degraded severely by 1,000 tokens. A few broke down with as little as 100 tokens in context. Across everything they tested, the effective window fell short of the advertised one by as much as 99 percent.

That is not a rounding error. At the extreme end of that range, a 128K window delivers reliable reasoning over something closer to 1K on a hard task. The other 127K is tokens the model will dutifully process but cannot use well.

The effective limit also moves with the job. A flat retrieval task and a multi-hop reasoning task over the same corpus have different effective windows, because reasoning over scattered evidence stresses attention harder than fetching one fact. So there is no single number to memorize. You have to measure the effective window on your own workload.

The RAM comparison is exact, not loose. A machine with 128GB of memory does not run every workload 128 times faster than a 1GB machine. The number is capacity, not throughput. Context windows are the same. We have been reading a capacity spec as a performance spec, and the vendors are happy to let us.

Why attention thins out

Three mechanisms compound as context grows. You want the mechanisms, not just the result, because the mechanisms tell you what to do.

Position is not neutral. Liu and colleagues at Stanford showed transformer attention follows a U-shape across token positions. The model attends strongly to the start and the end of its context and weakly to the middle. In a 20-document QA task, accuracy dropped more than 30 percent when the relevant document sat in positions 5 through 15 versus position 1 or 20. The same fact, the same model, moved a few thousand tokens inward, and the model half-forgets it is there. Models trained explicitly on long context still showed it. So the first thing to internalize: where a token sits changes how well the model can use it.
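One way to act on the U-shape is to keep your best evidence out of the middle. The sketch below is an illustration, not something the study prescribes: it assumes you already have documents ranked most-relevant first (the ranking step is yours), and the function name place_at_edges is made up for this example. It alternates documents between the front and the back so the weakest ones land in the center.

```python
def place_at_edges(docs_by_relevance: list) -> list:
    """Reorder documents so the most relevant sit at the start and end.

    Input must be sorted most-relevant first. Even ranks fill the front,
    odd ranks fill the back in reverse, so the least relevant documents
    end up in the middle of the context.
    """
    front, back = [], []
    for rank, doc in enumerate(docs_by_relevance):
        if rank % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


# Hypothetical ranked list: "d1" is the most relevant, "d5" the least.
ordered = place_at_edges(["d1", "d2", "d3", "d4", "d5"])
print(ordered)
```

With five ranked documents, the two strongest end up first and last and the weakest sits in position three. Whether this helps on your task is something to measure, not assume.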

Attention is a fixed budget split more ways. The attention mechanism distributes a finite amount of probability mass across every token in context. Add tokens and each one gets a thinner slice. This is not a tuning artifact. It falls out of the softmax math. More context does not mean more attention to go around. It means the same attention spread thinner, and past some point the model can no longer cleanly separate the signal token from the noise around it.
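A toy calculation makes the dilution concrete. This is a deliberately idealized sketch: it assumes a single attention row with one signal token whose logit sits a fixed gap above a crowd of identical noise tokens. Real attention logits vary by head, layer, and content, so treat the numbers as the shape of the effect, not a measurement.

```python
import math


def signal_weight(n_tokens: int, logit_gap: float = 5.0) -> float:
    """Softmax weight on one signal token among n_tokens - 1 noise tokens.

    Assumes every noise token has logit 0 and the signal token has logit
    `logit_gap`. Both values are illustrative assumptions.
    """
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    signal = math.exp(logit_gap)
    noise = float(n_tokens - 1)  # exp(0) == 1 for each noise token
    return signal / (signal + noise)


for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> signal weight {signal_weight(n):.4f}")
```

Under these assumptions the signal token holds about 0.60 of the attention weight at 100 tokens and less than 0.002 at 100,000. Nothing about the signal changed. Only the crowd around it grew.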

The curve flattens early. A model reading 100K tokens is not getting ten times the benefit of one reading 10K. Positional undertraining and the constraints of rotary position embeddings mean effective utilization grows far slower than the token count. The useful curve bends down well before the nominal limit. You are paying linearly for tokens whose value is already sub-linear.

Put the three together and the picture is not “long context is a bit lossy.” It is that the middle of a large context is a place where information goes to be underweighted, and you are filling it on purpose.

Agent loops manufacture their own noise

Static document Q&A is forgiving. You load a corpus once and ask questions. Agent loops are much less forgiving because they generate the very noise that rots them.

Watch what accumulates in a long-running loop. Verbose tool outputs, most of which were only partially relevant. Reasoning paths the model explored and abandoned. Intermediate states that are no longer true. And the one almost everyone forgets: prior turns where the user chose between options.

That last one deserves a deeper look, because it is the clearest illustration of the whole problem. The model offers two options. You pick the first. The second option is not dropped. It sits in context carrying the same weight as the path you chose. Ten turns later the model is reasoning over a growing pile of decisions you already killed. The dead ends do not decay. They sit alongside the live thread, drawing from the same thinning attention budget, and the model has no way to know they are dead unless you tell it.

This is why Morph LLM’s work, which folded in Chroma’s research on coding agents, landed on context rot as the primary failure mode for agentic coding. Not model capability. Not reasoning quality. The models are capable enough. The context they are handed is too noisy, and most of the noise is self-inflicted.

The diagnostic is precise once you know where to look. Degradation that scales with session length rather than task complexity is context rot. If your agent clears step three and fails step twelve with no change in task type, stop tuning the prompt. You are not looking at a capability problem. You are looking at a hygiene problem.

The two curves cross

Everything so far is about quality. The cost side is where it stops being an abstraction and starts showing up on the bill.

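The KV cache grows with every token in context, and the attention work done over it grows faster than the token count: in a standard transformer, prefill cost scales roughly with the square of the sequence length. Latency climbs. VRAM consumption grows. Throughput falls. At high token volumes the combined effect is not gentle.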

Think about how that pairs with the quality finding. The cost per useful unit of work is rising at exactly the point where each token is delivering less. This is not ordinary diminishing returns, where you pay the same for a bit less. The cost curve bends up while the quality curve bends down. Those two curves cross. Past the crossing point you are paying more to get worse output, and nothing in the system will let you know you’ve crossed it.

Mixture-of-Experts architectures make the diagnosis harder, not easier. They can hide this dynamic underneath infrastructure bottlenecks, so the system looks healthy right up until it is obviously not.

Manage it like memory, because that is what it is

The bottom line is that context is a resource with a budget, and right now most systems don’t see it that way. Here are three things you can do:

Budget context against your effective window, not the spec sheet. Estimate the effective window for your task type and set a token budget per loop or session against that number, not the advertised one. When you hit 60 to 70 percent of the budget, trigger a checkpoint. Make the checkpoint a structured state object, not a prose summary: decisions made, options rejected, current state. Then continue from the checkpoint instead of the full transcript. You are choosing what survives into the next turn rather than letting accumulation choose for you.
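Here is a minimal sketch of that budget-and-checkpoint loop. Everything named in it is hypothetical: the ContextBudget and Checkpoint classes, the 0.65 trigger (picked from inside the 60 to 70 percent band), and especially estimate_tokens, which uses a crude four-characters-per-token guess that you should replace with your model's real tokenizer.

```python
import json
from dataclasses import asdict, dataclass, field


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token). An assumption only."""
    return max(1, len(text) // 4)


@dataclass
class Checkpoint:
    """Structured state that survives into the next turn."""
    decisions_made: list = field(default_factory=list)
    options_rejected: list = field(default_factory=list)
    current_state: dict = field(default_factory=dict)

    def render(self) -> str:
        return json.dumps(asdict(self), indent=2)


@dataclass
class ContextBudget:
    effective_window: int               # measured for your task type
    checkpoint_fraction: float = 0.65   # trigger point within the budget
    used: int = 0

    def add(self, text: str) -> None:
        """Record text that was appended to the working context."""
        self.used += estimate_tokens(text)

    def should_checkpoint(self) -> bool:
        return self.used >= self.effective_window * self.checkpoint_fraction

    def restart_from(self, checkpoint: Checkpoint) -> str:
        """Replace the transcript with the checkpoint and reset usage."""
        context = checkpoint.render()
        self.used = estimate_tokens(context)
        return context


# Usage with placeholder values.
budget = ContextBudget(effective_window=1_000)
budget.add("x" * 2_800)  # stands in for accumulated transcript text
if budget.should_checkpoint():
    state = Checkpoint(
        decisions_made=["use option A"],
        options_rejected=["option B"],
        current_state={"step": "12"},
    )
    context = budget.restart_from(state)
```

The design choice that matters is that restart_from throws the transcript away. The next turn sees only the fields you chose to keep, which is the point of using a structured object instead of a prose summary.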

Prune the dead weight. A tool call that returned nothing useful should be stripped or compressed before the next turn. A rejected option should collapse to a single line. The model does not need the full text of a path it is not taking, and every token you remove is a token that can no longer dilute attention on the tokens that matter. Do this as part of the loop.
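A pruning pass can be a few lines run between turns. The sketch below assumes a hypothetical Turn record with a kind label and two flags, useful and rejected, that your own loop sets. How you decide a tool result was useless is left to you. The pass drops useless tool results and collapses each rejected option to a single marked line.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    kind: str               # "message", "option", or "tool_result" (hypothetical labels)
    text: str
    useful: bool = True     # set by your own relevance check
    rejected: bool = False  # set when the user picks a different option


def prune(history: list, max_line: int = 80) -> list:
    """Return a copy of history with dead weight stripped or collapsed."""
    pruned = []
    for turn in history:
        if turn.kind == "tool_result" and not turn.useful:
            continue  # strip tool output that contributed nothing
        if turn.kind == "option" and turn.rejected:
            lines = turn.text.strip().splitlines()
            summary = lines[0][:max_line] if lines else ""
            # Keep one line so the model knows this path is dead.
            pruned.append(Turn("option", f"REJECTED: {summary}", rejected=True))
            continue
        pruned.append(turn)
    return pruned
```

Keeping a one-line tombstone for rejected options, instead of deleting them outright, tells the model the decision was made without handing it the full text of the path not taken.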

Measure degradation as a function of length, not just at turn one. Run your evals at turns 5, 10, and 20, not only at the start. Watch accuracy as context grows on your specific task. If you cannot state your effective window as a number, you do not know where your system fails, which means you are flying blind.
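Turning those eval runs into a number can be this simple. The sketch assumes you have recorded accuracy at several context sizes (for example, the token counts your agent reaches at turns 5, 10, and 20) and defines the effective window as the largest measured size whose accuracy stays within a tolerance of the smallest-context baseline. That definition, the 0.05 tolerance, and the sample numbers are all assumptions for illustration, not figures from any study.

```python
from typing import Optional


def effective_window(accuracy_by_size: dict, max_drop: float = 0.05) -> Optional[int]:
    """Largest measured context size before accuracy degrades.

    accuracy_by_size maps context size in tokens to measured accuracy.
    The baseline is the accuracy at the smallest measured size. The scan
    stops at the first size whose accuracy falls more than max_drop
    below that baseline.
    """
    if not accuracy_by_size:
        return None
    sizes = sorted(accuracy_by_size)
    baseline = accuracy_by_size[sizes[0]]
    window = None
    for size in sizes:
        if baseline - accuracy_by_size[size] > max_drop:
            break
        window = size
    return window


# Placeholder measurements, invented for illustration only.
measured = {2_000: 0.92, 8_000: 0.90, 16_000: 0.88, 32_000: 0.71}
print(effective_window(measured))
```

Run it per task type. A retrieval eval and a multi-hop eval over the same corpus should be expected to return different windows.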

A bigger window is not the solution. The Morph LLM data spans GPT-4.1, Claude Opus 4, and Gemini 2.5 Pro. Every one degrades. The window sizes differ. The architectural property does not. A larger window buys you more room to accumulate noise, not more reasoning.

What the vendors are not selling you

Context window size has become the spec that vendors compete on, the way hardware vendors once competed on raw clock speed. The number is real. What it implies about useful work is a different claim, and it’s the claim you don’t tend to hear about.

The engineers who build reliable long-context systems will not be the ones with the largest windows. They will be the ones who treat context the way a systems engineer has always treated memory: as a scarce resource with budgets, checkpoints, and active reclamation. The model will not do this for you. It cannot garbage collect its own context. It has no concept of which tokens are dead. That judgment, what to keep and what to get rid of, is the part that does not come in the box, and it’s up to you to figure out.

Steve Whittle
