Counting Tokens
Optimizing token usage in applications
I’ve been building AI-powered features for the past several months, and one thing that became very clear very quickly is that token costs can’t be an afterthought. They’re a design constraint. When you’re running a healthcare platform with RAG pipelines, context-heavy prompts, and multiple LLM calls per user interaction, the bill is a direct reflection of how carefully you thought about what you’re actually sending to the model.
Most of the conversations I’ve seen about LLM cost optimization focus on the easy answer: pick a cheaper model. That’s the wrong starting point. Before you reach for a smaller model, you need to understand where your tokens are actually going. Otherwise, you’re optimizing blind.
First, understand your token budget
The single most useful shift in mindset is moving from “which model is cheaper?” to “how many tokens does this job actually need?”
Every LLM call has a token budget made up of roughly four buckets: system prompt, user/conversation input, retrieved context (if you’re doing RAG), and model output. Most engineering teams can tell you which model they’re on. Not many of them can tell you what percentage of their token spend is coming from each bucket, per endpoint, per feature, per user segment.
That’s the first thing to fix. Instrument your app to log tokens in and out at the endpoint level. You’ll almost always find that 20% of your calls are responsible for 80% of your token spend, and those hotspots are where optimization actually pays off. Spending a week shaving 10% off a low-volume admin workflow is a distraction. Shaving 30% off your highest-volume user-facing flow is much more meaningful.
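A minimal version of that instrumentation is just a counter keyed by endpoint. Here’s a sketch that assumes an OpenAI-style response object exposing usage.prompt_tokens and usage.completion_tokens; the reporting helper and endpoint names are placeholders.

```python
# Minimal per-endpoint token accounting. Assumes an OpenAI-style response
# object with usage.prompt_tokens / usage.completion_tokens; adapt to your provider.
from collections import defaultdict

token_stats = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0})

def log_token_usage(endpoint: str, response) -> None:
    usage = response.usage
    token_stats[endpoint]["calls"] += 1
    token_stats[endpoint]["input"] += usage.prompt_tokens
    token_stats[endpoint]["output"] += usage.completion_tokens

def hotspots(top_n: int = 5):
    # Rank endpoints by total token volume to find the 20% of calls
    # driving 80% of spend.
    return sorted(
        token_stats.items(),
        key=lambda kv: kv[1]["input"] + kv[1]["output"],
        reverse=True,
    )[:top_n]
```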
The input side
Once you know where the tokens are going, the levers on the input side are fairly well-understood. The question is which ones are worth pulling.
Tighten your prompts. This sounds obvious, but most system prompts accumulate cruft over time: redundant framing, hedging language, politeness conventions the model doesn’t need. Audit yours with fresh eyes. “Summarize in 3 bullets” costs fewer tokens than a paragraph explaining what a good summary looks like. Directive beats descriptive.
Split your system prompt. If you have a large system prompt covering multiple capabilities, stop sending all of it on every call. Break it into a core section and optional sections, and load only what’s relevant for the task at hand. A prompt that’s always 3,000 tokens because it contains instructions for five features, when any given call only needs one of them, is something you can fix.
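A sketch of what that split can look like, with made-up capability names and instructions:

```python
# A small always-on core plus optional capability sections, assembled per call.
CORE = "You are the assistant for our patient portal. Be accurate; never guess."

SECTIONS = {
    "summarize": "When summarizing, use at most 5 bullet points.",
    "triage": "Classify each message as routine, urgent, or emergency.",
    "billing": "Answer billing questions only from the provided policy excerpts.",
}

def build_system_prompt(task: str) -> str:
    # Only the section this call actually needs rides along.
    return CORE + "\n\n" + SECTIONS[task]
```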
Tune your RAG retrieval. RAG is great, but undisciplined RAG drags enormous amounts of context into every call. Smaller chunk sizes, fewer retrieved chunks, a reranker to prioritize the most relevant passages, and deduplication to eliminate near-identical chunks all compound. You often don’t need half the document. You need the three paragraphs that actually answer the question.
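A disciplined retrieval step looks roughly like this. vector_search, similarity, and rerank are stand-ins for whatever your retrieval stack provides:

```python
# vector_search, similarity, and rerank are placeholders for your own stack.
def retrieve_context(query: str, k_final: int = 3) -> list[str]:
    candidates = vector_search(query, k=20)        # over-fetch, then prune

    deduped: list[str] = []
    for chunk in candidates:
        # Drop near-duplicates of chunks we've already kept.
        if all(similarity(chunk, kept) < 0.95 for kept in deduped):
            deduped.append(chunk)

    # Rerank against the query and keep only the best few passages.
    return rerank(query, deduped)[:k_final]
```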
Prune conversation history. Long chat threads accumulate tokens fast. Rather than passing the full conversation on every turn, summarize earlier turns and keep a running state-of-the-world summary plus the last N interactions. The model doesn’t need a verbatim transcript of everything that happened three turns ago.
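A sketch of that pattern, where summarize_turns stands in for a cheap summarization call:

```python
# Keep a running summary plus the last N turns instead of the full transcript.
# summarize_turns() is a placeholder for a cheap summarization model call.
LAST_N = 6

def build_messages(turns: list[dict], running_summary: str) -> tuple[list[dict], str]:
    older, recent = turns[:-LAST_N], turns[-LAST_N:]
    if older:
        # Fold anything outside the window into the summary rather than
        # replaying it verbatim on every call.
        running_summary = summarize_turns(running_summary, older)
    preamble = {"role": "system", "content": f"Conversation so far: {running_summary}"}
    return [preamble] + recent, running_summary
```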
Preprocess before tokenizing. Strip content before it hits the model. Repeated disclaimers, HTML chrome, email signatures, standard headers. Anything that’s structurally consistent and informationally irrelevant to the task should be removed in code, not sent to the model and silently ignored.
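In practice this is a pass of regexes or HTML cleanup that runs before the prompt is assembled. A sketch, with made-up patterns:

```python
import re

# Patterns for structurally consistent noise; yours will be domain-specific.
BOILERPLATE = [
    r"(?s)<script.*?</script>",
    r"(?s)<style.*?</style>",
    r"(?s)This email and any attachments are confidential.*",  # legal footer
    r"(?ms)^--\s*$.*",                                          # signature block
]

def strip_boilerplate(text: str) -> str:
    for pattern in BOILERPLATE:
        text = re.sub(pattern, "", text)
    # Collapse the blank runs the removals leave behind.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```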
The output side
Output tokens get less attention, but they’re directly controllable, and they’re typically billed at a higher per-token rate than input.
Explicitly bound your responses. “At most 5 bullet points” or “under 150 words” is more effective than “be concise.” The latter is a suggestion; the former is a constraint. Use structured outputs (JSON, fixed schemas) wherever the output is machine-consumed. If you only need a label, a boolean, or a category, don’t ask for prose.
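As an example, a classification call can be pinned to a handful of output tokens. This sketch assumes an OpenAI-style chat API; the model name is a placeholder:

```python
def classify_ticket(client, text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder; use whatever you route this task to
        max_tokens=5,          # a label needs a few tokens, not a paragraph
        messages=[
            {
                "role": "system",
                "content": 'Reply with exactly one word: "billing", "technical", or "other".',
            },
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip().lower()
```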
For user-facing answers, consider a layered approach: get a compact answer first, and only fetch a longer explanation on demand. Most users don’t need the full response most of the time.
The infrastructure layer
Beyond prompt engineering, there are a few architectural levers that can meaningfully change the economics.
Prompt caching. If your provider supports it, turn it on. Repeated system prompts and long shared documents shouldn’t be re-billed on every call. This is one of the highest-leverage, lowest-effort optimizations available. The work is mostly configuration, not engineering.
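With Anthropic’s API, for instance, the work is mostly marking the stable prefix as cacheable; some providers cache repeated prefixes automatically. A sketch, with a placeholder model name and prompt:

```python
response = client.messages.create(
    model="claude-sonnet-4-20250514",            # placeholder model name
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,           # stable across calls
            "cache_control": {"type": "ephemeral"},  # mark this prefix as cacheable
        }
    ],
    messages=[{"role": "user", "content": user_question}],
)
```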
Semantic caching. For support questions, FAQs, and any domain with high query repetition, a semantic cache at the application layer can eliminate a large percentage of redundant model calls entirely. Run vector similarity against past queries, then reuse or lightly edit the cached response.
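At its simplest, it’s an embedding lookup in front of the model call. In this sketch, embed and call_model are stand-ins, and the threshold needs tuning against your own traffic:

```python
import numpy as np

# embed() and call_model() are placeholders for your embedding model and LLM call.
cache: list[tuple[np.ndarray, str]] = []   # (query embedding, cached answer)
THRESHOLD = 0.92                           # tune against real traffic

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query: str) -> str:
    q = embed(query)
    for vec, cached in cache:
        if cosine(q, vec) >= THRESHOLD:
            return cached                  # no model call at all
    fresh = call_model(query)
    cache.append((q, fresh))
    return fresh
```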
Model routing. Not all calls are equal. Route simple, well-defined tasks to smaller, cheaper models by default. Reserve the premium models for complex reasoning, high-stakes content, or tasks where quality differences are genuinely user-visible. The key is measuring this. A cheaper model that needs retries or follow-up calls can easily cost more than the premium model would have.
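The routing itself can be a simple lookup; the hard part is the measurement around it. A sketch with placeholder task and model names:

```python
# Default simple tasks to the small model; escalate only when it matters.
CHEAP, PREMIUM = "small-model", "premium-model"

SIMPLE_TASKS = {"classify", "extract_fields", "detect_language"}

def pick_model(task: str, high_stakes: bool = False) -> str:
    if high_stakes:
        return PREMIUM
    return CHEAP if task in SIMPLE_TASKS else PREMIUM
```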
What doesn’t work
A few things that seem like good ideas and aren’t:
Adding “be concise” to every prompt and expecting large savings. It helps a little. The variance is high and the model will still expand if the rest of your prompt gives it room to.
Blindly truncating inputs. “Just take the first 4K tokens” sounds pragmatic but often drops critical context and degrades quality in ways that are hard to notice until something goes wrong.
Naive string compression: stripping punctuation, collapsing whitespace. This makes text less legible to the model and can actually increase token count due to how subword tokenization works. The model’s tokenizer and your intuitions about “shorter text” don’t always agree (there’s a quick tokenizer check below).
Swapping to a cheaper model without measuring tokens-per-task and quality first. A weaker model that requires multiple retries or produces output that needs downstream correction can easily end up costing more than staying on the better one.
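If you want to sanity-check the compression point yourself, a tokenizer makes it concrete. This uses the tiktoken library; whether the stripped version actually comes out shorter depends on your text:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

original = "Please review the patient's chart, then summarize the key findings."
stripped = "Please review the patients chart then summarize the key findings"

# With subword tokenization, the "compressed" string isn't guaranteed to be
# shorter, and it's certainly harder for the model to read.
print(len(enc.encode(original)), len(enc.encode(stripped)))
```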
A note on TOON and structured formats
There’s a newer format worth knowing about called TOON (Token-Oriented Object Notation). The idea is to strip JSON’s syntactic overhead (quotes, braces, repeated keys) by declaring keys once, like a header row. For flat, table-like data (user lists, product catalogs, RAG reference chunks, uniform agent outputs), benchmarks report roughly 30-60% prompt-token reduction with equal or slightly better accuracy on structured retrieval tasks.
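For a feel of the difference, here’s roughly how the same records look in JSON versus TOON. The exact syntax is defined by the TOON spec; this is just the shape:

```
JSON:
[{"id": 1, "name": "Ada", "role": "admin"},
 {"id": 2, "name": "Bob", "role": "viewer"}]

TOON:
users[2]{id,name,role}:
  1,Ada,admin
  2,Bob,viewer
```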
It’s a real technique, but it comes with caveats. For deeply nested or irregular objects, the savings shrink or reverse. JSON is deeply embedded in LLM training data; TOON is a format the model has seen far less of, which introduces some fragility without fine-tuning. You’re also adding a serialization layer (converters, validation, debugging in a nonstandard format), and that has real engineering cost.
The right framing for TOON: keep JSON in your APIs and storage. Use TOON only at the LLM boundary, and only when your prompts are heavy on structured data and the token savings are large enough to justify the complexity. It’s a specialized tool, not a universal one.
When not to optimize
For small documents, low-volume internal tools, or safety-critical tasks, it’s often better to overspend on tokens than to risk degraded quality or missed edge cases. Token optimization has a cost in engineering time, added complexity, and potential quality tradeoffs. The ROI only makes sense at the hotspots.
The teams that get this right aren’t the ones that optimize everything. They’re the ones that instrument first, identify the real cost drivers, and apply targeted effort where it makes the biggest difference.
Token spend is a design output. If you’re surprised by your LLM bill, the answer isn’t a cheaper model. It’s a closer look at what you’re asking the model to do, and why.

