AI agents don't buy tokens. They buy outcomes.
A router reads a price tag and picks the cheapest model. Then the task runs — and the receipt keeps printing long after the price stops.
The cheapest model is not always the cheapest worker.
In a chat box, price per token feels like a useful number. You ask, it answers, you count the tokens. But agents do not behave like chat. They plan, call tools, read files, retry, fail formats, recover, and sometimes hand the job to a stronger model anyway. Every loop adds context. Every context gets re-read. The price tag stays small while the receipt grows.
The mistake is not "cheap models are bad." That is lazy and false. The mistake is subtler: we optimise the cost of a step instead of the cost of finishing the job. Cheapness only becomes expensive when the system never measures completion.
Cheap tokens are not cheap if they do not finish.
Scope The measured evidence here comes mainly from coding agents, where trajectories, tools, tests and success conditions are observable. The discipline very likely applies more broadly; the exact magnitudes do not.
In chat you pay for the answer. In agents you keep paying for memory.
A chat is one short line: prompt in, answer out. An agent is a loop — plan, call a tool, read what it returns, then fold the whole growing context back into the next request and do it again. The output stays a small visible sliver; the input side snowballs.
Cited Bai et al. (2026), eight frontier models on SWE-bench Verified: agentic coding consumed about 1000x more tokens than code reasoning or chat, and input tokens drove overall cost. Stanford's Digital Economy Lab calls it the context snowball. Cursor (2026) puts input above 90% of non-cache token volume. One honest nuance: input and saved-context (cache) tokens are cheaper per token than output — but the sheer volume can still dominate the receipt. The bar widths are schematic, not one measured run.
The cost is not in the answer. It is in re-reading the question, over and over.
The bill is not just large. It is unpredictable, even to the agent.
If agents could forecast their own spend, you would not need a meter — you would trust the quote. They cannot. The same task can vary enormously run to run, and models are bad at predicting their own usage before they start.
Cited Bai et al. (2026): runs on the same task varied by up to 30x in total tokens, higher token use did not reliably improve accuracy, and models systematically underestimated their own usage (self-prediction correlation up to 0.39).
The agent cannot price the journey before it walks it. So you need a meter after the run, not confidence before it.
The same cheap model, two different receipts.
Pick a task. Two routes run it — a cheap-first route (start on the small model) and a measured route (use a stronger model where the data says it pays). Both receipts print. The winner is whoever finishes for less, after the failures are counted.
Show the calculator — put in your own numbers
tokens × price × attempts ÷ success
Drag any slider. The routes re-rank by whichever unit the toggle selects.
Illustrative The receipts show the shape of an honest comparison, not a verdict about any named model. The shape is plausible because token use varies up to 30x on the same task and does not reliably track accuracy (Bai et al. 2026). Your real receipts would be built from your own runs.
Same model, same price tag. A different bill, once completion is counted.
The cheap model still wins simple work. It just has to win after the retries.
You saw it in the receipts: on a short, well-specified task the cheap model finishes on the first try and is genuinely cheaper. The point was never which model. The point is that the comparison must include the failures, not just the sticker. Cheap-first earns the simple work; it loses the hard work to its own retries.
The cheap model is allowed to win. It just is not allowed to skip the retries.
What is missing is not a better model. It is an outcome ledger.
Most writing on this becomes vendor comparison: this model versus that one. That lane is crowded and goes stale every release. The durable lane is measurement. A good router should not only know prices — it should keep a live ledger, by task type, and let the measured history decide.
The routing rule stops being "which model is cheapest?" and becomes "which model has the lowest measured cost to finish this kind of task?" — a different answer for an edit, a bug fix, and a migration. The numbers are illustrative; the structure is the contribution.
Stop ranking models. Start logging outcomes.
Before you change a routing policy, can you answer all of these?
A ledger only works if the run is actually metered. Here is what the meter has to record — and the load-bearing honesty: a higher per-token price is justified only after measured tokens-to-completion prove it lowers cost per successful outcome. Tick what you can actually measure today.
0 of 8 measured · tick what you can prove
Do not route to the cheapest model. Route to the cheapest completed outcome.
A cheap route is not proven by a low price. It is proven by a completed task, counted honestly. A price tag is not a receipt.
Log the fallbacks.
Log the failures.
Log the human repairs.
Then route the next task from the ledger, not the leaderboard.
The future router is not cheapest-first. It is evidence-first.
The same instinct, in other places: measure the outcome, not the surface.
Each companion essay takes one published record or system and refuses to score it on what is merely visible. This page is the argument; those are it in the field.
The Road May Be Complete. The Record Is Not.
Malawi marks a road complete, priced, finished. The proof you would need to verify it is missing — across all 162 projects. Presence is not the same as completion.
Run the scanner →The Door Is Not the Verdict
Transparency does not make a project good. It lets you test whether it is. Six inspection tests on a real Zambian record — the same gap between what you can see and what you can trust.
Run the tests →When a Project Becomes Public
A project can be visible in concrete and still hidden in truth. Five links turn a thing people merely see into one they can follow, read, test, measure, and act on.
Follow the chain →Sources, vendor pricing, the honest limits, and how to read the interactives
The evidence (Cited). Bai et al., How Do AI Agents Spend Your Money? (2026, arXiv 2604.22750) analysed trajectories from eight frontier models on SWE-bench Verified: agentic coding tasks consumed roughly 1000x more tokens than code reasoning or chat; input tokens (not output) drove overall cost; runs on the same task varied by up to 30x in total tokens; higher token use did not reliably improve accuracy; and models underestimated their own usage, with self-prediction correlations only up to 0.39. Stanford's Digital Economy Lab summarised the same effect as a "context snowball" — agents re-read the prompt, prior responses and tool outputs each step. Cursor's 2026 developer-habits report shows the input/output token ratio rising from about 4.5x in January to roughly 11–13x by April–May, with input above 90% of non-cache token volume, and input context rising toward about 70% of price-equivalent cost — while noting input and cache-read tokens are cheaper per token than output.
Pricing (Cited, but volatile). Vendor pricing pages show large per-token gaps between frontier and smaller tiers, and note that tool-use pricing bills input tokens, output tokens, and possibly server-side tool charges. Exact list prices change often, so this essay leans on the gap between tiers and one indicative small-tier figure, not on any single price staying current.
Scope and confidence. The measured base is mainly coding agents, a strong domain because tasks have observable trajectories, tools, tests, retries and success conditions. High confidence that agentic workloads are token-heavy, input-heavy, variable and hard to predict. Medium confidence that outcome-based routing beats price-per-token routing in every production setting, because that depends on measured task data you would collect. The verdict is stated as a discipline ("measure, then route"), not a universal empirical result.
How to read the interactives. The hero receipt is a schematic of where agent cost accrues, not a real invoice. The Chapter 3 guided receipts show the shape of an honest comparison for three task types; the numbers are illustrative, not a verdict about any named model. The optional calculator inside it computes cost per success from numbers you set; its starting values are an illustrative example, not a cited run. The Chapter 5 ledger numbers are illustrative; the structure — cost per success, by task type — is the actual contribution. Nothing here invents a benchmark result, a dollar figure, or a named-model verdict as fact.
The lines to take away. Cheap tokens are not cheap if they do not finish. A price tag is not a receipt. Route to the cheapest completed outcome.
Evidence: Bai et al. 2026 · Stanford Digital Economy Lab 2026 · Cursor 2026 · vendor pricing pages
Framing, the meter / receipt / ledger system, the guided receipts, illustration and essay: Michael Cengkuru · Published 30 Jun 2026