Agentic AI · the missing meter · 30 Jun 2026

AI agents don't buy tokens. They buy outcomes.

A router reads a price tag and picks the cheapest model. Then the task runs — and the receipt keeps printing long after the price stops.

Price per million tokens · router's view
Small model: $0.75 input (vs $5+ on the frontier tier)
Router selects: cheapest
This number never changes during the task.

Receipt · one finished task
First attempt: plan, tool call, observation (base cost)
Re-read context: prompt + prior output, again (+ input tokens)
Schema error: answer came back in the wrong format (+ retry)
Tool call x3: read files, run, read again (+ input tokens)
Fallback to a stronger model: cheap one stalled (+ frontier price)
Human repair: someone fixes the patch by hand (+ minutes)
Cost per successful task: the real bill. Price tag, unchanged: $0.75

The cheapest model is not always the cheapest worker.

In a chat box, price per token feels like a useful number. You ask, it answers, you count the tokens. But agents do not behave like chat. They plan, call tools, read files, retry, fail formats, recover, and sometimes hand the job to a stronger model anyway. Every loop adds context. Every context gets re-read. The price tag stays small while the receipt grows.

The mistake is not "cheap models are bad." That is lazy and false. The mistake is subtler: we optimise the cost of a step instead of the cost of finishing the job. Cheapness only becomes expensive when the system never measures completion.

Cheap tokens are not cheap if they do not finish.

Scope: The measured evidence here comes mainly from coding agents, where trajectories, tools, tests and success conditions are observable. The discipline very likely applies more broadly; the exact magnitudes do not.

01 · the snowball · why the receipt grows

In chat you pay for the answer. In agents you keep paying for memory.

The question: where does an agent's cost actually go?

A chat is one short line: prompt in, answer out. An agent is a loop — plan, call a tool, read what it returns, then fold the whole growing context back into the next request and do it again. The output stays a small visible sliver; the input side snowballs.

Chat is a line. An agent is a loop. · input dominates · output is the sliver
A chat turn (prompt to answer): one request. You mostly pay for the answer.
An agent task (prompt to final): many requests. Each re-reads the last. You pay, repeatedly, for the memory.
Bars, chat vs agent: input tokens (the memory re-read each loop) against output tokens (the answer).

Cited: Bai et al. (2026), studying eight frontier models on SWE-bench Verified, found that agentic coding consumed about 1000x more tokens than code reasoning or chat, and that input tokens drove overall cost. Stanford's Digital Economy Lab calls it the context snowball. Cursor (2026) puts input above 90% of non-cache token volume. One honest nuance: input and saved-context (cache) tokens are cheaper per token than output, but the sheer volume can still dominate the receipt. The bar widths are schematic, not one measured run.
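The snowball is easy to see with plain arithmetic. What follows is a minimal sketch, not a measurement: it assumes every request re-sends the whole context with no caching, and the function name and every token count are made up for illustration.

```python
def agent_token_totals(steps, prompt_tokens, observation_tokens, output_tokens):
    """Return (total_input, total_output) tokens for a looped task.

    Assumption: each request re-reads the full context so far, and each step
    appends the model's output plus one tool observation to that context.
    """
    context = prompt_tokens
    total_input = 0
    total_output = 0
    for _ in range(steps):
        total_input += context           # the whole context is read again
        total_output += output_tokens    # the visible sliver
        context += output_tokens + observation_tokens
    return total_input, total_output


# Hypothetical sizes: a 2,000-token prompt, 200-token replies,
# 1,500-token tool observations.
chat_input, chat_output = agent_token_totals(1, 2_000, 0, 200)
agent_input, agent_output = agent_token_totals(20, 2_000, 1_500, 200)

input_share = agent_input / (agent_input + agent_output)
print(f"chat:  {chat_input} in, {chat_output} out")
print(f"agent: {agent_input} in, {agent_output} out ({input_share:.1%} input)")
```

With these invented sizes, twenty loops read 363,000 input tokens to produce 4,000 output tokens. The growth is quadratic in the number of steps, which is why a longer trajectory costs far more than its extra steps suggest.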

The cost is not in the answer. It is in re-reading the question, over and over.

02 · the surprise · the agent can't price its own journey

The bill is not just large. It is unpredictable, even to the agent.

The question: could you just estimate the cost up front?

If agents could forecast their own spend, you would not need a meter — you would trust the quote. They cannot. The same task can vary enormously run to run, and models are bad at predicting their own usage before they start.

Estimate before · receipt after · same task · up to 30x apart
Before the task, the estimate: “This should be cheap.” A confident quote. But a model's own cost prediction correlates with reality only up to 0.39, barely better than a hunch.
After the task, the receipt: retry · retry · tool call · re-read · fallback · patch · test · human repair

Cited: Bai et al. (2026) found that runs on the same task varied by up to 30x in total tokens, that higher token use did not reliably improve accuracy, and that models systematically underestimated their own usage (self-prediction correlation up to 0.39).

The agent cannot price the journey before it walks it. So you need a meter after the run, not confidence before it.
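Both quantities in this chapter, the run-to-run spread and the link between estimate and receipt, can be computed from your own logs. The sketch below is one hedged way to do it. The helper names and all the token figures are hypothetical, chosen only to show the calculation, and the correlation is a plain Pearson coefficient, which may not be the statistic the cited paper uses.

```python
import math


def spread_ratio(token_totals):
    """Largest run divided by smallest run, for repeated runs of one task."""
    return max(token_totals) / min(token_totals)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log: tokens the model predicted vs tokens each run actually used.
predicted = [100_000, 150_000, 120_000, 200_000, 400_000]
actual = [120_000, 480_000, 90_000, 2_700_000, 300_000]

spread = spread_ratio(actual)
fit = pearson(predicted, actual)
print(f"spread: {spread:.0f}x, estimate-vs-receipt correlation: {fit:.2f}")
```

A wide spread with a weak correlation is the signal that up-front quotes cannot stand in for a meter.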

03 · the route flip · watch cheap become expensive

The same cheap model, two different receipts.

The question: does the cheapest model finish cheapest?

Pick a task. Two routes run it — a cheap-first route (start on the small model) and a measured route (use a stronger model where the data says it pays). Both receipts print. The winner is whoever finishes for less, after the failures are counted.

Choose the task · both receipts print · illustrative shape, not measured data
Interactive: for each task (for example, a small, well-specified task), two receipts print the cost to finish. One is Cheap-first (small model); the other is Measured route (stronger where it pays).

Calculator: put in your own numbers
cost per success = tokens × price × attempts ÷ success
Cheap-first (small): retries, format repairs, tool loops. Shows price / attempt, trajectory cost, cost per success.
Measured route (mid / frontier): fewer retries, fewer fallbacks. Shows price / attempt, trajectory cost, cost per success.
The routes re-rank by whichever unit is selected.

Illustrative: The receipts show the shape of an honest comparison, not a verdict about any named model. The shape is plausible because token use varies up to 30x on the same task and does not reliably track accuracy (Bai et al. 2026). Your real receipts would be built from your own runs.
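The calculator's formula, tokens × price × attempts ÷ success, can be worked by hand. The sketch below is one reading of it, under stated assumptions: a single blended price per million tokens, attempts counted per task, and success as a rate between 0 and 1. The $0.75 and $5 prices echo the price tag at the top of this page; every token count, attempt count and success rate is invented.

```python
def cost_per_success(tokens_per_attempt, price_per_million, attempts, success_rate):
    """tokens x price x attempts / success, in dollars per successful task."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = tokens_per_attempt / 1_000_000 * price_per_million
    return cost_per_attempt * attempts / success_rate


# Hypothetical inputs for two task shapes.
simple_task = {
    "cheap-first": cost_per_success(100_000, 0.75, attempts=1.1, success_rate=0.96),
    "measured": cost_per_success(100_000, 5.00, attempts=1.0, success_rate=0.98),
}
hard_task = {
    "cheap-first": cost_per_success(1_200_000, 0.75, attempts=4, success_rate=0.40),
    "measured": cost_per_success(800_000, 5.00, attempts=1.2, success_rate=0.93),
}

for name, routes in (("simple", simple_task), ("hard", hard_task)):
    winner = min(routes, key=routes.get)
    print(name, {r: round(c, 2) for r, c in routes.items()}, "->", winner)
```

With these made-up inputs the cheap route wins the simple task and loses the hard one, even though its price per token never moves. Change the numbers and the ranking can flip either way, which is the point.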

Same model, same price tag. A different bill, once completion is counted.

04 · the honest part · this is not premium-model propaganda

The cheap model still wins simple work. It just has to win after the retries.

The question: when does cheap-first actually pay off?

You saw it in the receipts: on a short, well-specified task the cheap model finishes on the first try and is genuinely cheaper. The point was never which model. The point is that the comparison must include the failures, not just the sticker. Cheap-first earns the simple work; it loses the hard work to its own retries.

The cheap model is allowed to win. It just is not allowed to skip the retries.

05 · the unique part · a ledger, not a leaderboard

What is missing is not a better model. It is an outcome ledger.

The question: what would change a router's mind?

Most writing on this becomes vendor comparison: this model versus that one. That lane is crowded and goes stale every release. The durable lane is measurement. A good router should not only know prices — it should keep a live ledger, by task type, and let the measured history decide.

The ledger fills as real tasks run · decision metric: cost per success, by task type
task type | best route | retry | success | price | why it wins
small edit | small model | 1.1x | 96% | low | finishes first try; price wins
bug + patch | mid model | 1.4x | 89% | medium | fewer retries beat the low sticker
migration | frontier | 1.2x | 93% | high | completion is cheaper than looping

The routing rule stops being "which model is cheapest?" and becomes "which model has the lowest measured cost to finish this kind of task?" — a different answer for an edit, a bug fix, and a migration. The numbers are illustrative; the structure is the contribution.
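As a rough sketch of that routing rule, a ledger can be little more than running totals keyed by task type and route. The class name, method names and the logged costs below are all hypothetical. A production router would also need enough samples per row and some exploration of untried routes, which this sketch only gestures at with min_runs.

```python
import math
from collections import defaultdict


class OutcomeLedger:
    """Running totals of cost and success, keyed by (task_type, route)."""

    def __init__(self):
        # (task_type, route) -> [total_cost, runs, successes]
        self._rows = defaultdict(lambda: [0.0, 0, 0])

    def log(self, task_type, route, cost, succeeded):
        row = self._rows[(task_type, route)]
        row[0] += cost              # failed runs still cost money
        row[1] += 1
        row[2] += int(succeeded)

    def cost_per_success(self, task_type, route):
        total_cost, _runs, successes = self._rows.get(
            (task_type, route), (0.0, 0, 0)
        )
        return total_cost / successes if successes else math.inf

    def best_route(self, task_type, min_runs=1):
        """Route with the lowest measured cost per success, or None if unmeasured."""
        candidates = {
            route: self.cost_per_success(t, route)
            for (t, route), (_cost, runs, _ok) in self._rows.items()
            if t == task_type and runs >= min_runs
        }
        if not candidates:
            return None
        return min(candidates, key=candidates.get)


ledger = OutcomeLedger()
# Hypothetical run costs in dollars, with success flags.
for cost, ok in [(0.40, False), (0.55, False), (0.35, True), (0.60, False)]:
    ledger.log("bug + patch", "small model", cost, ok)
for cost, ok in [(0.90, True), (1.10, True), (0.95, False), (1.00, True)]:
    ledger.log("bug + patch", "mid model", cost, ok)

print(ledger.best_route("bug + patch"))   # the route with the lowest cost to finish
print(ledger.best_route("migration"))     # no history yet, so no answer
```

Returning None for an unmeasured task type is deliberate: with no history, the ledger has no opinion, and the router should say so rather than fall back to the sticker price silently.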

Stop ranking models. Start logging outcomes.

Before you change a routing policy, can you answer all of these?

A ledger only works if the run is actually metered. Here is what the meter has to record — and the load-bearing honesty: a higher per-token price is justified only after measured tokens-to-completion prove it lowers cost per successful outcome. Tick what you can actually measure today.

Records: tokens to finish, including the re-reads · input + output + cache
Records: retries, format failures, tool calls · trajectory length
Records: fallbacks to a stronger model · fallback rate
Records: whether it actually succeeded · success flag
Divides by: cost to finish, per successful task · cost / success
You do not have a cost strategy. You have a price preference. Without these measurements, "route to the cheaper model" is a guess wearing the costume of a decision.
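The meter fields listed above map naturally onto one record per run. The sketch below is an assumption about how such a record might look, not a standard schema: the field names are hypothetical, and the per-million prices are placeholders (the input price borrows this page's indicative small-tier figure; the output and cache-read prices are invented).

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    task_type: str
    route: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    retries: int
    format_failures: int
    tool_calls: int
    fell_back: bool
    succeeded: bool
    human_repair_minutes: float = 0.0   # logged, not priced, in this sketch

    def token_cost(self, prices):
        """Dollar cost, given per-million prices keyed input/output/cache_read."""
        return (
            self.input_tokens * prices["input"]
            + self.output_tokens * prices["output"]
            + self.cache_read_tokens * prices["cache_read"]
        ) / 1_000_000


def summarise(records, prices):
    """Fallback rate and cost per successful task over a non-empty list of runs."""
    if not records:
        raise ValueError("no runs metered yet")
    total_cost = sum(r.token_cost(prices) for r in records)
    successes = sum(r.succeeded for r in records)
    return {
        "runs": len(records),
        "fallback_rate": sum(r.fell_back for r in records) / len(records),
        "cost_per_success": total_cost / successes if successes else float("inf"),
    }


# Placeholder prices per million tokens.
prices = {"input": 0.75, "output": 3.00, "cache_read": 0.10}
runs = [
    RunRecord("bug + patch", "small model", 300_000, 5_000, 100_000,
              retries=0, format_failures=0, tool_calls=6,
              fell_back=False, succeeded=True),
    RunRecord("bug + patch", "small model", 900_000, 12_000, 400_000,
              retries=3, format_failures=2, tool_calls=14,
              fell_back=True, succeeded=False, human_repair_minutes=20),
]
summary = summarise(runs, prices)
print(summary)
```

The failed run's cost stays in the numerator while only the successful run counts in the denominator. That asymmetry is what separates cost per success from average cost per run.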


The only honest AI cost metric

Do not route to the cheapest model. Route to the cheapest completed outcome.

A cheap route is not proven by a low price. It is proven by a completed task, counted honestly. A price tag is not a receipt.

Log the attempts.
Log the fallbacks.
Log the failures.
Log the human repairs.
Then route the next task from the ledger, not the leaderboard.

The future router is not cheapest-first. It is evidence-first.

The same instinct, in other places: measure the outcome, not the surface.

Each companion essay takes one published record or system and refuses to score it on what is merely visible. This page is the argument; the companion essays are that argument applied in the field.

Sources, vendor pricing, the honest limits, and how to read the interactives

The evidence (Cited). Bai et al., How Do AI Agents Spend Your Money? (2026, arXiv 2604.22750) analysed trajectories from eight frontier models on SWE-bench Verified: agentic coding tasks consumed roughly 1000x more tokens than code reasoning or chat; input tokens (not output) drove overall cost; runs on the same task varied by up to 30x in total tokens; higher token use did not reliably improve accuracy; and models underestimated their own usage, with self-prediction correlations only up to 0.39. Stanford's Digital Economy Lab summarised the same effect as a "context snowball" — agents re-read the prompt, prior responses and tool outputs each step. Cursor's 2026 developer-habits report shows the input/output token ratio rising from about 4.5x in January to roughly 11–13x by April–May, with input above 90% of non-cache token volume, and input context rising toward about 70% of price-equivalent cost — while noting input and cache-read tokens are cheaper per token than output.

Pricing (Cited, but volatile). Vendor pricing pages show large per-token gaps between frontier and smaller tiers, and note that tool-use pricing bills input tokens, output tokens, and possibly server-side tool charges. Exact list prices change often, so this essay leans on the gap between tiers and one indicative small-tier figure, not on any single price staying current.

Scope and confidence. The measured base is mainly coding agents, a strong domain because tasks have observable trajectories, tools, tests, retries and success conditions. High confidence that agentic workloads are token-heavy, input-heavy, variable and hard to predict. Medium confidence that outcome-based routing beats price-per-token routing in every production setting, because that depends on measured task data you would collect. The verdict is stated as a discipline ("measure, then route"), not a universal empirical result.

How to read the interactives. The hero receipt is a schematic of where agent cost accrues, not a real invoice. The Chapter 3 guided receipts show the shape of an honest comparison for three task types; the numbers are illustrative, not a verdict about any named model. The optional calculator inside it computes cost per success from numbers you set; its starting values are an illustrative example, not a cited run. The Chapter 5 ledger numbers are illustrative; the structure — cost per success, by task type — is the actual contribution. Nothing here invents a benchmark result, a dollar figure, or a named-model verdict as fact.

The lines to take away. Cheap tokens are not cheap if they do not finish. A price tag is not a receipt. Route to the cheapest completed outcome.

Evidence: Bai et al. 2026 · Stanford Digital Economy Lab 2026 · Cursor 2026 · vendor pricing pages
Framing, the meter / receipt / ledger system, the guided receipts, illustration and essay: Michael Cengkuru · Published 30 Jun 2026