Agentic AI · the missing meter · 30 Jun 2026

AI agents don't buy tokens. They buy outcomes.

A router reads a price tag and picks the cheapest model. Then the task runs, and the receipt keeps printing long after the price stops.


Price per million tokens · router's view

Small model: $0.75 input · vs $5+ on the frontier tier

Router selects: cheapest

This number never changes during the task.

Receipt · one finished task · 6 line items

First attempt · plan, tool call, observation · base cost
Re-read context · prompt + prior output, again · + input tokens
Schema error · answer came back in the wrong format · + retry
Tool call x3 · read files, run, read again · + input tokens
Fallback to a stronger model · cheap one stalled · + frontier price
Human repair · someone fixes the patch by hand · + minutes

Cost per successful task · the real bill · price tag, unchanged: $0.75

Read the receipt

The cheapest model is not always the cheapest worker.

In a chat box, price per token feels like a useful number. You ask, it answers, you count the tokens. But agents do not behave like chat. They plan, call tools, read files, retry, fail formats, recover, and sometimes hand the job to a stronger model anyway. Every loop adds context. Every context gets re-read. The price tag stays small while the receipt grows.

The mistake is not "cheap models are bad." That is lazy and false. The mistake is subtler: we optimise the cost of a step instead of the cost of finishing the job. Cheapness only becomes expensive when the system never measures completion.

Cheap tokens are not cheap if they do not finish.

Evidence used here · all linked

Bai et al. 2026 · Stanford Digital Economy Lab 2026 · Cursor Developer Habits 2026 · Vendor pricing & full notes below

Scope. The measured evidence here comes mainly from coding agents, where trajectories, tools, tests and success conditions are observable. The discipline very likely applies more broadly; the exact magnitudes do not.

01 · the snowball · why the receipt grows

In chat you pay for the answer. In agents you keep paying for memory.

The question: where does an agent's cost actually go?

A chat is one short line: prompt in, answer out. An agent is a loop: plan, call a tool, read what it returns, then fold the whole growing context back into the next request and do it again. The output stays a small visible sliver; the input side snowballs.
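The loop can be sketched in a few lines. This is a toy model, not a measurement: the token sizes (`prompt_tokens`, `output_per_step`, `tool_result_per_step`) are invented for illustration, and it assumes the full context is re-sent every step with no caching discount.

```python
def simulate_agent_tokens(steps, prompt_tokens=2_000,
                          output_per_step=300, tool_result_per_step=1_500):
    """Return (total_input, total_output) for a loop that re-sends its whole context each step.

    All sizes are hypothetical defaults chosen only to show the shape of the growth.
    """
    context = prompt_tokens
    total_input = 0
    total_output = 0
    for _ in range(steps):
        total_input += context            # the whole context is read again as input
        total_output += output_per_step   # the visible answer for this step
        # the reply and the tool result are folded back into the next request
        context += output_per_step + tool_result_per_step
    return total_input, total_output


chat_in, chat_out = simulate_agent_tokens(steps=1)
agent_in, agent_out = simulate_agent_tokens(steps=20)
print(f"chat : input={chat_in:>8,}  output={chat_out:>6,}")
print(f"agent: input={agent_in:>8,}  output={agent_out:>6,}")
```

Output grows linearly with the number of steps, while input grows roughly with the square of it, because every earlier step is paid for again in each later one.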

Chat is a line. An agent is a loop. input dominates · output is the sliver

A chat turn

One request. You mostly pay for the answer.

An agent task

Many requests. Each re-reads the last. You pay, repeatedly, for the memory.

Schematic bars: chat vs agent · legend: input tokens (the memory re-read each loop) · output tokens (the answer)

Cited. Bai et al. (2026) analysed eight frontier models on SWE-bench Verified, a standard benchmark of real software fixes with pass/fail tests: agentic coding consumed about 1000x more tokens than code reasoning or chat, and input tokens drove overall cost. Stanford's Digital Economy Lab calls it the context snowball. Agents also save and re-read context (the cache). Cursor (2026) puts fresh input above 90% of non-cached token volume, and its input/output ratio rose from about 4.5x in January to roughly 11x to 13x by April and May. One honest nuance: input and cached tokens are cheaper per token than output, but the sheer volume can still dominate the receipt. The bar widths are schematic, not one measured run.

The cost is not in the answer. It is in re-reading the question, over and over.

02 · the surprise · the agent can't price its own journey

The bill is unpredictable, even to the agent.

The question: could you just estimate the cost up front?

If agents could forecast their own spend, you would not need a meter. You would trust the quote. They cannot. The same task can vary enormously run to run, and models are bad at predicting their own usage before they start.

Estimate before · receipt after · same task · up to 30x apart

Before the task: the estimate

“This should be cheap.”

A confident quote. But a model's own cost prediction correlates with reality only up to 0.39, barely better than a hunch.

After the task: the receipt

retry · retry · tool call re-read · fallback · patch test · human repair

Cited. Bai et al. (2026): runs on the same task varied by up to 30x in total tokens, higher token use did not reliably improve accuracy, and models systematically underestimated their own usage (self-prediction correlation up to 0.39).
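If you log your own runs, both quantities are easy to compute. The sketch below assumes you have, per run, the tokens actually used and the model's own up-front estimate. The sample numbers are invented and are not taken from Bai et al.; the paper's exact correlation method is not stated here, so plain Pearson correlation is used as a stand-in.

```python
from math import sqrt


def spread_ratio(token_counts):
    """Largest run divided by smallest run for the same task."""
    return max(token_counts) / min(token_counts)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical repeated runs of one task (tokens actually used)...
actual_tokens = [40_000, 95_000, 310_000, 1_200_000]
# ...and the hypothetical estimate the model gave before each run.
predicted_tokens = [60_000, 50_000, 80_000, 70_000]

print(f"spread across runs: {spread_ratio(actual_tokens):.0f}x")
print(f"estimate vs actual correlation: {pearson(predicted_tokens, actual_tokens):.2f}")
```

A wide spread together with a weak correlation is the signal that quotes cannot replace metering for that task type.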

The agent cannot price the journey before it walks it. So you need a meter after the run, not confidence before it.

03 · the route flip · watch cheap become expensive

The same cheap model, two different receipts.

The question: does the cheapest model finish cheapest?

Pick a task. Two routes run it: a cheap-first route (start on the small model) and a measured route (use a stronger model where the data says it pays). Both receipts print. The winner is whoever finishes for less, after the failures are counted.

Choose the task · both receipts print · illustrative shape · not measured data

Fix a bug and return a patch that passes the tests.

Cheap-first · small model · loses

first attempt · 1.0
format error · + retry
retry · +1.0
fallback to stronger · +3.0
human repair · + minutes
cost to finish · 8.2

Measured route · stronger where it pays · wins

first attempt · 3.0
one retry · +1.0
tests pass · ok
cost to finish · 4.4

Cheap loses after retries. The low sticker is real, but the small model loops, falls back, and needs a human anyway. The stronger route finishes for less once completion is counted.

Show the calculator and put in your own numbers

tokens × price × attempts ÷ success

Cheap-first · small · ranked first by price per token

Token price (relative): 1.0x
Avg attempts to finish: 4.5 (retries, format repairs, tool loops)
Success rate: 55%

price / attempt: 1.00 · cost of all attempts: 4.50
cost per success: 8.18

Measured route · mid / frontier · ranked first by cost per success

Token price (relative): 3.0x
Avg attempts to finish: 1.3 (fewer retries, fewer fallbacks)
Success rate: 88%

price / attempt: 3.00 · cost of all attempts: 3.90
cost per success: 4.43

Drag any slider. Ranked by price per token, cheap-first is the obvious pick. Switch to cost per success and watch the same numbers re-rank.
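The calculator's arithmetic fits in one function. This is a minimal sketch using the illustrative starting values above; `cost_per_success` is a name chosen here, and "price per attempt" stands for tokens × price in the formula.

```python
def cost_per_success(price_per_attempt, avg_attempts, success_rate):
    """Expected spend per finished task: price x attempts / success rate."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return price_per_attempt * avg_attempts / success_rate


# Illustrative inputs from the calculator above, not measured data.
routes = {
    "cheap-first": {"price_per_attempt": 1.0, "avg_attempts": 4.5, "success_rate": 0.55},
    "measured":    {"price_per_attempt": 3.0, "avg_attempts": 1.3, "success_rate": 0.88},
}

for name, params in routes.items():
    print(f"{name:<12} cost per success = {cost_per_success(**params):.2f}")
# cheap-first  cost per success = 8.18
# measured     cost per success = 4.43
```

Sorting by `price_per_attempt` puts cheap-first on top; sorting by the function's result reverses the order, which is the re-ranking the calculator demonstrates.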

Illustrative. The receipts show the shape of an honest comparison, not a verdict about any named model. The shape is plausible because token use varies up to 30x on the same task and does not reliably track accuracy (Bai et al. 2026). Your real receipts would be built from your own runs.

Same model, same price tag. A different bill, once completion is counted.

04 · the honest part · this is not premium-model propaganda

The cheap model still wins simple work. It just has to win after the retries.

The question: when does cheap-first actually pay off?

You saw it in the receipts: on a short, well-specified task the cheap model finishes on the first try and is genuinely cheaper. The point was never which model. The point is that the comparison must include the failures, not just the sticker. Cheap-first earns the simple work; it loses the hard work to its own retries.

The cheap model is allowed to win. It just is not allowed to skip the retries.

05 · the unique part · a ledger, not a leaderboard

What is missing is not a better model. It is an outcome ledger.

The question: what would change a router's mind?

Most writing on this becomes vendor comparison: this model versus that one. That lane is crowded and goes stale every release. The durable lane is measurement. A good router should not only know prices. It should keep a live ledger, by task type, and let the measured history decide.

The ledger fills as real tasks run · decision metric · cost per success, by task type

| task type | best route | retry | success | price | why it wins |
| small edit | small model | 1.1x | 96% | low | finishes first try; price wins |
| bug + patch | mid model | 1.4x | 89% | medium | fewer retries beat the low sticker |
| migration | frontier | 1.2x | 93% | high | completion is cheaper than looping |

The routing rule stops being "which model is cheapest?" and becomes "which model has the lowest measured cost to finish this kind of task?", a different answer for an edit, a bug fix, and a migration. The numbers are illustrative; the structure is the contribution.
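One way such a ledger could look in code, as a sketch only: `OutcomeLedger`, `RouteStats`, and the `min_runs` threshold are hypothetical names and choices made here, and the recorded runs are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RouteStats:
    total_cost: float = 0.0
    runs: int = 0
    successes: int = 0

    def cost_per_success(self):
        # A route that has never succeeded is treated as infinitely expensive.
        return self.total_cost / self.successes if self.successes else float("inf")


class OutcomeLedger:
    """Tracks measured cost per success for each (task type, route) pair."""

    def __init__(self):
        self._stats = defaultdict(RouteStats)

    def record(self, task_type, route, cost, succeeded):
        stats = self._stats[(task_type, route)]
        stats.total_cost += cost
        stats.runs += 1
        stats.successes += int(succeeded)

    def best_route(self, task_type, min_runs=1):
        """Route with the lowest measured cost per success, or None if nothing is logged."""
        candidates = {
            route: stats
            for (logged_type, route), stats in self._stats.items()
            if logged_type == task_type and stats.runs >= min_runs
        }
        if not candidates:
            return None
        return min(candidates, key=lambda route: candidates[route].cost_per_success())


ledger = OutcomeLedger()
for succeeded in (False, False, False, True):      # hypothetical small-model runs
    ledger.record("bug + patch", "small", cost=1.0, succeeded=succeeded)
for succeeded in (True, True):                     # hypothetical mid-model runs
    ledger.record("bug + patch", "mid", cost=3.0, succeeded=succeeded)
ledger.record("small edit", "small", cost=1.0, succeeded=True)

print(ledger.best_route("bug + patch"))  # mid
print(ledger.best_route("small edit"))   # small
```

A real router would likely want a higher `min_runs` and some exploration of untested routes, so that a route is not locked out by a few unlucky early runs.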

Stop ranking models. Start logging outcomes.

Before you change a routing policy, can you answer all of these?

A ledger only works if the run is actually metered. Here is what the meter has to record, and the load-bearing honesty: a higher per-token price is justified only after measured tokens-to-completion prove it lowers cost per successful outcome. Tick what you can actually measure today.

records · Tokens to finish, including the re-reads · input + output + cache
records · Retries, format failures, tool calls · trajectory length
records · Falls back to a stronger model · fallback rate
records · Whether it actually succeeded · success flag
divides by · Cost to finish, per successful task · cost / success
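The meter fields listed above map naturally onto one record per run. A minimal sketch, assuming a hypothetical `RunRecord` shape and a `cost` field already converted to whatever unit you bill in; none of these names come from a real library.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    retries: int
    format_failures: int
    tool_calls: int
    fell_back: bool      # handed off to a stronger model
    succeeded: bool      # passed the task's acceptance test
    cost: float          # total spend for this run, all attempts included


def summarize(runs):
    """Aggregate logged runs into the numbers a routing decision needs."""
    if not runs:
        raise ValueError("no runs logged")
    successes = sum(r.succeeded for r in runs)
    total_cost = sum(r.cost for r in runs)
    return {
        "runs": len(runs),
        "fallback_rate": sum(r.fell_back for r in runs) / len(runs),
        "success_rate": successes / len(runs),
        "cost_per_success": total_cost / successes if successes else float("inf"),
    }


# Two made-up runs: one clean success, one failed run that fell back.
runs = [
    RunRecord(12_000, 800, 4_000, 0, 0, 3, fell_back=False, succeeded=True, cost=1.0),
    RunRecord(90_000, 2_500, 30_000, 3, 2, 11, fell_back=True, succeeded=False, cost=3.0),
]
print(summarize(runs))
```

The failed run's cost still counts in the numerator. That is the point of dividing by successes rather than by runs.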

You do not have a cost strategy

You have a price preference. Without these measurements, "route to the cheaper model" is a guess wearing the costume of a decision.


The minimum honest AI cost metric

Do not route to the cheapest model. Route to the cheapest accepted outcome.

A cheap route is not proven by a low price, nor by the model reporting done. It is proven by work that passes the task's acceptance test, counted honestly. A price tag is not a receipt. For routine testable work, cost per accepted outcome is enough; for legal, financial, security, or public-money decisions, the metric has to carry quality and risk too.

Log the attempts.
Log the fallbacks.
Log the failures.
Log the human repairs.
Then route the next task from the ledger, not the leaderboard.

The future router is not cheapest-first. It is evidence-first.


The same instinct, in other places: measure the outcome, not the surface.

Each companion essay takes one published record or system and refuses to score it on what is merely visible. This page is the argument; those are it in the field.

Visible vs verifiable

The Road May Be Complete. The Record Is Not.

Malawi marks a road complete, priced, finished. The proof you would need to verify it is missing across all 162 projects. Presence is not the same as completion.


Access vs proof

The Door Is Not the Verdict

Transparency does not make a project good. It lets you test whether it is. Six inspection tests on a real Zambian record, the same gap between what you can see and what you can trust.


Visible vs followable

When a Project Becomes Public

A project can be visible in concrete and still hidden in truth. Five links turn a thing people merely see into one they can follow, read, test, measure, and act on.


Sources, vendor pricing, the honest limits, and how to read the interactives

The evidence (Cited). Bai et al., How Do AI Agents Spend Your Money? (2026, arXiv 2604.22750) analysed trajectories from eight frontier models on SWE-bench Verified: agentic coding tasks consumed roughly 1000x more tokens than code reasoning or chat; input tokens (not output) drove overall cost; runs on the same task varied by up to 30x in total tokens; higher token use did not reliably improve accuracy; and models underestimated their own usage, with self-prediction correlations only up to 0.39. Stanford's Digital Economy Lab summarised the same effect as a "context snowball": agents re-read the prompt, prior responses and tool outputs each step. Cursor's 2026 developer-habits report shows the input/output token ratio rising from about 4.5x in January to roughly 11x to 13x by April and May, with input above 90% of non-cache token volume, and input context rising toward about 70% of price-equivalent cost, while noting input and cache-read tokens are cheaper per token than output.

Pricing (Cited, but volatile). Vendor pricing pages show large per-token gaps between frontier and smaller tiers, and note that tool-use pricing bills input tokens, output tokens, and possibly server-side tool charges. Exact list prices change often, so this essay leans on the gap between tiers and one indicative small-tier figure, not on any single price staying current.

Scope and confidence. The measured base is mainly coding agents, a strong domain because tasks have observable trajectories, tools, tests, retries and success conditions. High confidence that agentic workloads are token-heavy, input-heavy, variable and hard to predict. Medium confidence that outcome-based routing beats price-per-token routing in every production setting, because that depends on measured task data you would collect. The verdict is stated as a discipline ("measure, then route"), not a universal empirical result.

How to read the interactives. The hero receipt is a schematic of where agent cost accrues, not a real invoice. The Chapter 3 guided receipts show the shape of an honest comparison for three task types; the numbers are illustrative, not a verdict about any named model. The optional calculator inside it computes cost per success from numbers you set; its starting values are an illustrative example, not a cited run. The Chapter 5 ledger numbers are illustrative; the structure, cost per success by task type, is the actual contribution. Nothing here invents a benchmark result, a dollar figure, or a named-model verdict as fact.

The lines to take away. Cheap tokens are not cheap if they do not finish. A price tag is not a receipt. Route to the cheapest completed outcome.

Prefer it in plain text? Download the plain version (Markdown): the whole argument, the data table with evidence grades, and the sources, no interactives required.


Evidence: Bai et al. 2026 · Stanford Digital Economy Lab 2026 · Cursor 2026 · vendor pricing pages
Framing, the meter / receipt / ledger system, the guided receipts, illustration and essay: Michael Cengkuru · Published 30 Jun 2026