Token (LLM definition)
The atomic unit an LLM reads, writes, and bills by; roughly 4 characters or 0.75 words in English.
A token is the atomic unit of LLM input and output: the smallest chunk of text a model processes, bills for, and attends to. In English, one token is roughly 4 characters or 0.75 words. One thousand words runs about 1,300 tokens. The model never sees raw characters or whole words; it sees token IDs, and every forward pass operates on that sequence. Understanding tokens is not optional for AI PMs because the token is simultaneously the unit of computation, the unit of context, and the unit of billing. Conflating those three roles is where most cost surprises originate.
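As a rough illustration of the rule of thumb above, the conversion can be sketched in a few lines of Python. The function names are hypothetical, and the 4-characters and 0.75-words ratios are the English approximations quoted above, not the behavior of any real tokenizer. Actual counts depend on the model's tokenizer and should be measured on real traffic.

```python
def estimate_tokens_from_chars(text: str, chars_per_token: float = 4.0) -> int:
    """Heuristic English estimate: about 4 characters per token."""
    return round(len(text) / chars_per_token)


def estimate_tokens_from_words(word_count: int, words_per_token: float = 0.75) -> int:
    """Heuristic English estimate: about 0.75 words per token."""
    return round(word_count / words_per_token)


# 1,000 words lands near the "about 1,300 tokens" figure quoted above.
print(estimate_tokens_from_words(1000))  # 1333
```

An estimate like this is only good enough for a first-pass budget; it is not a substitute for counting tokens with the provider's own tooling.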
Why input and output tokens have different prices
This is the question interviewers ask next, and the answer matters. Input tokens are processed in a single parallel pass: the model reads your entire prompt at once, attending across all positions simultaneously. Output tokens are generated autoregressively: for each token the model produces, it runs a full forward pass, attends to everything in context, and picks the next token. That is why output is slower and costs more. The price gap is structural, not arbitrary.
Mid-2026 representative pricing, per 1M tokens:
| Model | Input | Output | Ratio |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 4x |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 5x |
| Gemini Flash | $0.30 | $2.50 | 8.3x |
The ratio matters as much as the absolute number. A feature that generates long outputs (summaries, drafts, reports) pays the output premium on every call. A feature that produces short structured outputs (classifications, yes/no verdicts, JSON fields) is much cheaper to run at scale. That distinction belongs in your product spec before you write the first line of code.
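To make the long-output versus short-output distinction concrete, here is a minimal sketch using the representative prices from the table above. The two features (an 800-token summary and a 10-token verdict, both with a 2,000-token prompt) are hypothetical workloads chosen for illustration, and the helper names are assumptions.

```python
# Representative USD prices per 1M tokens, copied from the table above.
PRICES_PER_1M = {
    "GPT-4o": (2.50, 10.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini Flash": (0.30, 2.50),
}


def output_to_input_ratio(model: str) -> float:
    """How many times more an output token costs than an input token."""
    input_price, output_price = PRICES_PER_1M[model]
    return output_price / input_price


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the table prices."""
    input_price, output_price = PRICES_PER_1M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical features that share the same 2,000-token prompt.
summary_cost = request_cost("Claude Sonnet 4.6", 2_000, 800)  # long draft
verdict_cost = request_cost("Claude Sonnet 4.6", 2_000, 10)   # short label

for model in PRICES_PER_1M:
    print(f"{model}: output costs {output_to_input_ratio(model):.1f}x input")
print(f"summary ${summary_cost:.5f} vs verdict ${verdict_cost:.5f} per request")
```

Under these assumptions the summary feature costs roughly three times as much per call as the verdict feature, even though the prompt is identical.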
Reasoning tokens: the cost most teams discover too late
Models with extended thinking or chain-of-thought reasoning (OpenAI o3, Claude with extended thinking enabled) produce reasoning tokens as an internal scratchpad before generating a visible reply. These tokens are billed at output rates. They do not appear in the response the user sees.
A visible 500-token answer can consume 3,000 or more billed output tokens when reasoning is active. That is a 6x multiplier on output cost, and it does not show up in any estimate based on response length. Teams that enable reasoning without modeling token cost per request routinely exceed their initial budget projection by 4 to 8x. The fix is simple: disable extended thinking for tasks that do not require it (classification, extraction, lookup), and model the reasoning overhead explicitly for tasks that do.
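The reasoning overhead can be modeled explicitly with a sketch like the one below. It assumes the 500-token visible answer from the example is accompanied by 2,500 reasoning tokens (one way to reach the 3,000 figure) and uses the $15 per 1M output price from the pricing table as a placeholder. The function names are hypothetical.

```python
OUTPUT_PRICE_PER_1M = 15.00  # assumption: Claude Sonnet output price from the table


def billed_output_cost(visible_tokens: int, reasoning_tokens: int,
                       output_price_per_1m: float = OUTPUT_PRICE_PER_1M) -> float:
    """Reasoning tokens are billed at the output rate, so they are simply added."""
    return (visible_tokens + reasoning_tokens) * output_price_per_1m / 1_000_000


def reasoning_multiplier(visible_tokens: int, reasoning_tokens: int) -> float:
    """Billed output tokens divided by the tokens the user actually sees."""
    return (visible_tokens + reasoning_tokens) / visible_tokens


without_reasoning = billed_output_cost(500, 0)
with_reasoning = billed_output_cost(500, 2_500)  # 2,500 is an assumed split

print(f"multiplier: {reasoning_multiplier(500, 2_500):.1f}x")
print(f"output cost: ${without_reasoning:.4f} -> ${with_reasoning:.4f}")
```

In practice the reasoning token count should come from the usage data the provider returns with each response, not from a fixed guess.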
The hidden cost traps in 2026
Two traps that did not exist at scale two years ago:
Tokenizer version drift. Anthropic documented that Claude Opus 4.7 may consume up to 35% more tokens for identical text compared to prior Claude versions. If you have a cost model built on Claude 3.x benchmarks and you upgrade to Opus 4.7, your per-request cost does not stay flat. Build a re-benchmarking step into any model upgrade.
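A re-benchmarking step can be as small as the sketch below: count tokens for the same sample prompts under the old and new model, then compare totals. The counts here are placeholder numbers chosen to illustrate the 35% upper bound; real counts would come from each provider's token-counting tooling. The function names and the 5% tolerance are assumptions.

```python
def token_drift(old_counts: list[int], new_counts: list[int]) -> float:
    """Ratio of total tokens after an upgrade to total tokens before it."""
    if len(old_counts) != len(new_counts):
        raise ValueError("counts must cover the same sample prompts")
    return sum(new_counts) / sum(old_counts)


def needs_cost_review(drift: float, tolerance: float = 0.05) -> bool:
    """Flag the upgrade when token usage moves by more than the tolerance."""
    return abs(drift - 1.0) > tolerance


# Placeholder token counts for three identical prompts, before and after.
old_counts = [1_200, 800, 2_000]
new_counts = [1_620, 1_080, 2_700]

drift = token_drift(old_counts, new_counts)
print(f"drift: {drift:.2f}x, review needed: {needs_cost_review(drift)}")
```

If per-token prices are unchanged, the drift ratio is also the multiplier to apply to the existing per-request cost model.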
Multimodal token overhead. Image token costs vary wildly by provider and resolution. Gemini charges approximately 258 tokens per standard image. Claude Opus 4.7 charges up to roughly 4,784 tokens for a high-resolution image, about 3x what prior Claude versions charged for the same image. A product where users upload photos (receipts, screenshots, product images) needs to price multimodal tokens separately from text tokens, and must decide whether to serve high-res or downsample by default.
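To see how quickly image overhead adds up, the per-image figures above can be scaled to a monthly volume. This is a sketch under stated assumptions: the 100,000 uploads per month is a hypothetical volume, the Claude figure is the quoted upper bound for a high-resolution image, and the price passed to `image_cost` is supplied by the caller because no image-specific price is given here.

```python
# Per-image token figures quoted in the paragraph above.
TOKENS_PER_IMAGE = {
    "Gemini (standard image)": 258,
    "Claude Opus 4.7 (high-res, upper bound)": 4_784,
}


def monthly_image_tokens(images_per_month: int, tokens_per_image: int) -> int:
    """Total input tokens spent on images in a month."""
    return images_per_month * tokens_per_image


def image_cost(tokens: int, input_price_per_1m: float) -> float:
    """USD cost of those tokens at a caller-supplied input price per 1M."""
    return tokens * input_price_per_1m / 1_000_000


uploads = 100_000  # hypothetical monthly upload volume
for label, per_image in TOKENS_PER_IMAGE.items():
    print(f"{label}: {monthly_image_tokens(uploads, per_image):,} tokens/month")
```

Running the same calculation with a downsampled token count per image gives a quick read on what a lower default resolution would save.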
The cost formula every PM should know
Cost per request = (avg input tokens x input price per 1M / 1,000,000) + (avg output tokens x output price per 1M / 1,000,000)
Multiply by monthly request volume for total monthly spend. Run this at P50 and P90 usage: P50 is your expected cost, P90 is your cost at the tail where your heaviest users live.
At Claude Sonnet pricing ($3 input and $15 output per 1M tokens), a request with 2,000 input tokens and 500 output tokens costs $0.006 for input plus $0.0075 for output, or about $0.0135. At 1 million monthly calls, that is $13,500 per month before caching or routing. Whether that supports your price point is a PM question, not an engineering question.
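The formula and the P50/P90 comparison can be sketched as follows. The P50 profile reuses the 2,000-input, 500-output example above; the P90 profile (6,000 input, 1,500 output) is a hypothetical heavy-user request, and the class and function names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_1m: float   # USD per 1M input tokens
    output_per_1m: float  # USD per 1M output tokens


def cost_per_request(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Prices are quoted per 1M tokens, so token counts are divided by 1,000,000."""
    return (input_tokens / 1_000_000) * pricing.input_per_1m + \
           (output_tokens / 1_000_000) * pricing.output_per_1m


def monthly_spend(input_tokens: int, output_tokens: int,
                  pricing: Pricing, monthly_requests: int) -> float:
    return cost_per_request(input_tokens, output_tokens, pricing) * monthly_requests


sonnet = Pricing(input_per_1m=3.00, output_per_1m=15.00)

p50 = cost_per_request(2_000, 500, sonnet)    # example from the text
p90 = cost_per_request(6_000, 1_500, sonnet)  # hypothetical heavy request

print(f"P50 ${p50:.4f}, P90 ${p90:.4f} per request")
print(f"monthly at P50: ${monthly_spend(2_000, 500, sonnet, 1_000_000):,.0f}")
```

Keeping the token profile and the price table as separate inputs makes it easy to rerun the numbers when either the usage data or the provider's pricing changes.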
Three levers a PM can propose without requiring architecture changes:
- Prompt caching. Anthropic charges 0.1x for cached-read tokens (90% discount); OpenAI offers similar rates. Any feature with a stable system prompt or shared document should have caching enabled by default.
- Batch API. OpenAI and Anthropic both offer roughly 50% discounts for non-real-time processing. Async use cases (report generation, bulk analysis, nightly summaries) should almost always use batch endpoints.
- Model routing. Send simple requests to cheaper models (Gemini Flash at $0.30 input) and complex requests to frontier models. A two-tier routing strategy commonly cuts total spend by 40 to 60% without degrading quality on the tasks that matter.
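The effect of each lever on the $0.0135 example request can be estimated with a sketch like this one. The assumptions are loud ones: 1,500 of the 2,000 input tokens are treated as a cacheable system prompt, cache-write charges are ignored, the batch discount is taken as exactly 50%, and 60% of traffic is assumed simple enough to route to the cheaper model. The helper name and parameters are hypothetical.

```python
SONNET = (3.00, 15.00)  # (input, output) USD per 1M tokens, from the pricing table
FLASH = (0.30, 2.50)


def request_cost(input_tokens, output_tokens, prices,
                 cached_input_tokens=0, cached_read_multiplier=0.1,
                 batch_multiplier=1.0):
    """Cost in USD of one request, with optional cache and batch discounts."""
    input_price, output_price = prices
    fresh_input = input_tokens - cached_input_tokens
    # Cached reads are billed at a fraction of the normal input price.
    input_cost = (fresh_input + cached_input_tokens * cached_read_multiplier) * input_price
    output_cost = output_tokens * output_price
    return (input_cost + output_cost) * batch_multiplier / 1_000_000


baseline = request_cost(2_000, 500, SONNET)
with_caching = request_cost(2_000, 500, SONNET, cached_input_tokens=1_500)
with_batch = request_cost(2_000, 500, SONNET, batch_multiplier=0.5)

cheap_share = 0.6  # assumed share of requests routed to the cheaper model
with_routing = cheap_share * request_cost(2_000, 500, FLASH) + (1 - cheap_share) * baseline

for name, cost in [("caching", with_caching), ("batch", with_batch), ("routing", with_routing)]:
    print(f"{name}: ${cost:.5f} per request, {1 - cost / baseline:.0%} saved")
```

Under these assumptions, routing saves about half of the baseline spend, which falls inside the 40 to 60% range cited above. Caching saves less here because output tokens dominate the bill.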
One more factor for global products: Chinese text tokenizes at roughly 0.6 tokens per character versus 0.3 for English. A global product serving non-English users will see meaningfully higher token counts for the same information density. Model this before pricing a global tier.
Tokens and the viable/lovable test
In 2026, feasibility is not the constraint. The token is the place where viability becomes concrete and measurable. The question is no longer “can we build this AI feature?” but “does the token math support a unit-economic model that works?”
A feature requiring long context and extended reasoning can deliver excellent results and still fail the viability test if the cost per user action makes it impossible to price profitably or competitively. Tokens also interact with lovability: features that require many back-and-forth turns (multi-step agents, long document processing) burn tokens at each step, and if that cost compresses margins to zero, the feature cannot be maintained at a quality level users will tolerate. For the framework that connects token cost to build/skip decisions, see LLM unit economics and proving viability.
Interview answer
strong
"A token is the unit an LLM processes and bills by: roughly 4 characters or 0.75 words in English, so 1,000 words is about 1,300 tokens. The key things I keep in mind as a PM: first, input and output tokens are priced differently because of how generation works. Input is a single parallel pass; output is autoregressive, one token at a time, which is why output costs 3 to 8x more depending on the provider. For Claude Sonnet that's $3 input versus $15 output per 1M tokens. Second, reasoning tokens are billed at output rates and don't appear in the visible response: a 500-token reply with extended thinking enabled can burn 3,000+ tokens total. Third, there are production cost traps: tokenizer version upgrades can silently inflate costs (Opus 4.7 consumes up to 35% more tokens than prior versions for the same text), and multimodal overhead is enormous and variable (Claude Opus 4.7 charges up to 4,784 tokens for a high-res image versus 258 for Gemini). I use a simple formula: avg input tokens times input price plus avg output tokens times output price equals cost per call, then multiply by monthly volume at P90 usage. For any feature I'm evaluating, I benchmark token counts on real production-representative traffic before writing the spec, then propose prompt caching, batch API, or model routing as first-pass optimizations. A PM who can't estimate token cost per user action can't honestly assess whether a feature is viable."
weak
"A token is basically a word, and you pay per token." This fails immediately: it skips why output costs more than input, which is the first follow-up. It conflates the context window with billing. It ignores reasoning tokens entirely. And it gives the interviewer nothing concrete to evaluate: no numbers, no formula, no product decision it informs. It is the answer of someone who read a definition stub but has never estimated a feature's unit economics before pitching it.
What the follow-up tests
At AI-native companies (Anthropic, OpenAI, Google DeepMind), the token question is a gateway to a harder one: “Walk me through the token unit economics of an AI feature you’ve shipped or designed.” The answer they want covers four things: what a token is mechanically; why input and output are priced differently; the hidden costs (reasoning, multimodal, tokenizer drift); and the product decision it drives, with a number, a model choice, and a conclusion about whether the feature is viable. If you can name a real cost per request and show how that affects your pricing or margin, you have cleared the bar.
For context on how token budgets interact with what the model can see in one call, see context window.