ai pm · thesis

LLM unit economics: the one-pager every PM needs

Updated Jun 2026 Calibrated to the strong-hire bar

Most PM candidates can name “prompt caching” and “model routing” as cost levers. Few can walk through the dollar math live. That gap is what interviewers at AI-native companies are probing: not whether you know the vocabulary, but whether you can pick up a napkin and show whether the feature pencils out before anyone writes a line of code.

The formula you need to memorize

Cost per session = (input_tokens × $/M_input ÷ 1,000,000) + (output_tokens × $/M_output ÷ 1,000,000)

That is the whole thing. Everything else is a decision about which numbers go into that formula.

2026 model prices (the ones you will name in interviews)

ModelInput $/MOutput $/MPosition
Claude Sonnet 4.6$3.00$15.00Balanced mid-tier
Claude Opus 4.7$5.00$25.00Frontier reasoning
GPT-5.5$5.00$30.00OpenAI frontier
Gemini 3.5 Flash$1.50$9.00Budget mid-tier
DeepSeek V4 Pro$0.44$0.87Budget tier

The spread from cheapest usable ($0.44/M input) to most capable ($30/M output) is roughly 100x. That spread is the core tension you are managing as a PM. Output tokens cost 2 to 5x more per token than input tokens across every model. Controlling response length is a cost lever you can mandate directly in a PRD.

The two worked examples: $4.20 vs $0.04

The expensive case. A coding agent making 200 API calls per session on Claude Opus 4.7 ($5/M input, $25/M output), each call with 2,000 input tokens and 800 output tokens, costs: (200 × 2,000 × $5 ÷ 1,000,000) + (200 × 800 × $25 ÷ 1,000,000) = $2.00 + $4.00 = $6.00 per session. Add context window accumulation across a long session and you are at $7 or more before you have checked the bill. At 20 sessions per month, that is $140 per user per month in LLM costs alone. The product has to price like enterprise software or the margin math never works.

The cheap case. A support FAQ bot routing 80% of queries to DeepSeek V4 Pro, with a 2,000-token system prompt that gets cached on every call and a 300-token user question producing a 400-token answer: input = (2,000 cached tokens × $0.044/M) + (300 × $0.44/M) = $0.000088 + $0.000132 = $0.00022; output = 400 × $0.87/M = $0.000348. Total: roughly $0.0006 per query, or $0.03 to $0.05 per session at typical session lengths. The remaining 20% of queries routed to a capable model for hard cases costs a little more, but the blended rate stays well under $0.10.

That $4.20 vs $0.04 contrast (the rough practical range) is the number that lands in interviews. An interviewer asks “what does this cost to run?” and the strong candidate gives a number, not a shrug.

Prompt caching: a product architecture decision, not an engineering optimization

Prompt caching lets you reuse tokens that appear identically at the start of every API call (system prompts, few-shot examples, background documents). Anthropic offers up to 90% discount on cached input tokens. In practice, apps with repetitive system prompts see 45 to 80% reduction on the input bill.

The PM decision is: what is your expected cache hit rate, and does your architecture earn it?

  • Support and FAQ agents: 40 to 60% cache hit rate is realistic. The system prompt is large and stable, the user questions are short, and most queries follow recognizable patterns. Caching pays well here.
  • Open-ended conversational products: 10 to 30% cache hit rate. The system prompt may be stable, but the conversation history diverges constantly, which means the cacheable prefix is short and the dynamic suffix is long. Caching helps less.

When an interviewer asks you to explain caching to a non-technical exec, the PM version is: “We store the expensive part of every request so we only pay for it once. For a product where the setup is always the same and the questions vary, we cut the inference bill by 50% without changing what users see.”

When caching backfires: if your content is highly dynamic (personalized context, real-time data in the prompt), cache hit rates drop and you pay the cache write overhead with little return. Name this tradeoff. Candidates who only cite the upside fail the nuance probe.

Model routing: a quality-tier decision, not a DevOps rule

Model routing means deciding at runtime which model handles a given request based on the task’s required reasoning level. A cascade that routes 80% of requests to a cheap model and 20% to a frontier model cuts costs 60 to 70% with no visible quality degradation, because most requests do not need frontier reasoning.

The PM framing is: what is the minimum quality bar this task requires?

A support query asking for a return policy does not need Claude Opus. A legal document review that a lawyer will sign off on does. Routing accuracy needs to hit at least 90%: if 15% of hard tasks get sent to the cheap model, you are creating silent quality failures that are harder to catch than cost overruns.

The routing rule does not live in infrastructure. It lives in the product spec: what task types exist in this product, and which quality tier does each require? That is a PM question.

The viability test: when does this pencil out?

Cost per session must be recoverable from ARPU × session frequency.

At $4.20 per session and 20 sessions per month, LLM cost alone is $84 per user per month. That is only recoverable if you price as enterprise software. At $0.04 per session and 20 sessions per month, cost is $0.80 per user per month, which means a free tier is fundable and a $10 consumer subscription runs at a healthy gross margin.

The benchmark for a $10,000/month AI product: model inference is roughly 65% ($6,500) of total AI infrastructure cost, with embeddings at 15%, vector DB at 10%, and storage at 5%. These proportions shift as you scale, but the inference dominance holds. That is why the session cost calculation is the primary viability lever, not secondary.

Self-hosting changes the math at scale: GPU infrastructure becomes cheaper than per-token API costs at approximately 50 million tokens per month. Below that threshold, the API is cheaper. A PM who names this breakeven shows the interviewer they understand the build vs buy question inside the unit economics, not just as an abstract principle.

What separates strong from weak in the interview

strong

"Let me walk through the math. At our expected session volume, with [X] input tokens and [Y] output tokens per session on [model], session cost is approximately [Z]. At [ARPU] and [frequency] sessions per month, LLM cost is [N]% of what the user pays, which gives us [margin]% gross margin at scale. To get that down, I'd look at two levers first: routing the [X]% of queries that don't need frontier reasoning to DeepSeek or Gemini Flash, which cuts costs 60 to 70% on that slice, and caching the system prompt since it's [stable/large]. Cache hit rate for this use case should run 40 to 60%. Together those get us to a session cost where a free tier is fundable and a $20/month plan runs above 60% gross margin."

weak

"We'd use prompt caching and model routing to keep costs low. GPT-4 is expensive so we'd use a cheaper model where we can." This names the concepts without the math. The interviewer cannot tell whether you have done the calculation or are pattern-matching to vocabulary you heard. There is no number, no tradeoff, no viability test. It reads as a candidate who listed cost optimization strategies from a blog post.

The 2026 frame

Feasibility is nearly free. Any team can ship an AI feature in a week using current tooling and API access. The hard PM work has shifted entirely to viability: can the margin math work at the session cost your architecture produces? LLM unit economics is no longer something you defer to engineering. It is the primary product design constraint. An architecture that lands at $0.04 per session instead of $4.20 is what makes a free tier possible, keeps you out of the inference cost trap when power users show up, and lets you iterate without a CFO veto on the next model call.

For the broader viability argument, see proving viability. For where feasibility-is-free leaves off and the PM job begins, see feasibility is free. For the architectural choice that often precedes the routing decision, see RAG vs fine-tuning vs prompting.