How to answer a cost-per-query question in an AI PM interview
Cost-per-query is the new “does this scale?” question, and interviewers at OpenAI, Anthropic, Google DeepMind, and AI-native startups are now filtering on whether you can do the arithmetic live. A vague answer (“we’d use a cheaper model”) signals that you have used LLMs but have never shipped one at scale. The strong answer opens with a formula, anchors it to a worked example with real numbers, names the reasoning-token trap, and closes with the viability gate: the cost threshold where the feature stops being margin-positive.
The formula and what drives it
CPQ = (input_tokens × input_rate + output_tokens × output_rate) / 1,000,000
Output tokens cost several times more than input tokens: 4 to 5 times for most models in the price table in this piece, and more than 8 times for Gemini 2.5 Flash. The reason is mechanical: output tokens are generated one at a time, each requiring its own forward pass through the model, while input tokens are processed in parallel during the prefill stage. At GPT-4o pricing (June 2026: $2.50/M input, $10.00/M output), a 500-token prompt returning a 200-token answer costs (500 × 2.50 + 200 × 10.00) / 1,000,000 = $0.00325. That sounds cheap until you multiply by volume.
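The formula translates directly into a few lines of Python. This is a minimal sketch: the function name `cost_per_query` is illustrative, and the rates are the GPT-4o figures quoted in the text, expressed in dollars per million tokens.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   input_rate: float, output_rate: float) -> float:
    """Dollar cost of one request; rates are in dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# GPT-4o rates quoted in the text: $2.50/M input, $10.00/M output.
single_call = cost_per_query(500, 200, input_rate=2.50, output_rate=10.00)
print(f"${single_call:.5f} per query")  # $0.00325 per query
print(f"${single_call * 1_000_000:,.0f} per million queries")
```

Keeping the rates in dollars per million tokens matches how providers publish prices, so the only unit conversion is the final division.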
Current model prices you should have memorized
| Model | Input ($/M) | Output ($/M) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| Claude Haiku | $0.25 | $1.25 |
| Gemini 2.5 Flash | $0.30 | $2.50 |
| Gemini 2.0 Flash | $0.10 | $0.40 |
Gemini 2.0 Flash is the current budget champion. Claude Sonnet 4 has the steepest output rate, which makes the reasoning-token trap especially dangerous there.
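To check the two claims against the table rather than memory, the prices can be loaded into a dictionary and ranked. This is a sketch using only the figures listed in the table; the names `PRICES` and `output_multiple` are illustrative.

```python
# Prices copied from the table in this piece, in dollars per million tokens:
# (input_rate, output_rate).
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o-mini": (0.15, 0.60),
    "Claude Sonnet 4": (3.00, 15.00),
    "Claude Haiku": (0.25, 1.25),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Gemini 2.0 Flash": (0.10, 0.40),
}


def output_multiple(model: str) -> float:
    """How many times more an output token costs than an input token."""
    input_rate, output_rate = PRICES[model]
    return output_rate / input_rate


# Rank models from cheapest to most expensive output rate.
for model in sorted(PRICES, key=lambda m: PRICES[m][1]):
    out_rate = PRICES[model][1]
    print(f"{model:<18} output ${out_rate:>5.2f}/M  {output_multiple(model):.1f}x input")
```

The first row printed is Gemini 2.0 Flash and the last is Claude Sonnet 4, which is the ordering the paragraph above relies on.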
The worked example: support tickets at scale
Assume a support agent handling 2 million tickets per month, 5 turns per ticket, 500 input tokens and 200 output tokens per turn. That is 3,500 tokens per ticket (2,500 input, 1,000 output).
At GPT-4o-mini ($0.15/$0.60 per M): cost per ticket = (2,500 × 0.15 + 1,000 × 0.60) / 1,000,000 = $0.000375 + $0.0006 = approximately $0.001 per ticket, or about $2,000 per month total. Against a baseline human cost of $1M per month, the unit economics are clear.
At GPT-4o ($2.50/$10.00 per M): cost per ticket = (2,500 × 2.50 + 1,000 × 10.00) / 1,000,000 = $0.00625 + $0.01 = $0.01625 per ticket, or $32,500 per month. Still viable, but now 3.3% of the savings are going back to inference.
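At Claude Sonnet 4 ($3.00/$15.00 per M) with thinking mode enabled: visible output is 200 tokens per turn, but the internal chain-of-thought can run 600 to 10,000 tokens. A “simple” support ticket billed for a 5,000-token reasoning trace costs roughly (2,500 × 3.00 + 5,000 × 15.00) / 1,000,000 = $0.0825 per ticket, or $165,000 per month, before counting the 1,000 visible output tokens. That is about five times the GPT-4o bill and 16.5% of the $1M human baseline, and it assumes only 5,000 reasoning tokens across the whole ticket.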
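The three scenarios can be reproduced in one short script. This is a sketch under the text's own simplifications: every turn is a flat 500 input and 200 output tokens (no growing conversation history), and the reasoning scenario bills 5,000 output tokens per ticket. The function and variable names are illustrative.

```python
TICKETS_PER_MONTH = 2_000_000
TURNS_PER_TICKET = 5
INPUT_PER_TURN, OUTPUT_PER_TURN = 500, 200


def cost_per_ticket(input_rate, output_rate, billed_output_tokens=None):
    """Cost of one ticket in dollars; rates are in dollars per million tokens."""
    input_tokens = TURNS_PER_TICKET * INPUT_PER_TURN          # 2,500
    if billed_output_tokens is None:
        billed_output_tokens = TURNS_PER_TICKET * OUTPUT_PER_TURN  # 1,000
    return (input_tokens * input_rate
            + billed_output_tokens * output_rate) / 1_000_000


scenarios = {
    "GPT-4o-mini": cost_per_ticket(0.15, 0.60),
    "GPT-4o": cost_per_ticket(2.50, 10.00),
    # Mirrors the text's simplification: 5,000 billed output tokens per ticket.
    "Claude Sonnet 4 + reasoning": cost_per_ticket(3.00, 15.00,
                                                   billed_output_tokens=5_000),
}

for name, per_ticket in scenarios.items():
    monthly = per_ticket * TICKETS_PER_MONTH
    print(f"{name:<28} ${per_ticket:.6f}/ticket  ${monthly:>10,.0f}/month")
```

The GPT-4o-mini row comes out at $1,950 per month, which the text rounds to about $2,000; the other two rows match $32,500 and $165,000.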
The reasoning-token trap
This is the most common blind spot in candidate answers. Reasoning models (o1, o3, Claude thinking mode) bill for internal chain-of-thought tokens that never appear in the visible output. A 200-token answer can consume 600 to 10,000 internal tokens. The visible output token count is not the cost. If you are evaluating a reasoning model, you need to instrument your evals to capture total tokens billed, not just the response length.
At Sonnet 4’s $15/M output rate, a single turn that hides 10,000 reasoning tokens costs $0.15 in output tokens alone, before adding the input. A 5-turn session at that rate runs $0.75 or more. Compare that to a rough human agent cost of $0.50 per ticket: the AI feature is now more expensive than the labor it was meant to replace. That is not a technical failure; it is a viability failure.
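A small sweep makes the trap concrete: hold the visible reply at 200 tokens and vary only the hidden reasoning tokens per turn. This is a sketch using the Sonnet 4 rates and the $0.50 human cost from the text; the assumption that reasoning tokens are billed at the output rate follows the paragraph above, and the function names are illustrative.

```python
def billed_output_tokens(visible_tokens: int, reasoning_tokens: int) -> int:
    """Billed output is visible reply plus hidden reasoning tokens."""
    return visible_tokens + reasoning_tokens


def turn_cost(input_tokens, visible_tokens, reasoning_tokens,
              input_rate, output_rate):
    """Dollar cost of one turn; rates are in dollars per million tokens."""
    output_tokens = billed_output_tokens(visible_tokens, reasoning_tokens)
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


HUMAN_COST_PER_TICKET = 0.50  # rough figure from the text
TURNS = 5

for reasoning in (0, 600, 5_000, 10_000):
    per_ticket = TURNS * turn_cost(500, 200, reasoning,
                                   input_rate=3.00, output_rate=15.00)
    verdict = "over" if per_ticket > HUMAN_COST_PER_TICKET else "under"
    print(f"{reasoning:>6} reasoning tokens/turn -> "
          f"${per_ticket:.4f}/ticket ({verdict} human cost)")
```

Only the 10,000-token case crosses the $0.50 line, at $0.7725 per ticket, which is why measuring billed tokens in evals matters more than measuring response length.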
The mitigations: tiered routing, caching, and batching
A strong answer proposes three concrete mitigations.
Tiered routing. Classify queries by complexity before routing. A cheap classifier (GPT-4o-mini or a fine-tuned small model) adds roughly $0.0001 per query and can route 70% of traffic to budget models, 20% to mid-tier, and 10% to frontier. That split cuts average CPQ by 60 to 80% compared to sending every query to the frontier model.
Prompt caching. Anthropic charges 10% of the normal input rate on cache hits; OpenAI offers up to 90% savings on repeated prefixes. System prompts and shared context blocks are the obvious targets. A 500-token system prompt sent 2 million times per month is 1 billion input tokens, or $2,500 at GPT-4o’s $2.50/M input rate; a 90% cache discount saves roughly $2,250 of that per month.
Batching. Non-urgent queries (async summaries, nightly digests, document classification) can be batched for off-peak processing. Several providers offer 50% discounts on batch jobs with latency tolerances of minutes to hours.
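The three mitigations can be sanity-checked with the same arithmetic. This is a sketch under stated assumptions: the text gives the 70/20/10 split but not which model serves each tier, so the tier assignment below (GPT-4o-mini, GPT-4o, Claude Sonnet 4) is hypothetical, and the caching figure assumes every send is a cache hit at a 90% discount.

```python
def turn_cost(input_rate, output_rate, input_tokens=500, output_tokens=200):
    """Dollar cost of one turn; rates are in dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical tier assignment: (traffic share, cost per turn).
TIERS = [
    (0.70, turn_cost(0.15, 0.60)),    # budget: GPT-4o-mini
    (0.20, turn_cost(2.50, 10.00)),   # mid-tier: GPT-4o
    (0.10, turn_cost(3.00, 15.00)),   # frontier: Claude Sonnet 4
]
CLASSIFIER_COST = 0.0001              # per query, figure from the text

# Tiered routing: blended cost versus sending everything to the frontier model.
frontier_only = turn_cost(3.00, 15.00)
routed = CLASSIFIER_COST + sum(share * cost for share, cost in TIERS)
routing_savings = 1 - routed / frontier_only

# Prompt caching: 500-token system prompt, 2M sends, GPT-4o input rate.
uncached = 500 * 2_000_000 * 2.50 / 1_000_000
cache_savings = uncached * 0.90       # assumes every send is a cache hit

# Batching: 50% discount on an otherwise identical GPT-4o call.
batched = turn_cost(2.50, 10.00) * 0.50

print(f"routing cuts average CPQ by {routing_savings:.0%}")
print(f"caching saves ${cache_savings:,.0f} per month")
print(f"batched GPT-4o turn costs ${batched:.6f}")
```

With this tier assignment the routed blend lands at roughly a 70% reduction, inside the 60 to 80% range quoted above; a different choice of mid-tier or frontier model would shift the figure.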
The viability gate
Connecting CPQ to business viability is what separates a PM answer from an engineering answer. A useful rule of thumb: CPQ should be below 20% of the value the query delivers. If an AI support reply saves $0.50 in human cost, CPQ should be under $0.10. If it is a legal research query worth $50 in billable hours, you have $10 of headroom.
If CPQ is above that threshold, the feature is not viable at current scale, even if users love it. Feasibility is cheap; inference is not free. The PM’s job is to run this arithmetic before the CEO does, then design the routing and caching strategy that brings CPQ to a viable number.
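The gate itself is a one-line comparison, sketched here with the 20% rule of thumb and the two examples from the text. The names `max_viable_cpq` and `is_viable` are illustrative, and the 20% ceiling is the article's heuristic rather than a fixed standard.

```python
VALUE_SHARE_CEILING = 0.20  # rule of thumb from the text


def max_viable_cpq(value_per_query: float) -> float:
    """Highest cost per query that stays under the value-share ceiling."""
    return VALUE_SHARE_CEILING * value_per_query


def is_viable(cpq: float, value_per_query: float) -> bool:
    """True when the query costs less than the ceiling allows."""
    return cpq < max_viable_cpq(value_per_query)


print(f"support reply ceiling:  ${max_viable_cpq(0.50):.2f}")   # $0.10
print(f"legal research ceiling: ${max_viable_cpq(50.00):.2f}")  # $10.00

# GPT-4o ticket from the worked example versus a $0.75 reasoning-heavy session.
print(is_viable(0.01625, value_per_query=0.50))  # True
print(is_viable(0.75, value_per_query=0.50))     # False
```

Expressing the ceiling as a share of value, not an absolute dollar figure, is what lets the same check cover a $0.50 support reply and a $50 legal query.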
Strong answer
"I start with the formula: CPQ equals input tokens times input rate plus output tokens times output rate, divided by a million. Let me work a real example. For a support agent handling 2M tickets per month at 5 turns, 500 input and 200 output tokens per turn, that is 3,500 tokens per ticket. At GPT-4o-mini ($0.15/$0.60 per M), CPQ is about $0.001 per ticket, or $2,000 per month total, versus $1M in human cost: viable. At GPT-4o ($2.50/$10 per M), it is $0.016 per ticket or $32,500 per month: still viable, but watch the margin. The trap I flag immediately: if we use a reasoning model, visible output is not total output. A 200-token visible reply with 5,000 internal reasoning tokens at Sonnet 4 pricing runs $0.08 per ticket, or $165,000 per month. That is about five times the GPT-4o bill and close to my viability ceiling, and at 10,000 reasoning tokens per turn the ticket costs more than the human agent it replaces. My mitigation: route 70% of traffic to Gemini 2.0 Flash or GPT-4o-mini with a cheap classifier, cache system prompts (Anthropic bills cache hits at 10% of input rate, OpenAI at up to 90% off), and batch async queries. That combination cuts average CPQ by 60 to 80%. My viability rule of thumb: CPQ should be under 20% of the value the query delivers. For this support ticket, target under $0.10. If we cannot get there with routing and caching, we revisit the feature scope."
Weak answer
"We would use a smaller, cheaper model to keep costs down, and maybe cache some responses." No formula, no numbers, no mention of input versus output token asymmetry, no reasoning-token trap, no connection to margin or viability. This answer signals the candidate has used LLMs but has not shipped one at scale. Interviewers at frontier labs have said explicitly they are filtering for candidates who can do this math live. Vague cost awareness is the new "I'd A/B test it."
Why this question is in every frontier loop now
In 2026, feasibility is not the gate. Spinning up inference takes an afternoon. The real question is whether CPQ makes the feature viable (margin-positive at the scale you are projecting) and lovable (fast and reliable enough that users trust it, not just tolerate it). A $4+ session cost for an AI support agent is a viability failure because a human agent at a BPO costs roughly $0.50 per ticket. The AI is not cheaper. Cost-per-query is where product strategy and inference economics meet, and every serious AI PM interview now tests whether you can hold both at once.
For the broader unit economics context, see LLM unit economics. For the framework that situates CPQ inside the viability gate, see proving viability and feasibility is free.