ai · hard

How do you measure success for a new AI product?

How would you define goals and measure success for a new AI product?

Updated Jun 2026 · Calibrated to the strong-hire bar

This question tests whether you understand what makes AI metric design different from standard product metric design. The non-determinism problem, the Goodhart trap, and the offline-to-online transition are all in scope. A generic answer (DAU, retention, NPS) fails even with clean structure, because interviewers at frontier labs weight AI-specific fluency at roughly 25% of the score.

What the question is really asking

Before you name a single metric, state the product’s job-to-be-done and identify whose success you are measuring: the end user, the business, and the model. Most candidates skip this and jump straight to a metric list. OpenAI interviewers specifically probe whether you can articulate what “success” means before quantifying it.

The two-axis goal frame keeps you honest. Axis one: is the problem being solved (user and business outcomes)? Axis two: is the model doing its job (output quality, reliability, safety)? A complete answer covers both.

Structure a strong answer

strong

"This is an AI writing assistant embedded in a sales CRM. The user job is drafting accurate, on-brand outbound emails faster than they could manually. The business goal is reducing time-to-send and increasing rep productivity, which drives pipeline. The model's job is generating drafts that require minimal editing.

My north star is draft acceptance rate: the share of AI-generated drafts a rep sends without substantive edits. It sits at the intersection of user trust and model quality. If reps are editing heavily, the model is failing them. If they're not editing at all, I'd cross-check against reply rates to rule out rubber-stamping.

I'd layer in three categories of supporting metrics. User outcome: time-to-send (did AI compress the workflow?), adoption rate (what share of reps use AI drafts weekly?), and return rate after first use. Model quality: edit distance between AI draft and sent email, task completion rate on structured fields like salutation and CTA (target above 90%), and hallucination rate on factual claims such as contact names and company details. System health: P50/P95 latency to first draft, and a weekly eval suite on held-out prompts to catch model drift before users do.

Because LLM outputs are stochastic, I wouldn't track accuracy on a single run. I'd generate k drafts for the same input and measure consistency: the share of inputs where every one of the k generations is acceptable, sometimes written pass^k. That is stricter than standard Pass@k, which only asks whether at least one of k samples passes and so rewards peak performance. A system that scores 60% on one run but 25% when it has to succeed on all eight runs is unreliable, regardless of peak accuracy. That's my answer to the non-determinism problem.

The counter-metric matters as much as the north star. Optimizing draft acceptance rate in isolation is a Goodhart trap: reps may start rubber-stamping bad drafts, so the metric improves while actual output quality degrades. So I'd pair it with downstream signals (reply rate and meetings booked), comparing AI-assisted emails against manual ones in 30-day cohorts.

Before launch, I'd run an offline eval on 500 held-out email samples rated by senior reps on accuracy, tone, and brand compliance. That eval suite becomes my regression test in CI: any model update must match or exceed the baseline on those dimensions before it ships. That's the offline-to-online bridge."
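The regression gate described at the end of the strong answer can be sketched in a few lines. This is a minimal Python illustration, not a prescribed harness: the 1-5 rating scale, the `tolerance` parameter, the function names, and the toy data are all assumptions made for the example. Only the three rubric dimensions come from the answer itself.

```python
from statistics import mean

# Rubric dimensions named in the answer above.
DIMENSIONS = ("accuracy", "tone", "brand_compliance")


def dimension_means(ratings):
    """Average each rubric dimension over a list of per-sample rating dicts."""
    return {d: mean(r[d] for r in ratings) for d in DIMENSIONS}


def regressions(baseline, candidate, tolerance=0.0):
    """Return dimensions where the candidate mean falls below baseline - tolerance.

    The result maps each regressed dimension to (baseline_mean, candidate_mean).
    """
    base = dimension_means(baseline)
    cand = dimension_means(candidate)
    return {
        d: (base[d], cand[d])
        for d in DIMENSIONS
        if cand[d] < base[d] - tolerance
    }


# Toy ratings standing in for the 500 rated samples (1-5 scale is an assumption).
baseline = [
    {"accuracy": 4, "tone": 4, "brand_compliance": 5},
    {"accuracy": 5, "tone": 3, "brand_compliance": 4},
]
candidate = [
    {"accuracy": 5, "tone": 4, "brand_compliance": 4},
    {"accuracy": 5, "tone": 4, "brand_compliance": 4},
]

failed = regressions(baseline, candidate)
# In CI, a non-empty result would fail the build (for example via sys.exit(1)).
print("blocked" if failed else "ok to ship", failed)
```

In this toy data the candidate improves accuracy and tone but slips on brand compliance, so the gate blocks it. A real gate would likely need a tolerance or a significance test, since means over human ratings are noisy.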

weak

"I'd track DAU, retention, and NPS, plus some model metrics like precision and recall." This is the answer for any product. It signals the candidate hasn't engaged with what makes AI products different: stochastic outputs, metric gaming, model drift, and the gap between "users engaged" and "users accomplished the job." Naming thumbs-up feedback as the primary quality signal is the specific tell that someone hasn't shipped an AI feature in production.

AI-specific metrics worth naming

These are the signals that separate candidates who have shipped from those who have read about it.

  • Draft acceptance rate / task completion rate. For structured agentic tasks, well-implemented agents achieve 85-95% autonomous completion. Name a target, not just the metric.
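  • Pass@k and its consistency variants (pass^k, sometimes called Reliable@k). Standard Pass@k asks whether at least one of k samples succeeds, which measures peak performance. The consistency variants ask whether all k runs succeed, whether those are repeated generations for one input or k variants of a prompt. The consistency number is the direct answer to non-determinism.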
  • Edit distance. For generation tasks, how much do users change the output before using it? Lower is better, but zero warrants investigation.
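  • Hallucination rate. Calibrated benchmarks matter here: reported figures put GPT-4o around 1.5%, Claude 3.5 Sonnet around 4.6%, and Llama-3.1-405B around 3.9%, though rates vary with the benchmark and task, so name the benchmark alongside the number. Candidates who cite calibrated numbers signal real fluency.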
  • Model drift indicators. AI products require ongoing monitoring of metric stability over time, not just launch-time measurement. A weekly eval suite on held-out inputs catches regressions before users do.
  • Trust calibration. Do users rely on AI output when they should, and push back when they shouldn’t? High acceptance on low-quality outputs is a trust miscalibration problem, not a success signal.
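To make the peak-versus-consistency distinction in the Pass@k bullet concrete, here is a minimal Python sketch. It assumes you generated n drafts for one input and c of them were judged acceptable. `pass_at_k` is the usual combinatorial estimator for "at least one of k sampled drafts is acceptable". `pass_all_k` is a hypothetical helper name for the stricter "all k sampled drafts are acceptable" reading. The numbers are illustrative, not measurements.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k drafts, drawn without replacement
    from n generations of which c are acceptable, is acceptable."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_all_k(n, c, k):
    """Probability that all k drawn drafts are acceptable (consistency)."""
    return comb(c, k) / comb(n, k)


# Illustrative case: 10 generations for one prompt, 6 judged acceptable.
n, c = 10, 6
for k in (1, 3):
    print(k, round(pass_at_k(n, c, k), 3), round(pass_all_k(n, c, k), 3))
# k=1: both are 0.6; k=3: 0.967 (at least one) vs 0.167 (all three)
```

The two numbers coincide at k=1 and diverge quickly as k grows. A model that looks strong on the at-least-one number can therefore still be unreliable for a user who only ever sees one draft.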

The 2026 bar

Feasibility is no longer the constraint. Any AI feature can be built in weeks. What’s scarce is viability (does the value delivered justify the cost-per-query at the margin?) and genuine lovability (does the AI meet users where they work, anticipate their needs without being obnoxious, and earn enough trust that they rely on it rather than just try it once?).

The metric question is really asking: can you distinguish between an AI product that demos well and one that users trust enough to change their behavior around? A strong answer includes at least one metric that would catch “impressive but useless”: high thumbs-up with low actual task completion, or high trial with fast churn. That’s what OpenAI and Anthropic interviewers are probing for.
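One way to operationalize the "impressive but useless" check is to compare a sentiment signal against an outcome signal per user segment. The sketch below is a rough Python illustration under stated assumptions: the `SegmentStats` fields, the segment names, and both thresholds are hypothetical, and real cutoffs would have to come from your own baselines.

```python
from dataclasses import dataclass


@dataclass
class SegmentStats:
    name: str
    thumbs_up_rate: float        # share of rated outputs marked helpful
    task_completion_rate: float  # share of sessions where the job got done


def impressive_but_useless(segments, min_thumbs=0.8, max_completion=0.5):
    """Return names of segments with high thumbs-up but low task completion.

    Both thresholds are placeholder assumptions, not recommended values.
    """
    return [
        s.name
        for s in segments
        if s.thumbs_up_rate >= min_thumbs
        and s.task_completion_rate <= max_completion
    ]


# Made-up segments for illustration only.
segments = [
    SegmentStats("enterprise_reps", thumbs_up_rate=0.85, task_completion_rate=0.78),
    SegmentStats("trial_users", thumbs_up_rate=0.90, task_completion_rate=0.35),
    SegmentStats("smb_reps", thumbs_up_rate=0.60, task_completion_rate=0.40),
]

flagged = impressive_but_useless(segments)
print(flagged)
```

Here only the made-up `trial_users` segment is flagged: users say they like the output but rarely finish the job, which is the pattern the paragraph above warns about.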

One specific probe reported from OpenAI’s Round 3: “What if instrumentation went down?” The right answer names proxy data sources and user feedback loops you’d use to reason about the product’s health without telemetry. It’s a test of whether you think in terms of living systems, not static launches.

For more on the eval side of this question, see building an eval harness and how to prove viability for an AI product.