North Star Metric: define one in an interview
Best for: Analytical rounds, "define success" questions, and product sense interviews
A north star metric is the single number that best captures the value your product delivers to users. In interviews, the skill is not defining the term but arriving at the right metric, fast, for a product you may never have thought about before. The failure mode is naming an activity metric (MAU, sessions, signups) and dressing it up as a value metric. A strong NSM also passes a 2026-specific test: if the metric improves, are users genuinely better off and willing to keep paying, or are they just stuck?
The three-filter test
Before committing to a metric, run these three checks in order:
- Does it go up when users get more value and down when they don’t? If you can imagine the metric improving while user satisfaction falls, it’s the wrong number.
- Is it a leading indicator of revenue, not revenue itself? Revenue is lagging and can’t guide weekly team decisions. The NSM should predict it.
- Does it resist gaming? If the number can be moved without genuinely helping users, it will be. Drop it.
“Weekly active users” fails filter one. “Revenue” fails filter two. “Total push notifications sent” fails filter three. A metric that clears all three is worth defending.
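To rehearse the filters, it can help to write them down as an explicit checklist. The sketch below is one hypothetical way to do that in Python. The `Candidate` class, the `failed_filters` helper, and every True/False judgment in the list are illustrative assumptions, not measured facts about these products.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    tracks_user_value: bool       # filter 1: moves up and down with value delivered
    leads_revenue: bool           # filter 2: predicts revenue rather than being revenue
    gameable_without_value: bool  # filter 3: can be moved without helping users


def failed_filters(c: Candidate) -> List[int]:
    """Return the numbers of the filters this candidate fails (empty list = clears all three)."""
    failed = []
    if not c.tracks_user_value:
        failed.append(1)
    if not c.leads_revenue:
        failed.append(2)
    if c.gameable_without_value:
        failed.append(3)
    return failed


# The True/False values are illustrative judgment calls, not measured facts.
candidates = [
    Candidate("weekly active users", False, True, True),
    Candidate("revenue", True, False, False),
    Candidate("total push notifications sent", False, False, True),
    Candidate("nights booked", True, True, False),
]

for c in candidates:
    print(c.name, "->", failed_filters(c) or "clears all three")
```

The code adds nothing to the judgment itself. Its only use is forcing you to answer all three questions for every candidate instead of stopping at the first one that sounds good.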
How to pick one under pressure
Given a product you’ve never analyzed, a fast selection process matters more than memorized examples. Work through these steps in under 60 seconds:
- Classify the product type first. Consumer attention product (time-based value), transaction marketplace (completed exchange), or productivity tool (efficiency gained)? The type narrows the candidate metric space immediately.
- Identify the core value exchange. What does the user get? What does the business capture when the user gets it?
- Follow the user’s output, not their activity. Activity tells you they showed up. Output tells you the product worked.
- Find the threshold. Slack’s NSM was not “messages sent” but 2,000 messages sent within a team, because below that threshold teams rarely reported feeling genuine value. The threshold reveals when value tips from potential to actual.
- Apply the reverse test. If this metric improves, can you imagine the product being worse? If the answer is yes, you have a vanity metric. Discard it.
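The "find the threshold" step is the one most easily shown with data. Below is a minimal sketch, assuming you have one row per team with a message count and a retained flag. The sample data, the candidate thresholds, and the 90% retention target are all made up for illustration; this is not Slack's actual analysis, and a real cohort study would also control for team size and signup date.

```python
def retention_at_or_above(teams, threshold):
    """Share of teams with at least `threshold` messages that were retained."""
    cohort = [retained for messages, retained in teams if messages >= threshold]
    return sum(cohort) / len(cohort) if cohort else 0.0


def find_threshold(teams, candidates, target):
    """Smallest candidate threshold whose cohort retention meets the target, or None."""
    for t in sorted(candidates):
        if retention_at_or_above(teams, t) >= target:
            return t
    return None


# Hypothetical (messages_sent, retained) pairs, one per team.
teams = [
    (50, False), (150, False), (300, False), (800, False), (1200, True),
    (1900, False), (2100, True), (2500, True), (4000, True), (6000, True),
]

threshold = find_threshold(teams, candidates=[500, 1000, 2000, 5000], target=0.9)
print(threshold)  # 2000 for this made-up data
```

In an interview you will not have the data, but describing this procedure ("I'd look for the usage level where retention jumps") is what makes a threshold sound derived rather than invented.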
Company examples with the reasoning
Facebook: “Percentage of users who added 7 or more friends in the first 10 days.” The threshold was chosen because it predicted long-term retention better than any other early signal. Note: it’s not “friends added,” it’s a binary pass/fail at a threshold correlated with the behavior that drove retention.
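Because this metric is a share of a signup cohort rather than a raw count, the computation looks different from a simple sum. Here is a rough sketch under stated assumptions: the function name, the input shapes, and the choice to treat "first 10 days" as days 0 through 9 after signup are all hypothetical.

```python
from datetime import date, timedelta


def pct_hitting_threshold(signups, friend_adds, min_friends=7, window_days=10):
    """Percentage of signed-up users who added at least `min_friends` friends in the window.

    signups: {user_id: signup_date}
    friend_adds: list of (user_id, date_of_add)
    """
    counts = {user: 0 for user in signups}
    for user, day in friend_adds:
        start = signups.get(user)
        # Count only adds that fall inside the user's own window (assumed: days 0..9).
        if start is not None and start <= day < start + timedelta(days=window_days):
            counts[user] += 1
    hit = sum(1 for n in counts.values() if n >= min_friends)
    return 100.0 * hit / len(signups) if signups else 0.0


# Made-up data: "a" adds 7 friends in the window; "b" adds 6 in the window and 1 too late.
start = date(2026, 1, 1)
signups = {"a": start, "b": start}
friend_adds = [("a", start + timedelta(days=d)) for d in range(1, 8)]
friend_adds += [("b", start + timedelta(days=d)) for d in range(1, 7)]
friend_adds.append(("b", start + timedelta(days=15)))

print(pct_hitting_threshold(signups, friend_adds))  # 50.0
```

User "b" has seven friends in total but still counts as a miss, which is exactly the pass/fail behavior the paragraph above describes.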
Airbnb: Nights booked captures value on both sides of the marketplace simultaneously. A host earns, a guest stays. A metric that captured only one side would force you into two separate north stars and lose the cross-side accountability. For two-sided marketplaces, look for the unit that represents a completed transaction of value, not traffic or intent. (Compare: “host listings active” measures supply but not demand; “searches performed” measures intent but not delivery. Only nights booked closes the loop.)
Netflix: Weekly viewing hours, not monthly or total. The weekly cadence surfaces churn risk before it becomes revenue loss. Monthly would smooth over the early warning signal; total is not actionable.
Slack: 2,000 messages sent within a team. Below that threshold, teams hadn’t experienced the product’s core loop deeply enough to report real value. The number was derived from cohort analysis on retention, not invented as a round-number goal.
Spotify: Raw hours listened fails the gaming test because autoplay inflates it without reflecting genuine user preference. A tighter proxy: time spent listening to music the user explicitly saved or followed. This requires an active preference signal, predicts subscription renewal, and resists gaming because you can only move it by matching users with music they actually want.
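The gap between the two Spotify candidates is easy to see on a toy play log. This is a hedged sketch only: `listening_hours`, the track IDs, and the durations are hypothetical, and a real pipeline would also need to decide how follows of artists or playlists map to tracks.

```python
def listening_hours(plays, saved_or_followed):
    """Return (raw_hours, intentional_hours) for a list of (track_id, seconds) plays.

    intentional_hours counts only plays of tracks in the saved_or_followed set.
    """
    raw = sum(seconds for _, seconds in plays)
    intentional = sum(seconds for track, seconds in plays if track in saved_or_followed)
    return raw / 3600, intentional / 3600


# Made-up session: one saved track, two autoplayed tracks the user never chose.
plays = [("t1", 1800), ("t2", 3600), ("t3", 1800)]
saved = {"t1"}

raw, intentional = listening_hours(plays, saved)
print(raw, intentional)  # 2.0 0.5
```

Two raw hours shrink to half an hour of listening the user actually asked for, which is the inflation the gaming test is meant to catch.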
The AI product angle
In 2026, candidates are regularly asked to define a north star for a copilot, coding assistant, or autonomous agent. The consistent trap is picking a metric that measures the AI’s activity rather than the user’s output.
“Suggestions shown” measures the system. “Tokens generated” measures cost. “Tasks completed by the agent” measures the agent. None of these answers the question: did the user get something shipped that they were satisfied with?
For AI products, anchor the NSM downstream of the AI on the user’s actual output:
- Coding assistant: accepted lines of code per active developer per week. It goes up only if suggestions are good enough to use, and falls the moment quality degrades.
- AI writing tool: documents published that include at least one accepted AI suggestion. Tracks value delivered, not suggestions offered.
- Agent product: tasks the user accepted and shipped without re-doing them. The re-do rate is the failure signal.
This reframe matters because in 2026, alternatives are fast to build and easy to switch to. A north star anchored on AI activity hides whether users found the product genuinely irreplaceable or were just along for the ride. The viable/lovable question applied to measurement: if the metric goes up, are users returning because the product is worth it, or because switching is slightly inconvenient?
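The coding-assistant metric above has two definitions hiding inside it: what counts as a week and who counts as active. The sketch below makes one possible set of choices explicit. It assumes ISO calendar weeks and treats any developer with at least one logged event that week as active, even if they accepted zero lines; the event format and names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def accepted_lines_per_active_dev(events):
    """Average accepted lines per active developer, keyed by (ISO year, ISO week).

    events: list of (developer_id, day, accepted_lines)
    """
    lines = defaultdict(int)
    devs = defaultdict(set)
    for dev, day, accepted in events:
        iso = day.isocalendar()
        week = (iso[0], iso[1])
        lines[week] += accepted
        devs[week].add(dev)  # assumption: any event makes the developer "active"
    return {week: lines[week] / len(devs[week]) for week in lines}


# Made-up events: two developers in the first week, one in the next.
events = [
    ("ana", date(2026, 1, 5), 40),
    ("ana", date(2026, 1, 7), 20),
    ("ben", date(2026, 1, 6), 0),
    ("ana", date(2026, 1, 12), 30),
]

for week, value in sorted(accepted_lines_per_active_dev(events).items()):
    print(week, value)
```

Keeping "ben" in the denominator matters: a developer who saw suggestions and accepted none should pull the metric down, not vanish from it.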
NSM vs. OMTM vs. guardrails
The OMTM (One Metric That Matters) is a tactical focus metric for a specific growth phase (e.g., improve activation rate this quarter). The NSM is strategic and persistent, tied to the product’s core value loop, not a current bottleneck. Interviewers may probe this distinction explicitly; the short answer is that the OMTM changes as the bottleneck moves, while the NSM should not.
Guardrail metrics prevent NSM optimization from creating silent product harm. If you are tracking nights booked, a guardrail is host cancellation rate: bookings that hosts later cancel inflate the number temporarily, then cause guest churn. If you are tracking accepted lines of code for a coding assistant, guardrails are latency (time to first suggestion) and bug rate on accepted lines.
Always name one or two guardrails alongside the NSM. It’s the maturity signal interviewers are looking for, and it shows you’ve already thought about how the metric gets gamed.
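One way to make the NSM-plus-guardrails pairing concrete is a check that refuses to call a change a win unless the guardrails hold. A minimal sketch, assuming each guardrail is a "lower is better" rate with a maximum allowed value; the function name, the numbers, and the limits are hypothetical.

```python
def evaluate_change(nsm_before, nsm_after, guardrails):
    """Return (nsm_improved, breached_guardrail_names).

    guardrails: {name: (value_after_change, max_allowed)}
    """
    breached = sorted(
        name for name, (value, limit) in guardrails.items() if value > limit
    )
    return nsm_after > nsm_before, breached


# Made-up launch readout: nights booked rose, but host cancellations crossed their limit.
improved, breached = evaluate_change(
    nsm_before=1000,
    nsm_after=1100,
    guardrails={
        "host_cancellation_rate": (0.08, 0.05),
        "guest_complaint_rate": (0.01, 0.02),
    },
)
print(improved, breached)  # True ['host_cancellation_rate']
```

An improved NSM with a breached guardrail is the case to talk through in the interview: the number went up, and you can still explain why you would not ship.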
Strong answers
"For Spotify I'd pick time spent listening to music the user explicitly saved or followed, not raw listening hours. Raw hours are inflated by autoplay on content the user never chose. Saved and followed listening requires an active preference signal, predicts subscription renewal, and resists gaming because you can only move it by connecting users with music they actually want. My guardrail would be skip rate on recommended content, so we don't juice the NSM by flooding saves with algorithmic noise."
"For a coding copilot I'd pick accepted lines of code per active developer per week. It sits downstream of the AI on the user's actual output: it goes up only if suggestions are good enough to ship, and it falls the moment quality drops. Guardrails: time-to-first-suggestion for latency and bug rate on accepted lines to keep quality honest."
Weak answer
"I'd use monthly active users as the north star." MAU and sessions are inputs, not outputs of user value. They can rise while satisfaction falls, because users log in to fix broken things or find workarounds. This answer tells the interviewer you haven't separated vanity metrics from value metrics, and that you'd optimize a team toward a number that doesn't predict revenue, retention, or real product health. A close second: picking revenue directly, which confuses what users pay with what they value, and is a lagging indicator that can't inform weekly decisions.
Revenue as NSM: why it always fails
Picking revenue is one of the most common fatal mistakes in interviews. Revenue measures what users pay, not what they value. It’s a lagging indicator: by the time revenue drops, you’ve already missed the behavioral signals that would have told you why. The NSM should predict revenue, not be it.
Use it, do not recite it
Listing properties of a good NSM is not a strong answer. Arriving at a specific, threshold-anchored, behavior-rooted metric for the product in front of you is. Practice on unfamiliar products: take a product you’ve never worked on, classify it (attention/transaction/productivity), apply the three filters, and land on a metric you can defend against the gaming and reverse tests. That’s the skill the interviewer is evaluating.