GAME framework: goals, actions, metrics, evaluation
Best for: Execution rounds, "how would you measure success for X?" interview questions
GAME (Goals, Actions, Metrics, Evaluation) is the recognized scaffolding for “how would you measure success for X?” in PM execution rounds at Meta, Google, and Amazon. Arriving at the framework is table stakes. What separates a pass from a strong pass is what you do inside each step, especially Evaluation, and whether the whole answer holds up for an AI product without defaulting to DAU.
The four steps
Goals. Name a specific, testable outcome, not a direction. In 2026, feasibility is not a real constraint at most companies, so the meaningful tension is between viable (is there a margin-generating business here?) and lovable (does this proactively meet users where they work, or does it just check a box?). A strong Goal names which tension is live for this product and what “success” looks like in user terms before moving to business terms.
Actions. This is where most candidates fail. They list outputs: opened the panel, triggered a suggestion, clicked share. The corrected version maps behaviors to value-delivery events. For an AI product, the key action is “user acted on the AI’s output and got the job done,” not “user opened the AI panel.” Ask: what behavioral signal proves value actually transferred?
Metrics. Name three things, not one:
- Primary metric, defined precisely: not “retention” but “percentage of users who complete at least one task per week, measured per cohort from day 1.”
- Guardrail metric: the number that must not regress while you optimize the primary (e.g., error rate must stay below 3%).
- Diagnostic metric: something that explains why the primary moved, not just that it did (e.g., post-response user flow: percentage who act, refine, abandon, or escalate).
Interviewers at Meta and Google specifically probe for guardrails. An answer with no guardrail metric will not pass the strong-pass bar.
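To make the three metric roles concrete, here is a minimal Python sketch that computes one of each from a flat event log. The `Event` record, its field names, the cohort labels, and the choice to define error rate as errors over attempted tasks are all illustrative assumptions, not a prescribed schema; the 3% cap is the example threshold from the guardrail bullet.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Post-response behaviors named in the diagnostic bullet above.
FLOW_KINDS = ("act", "refine", "abandon", "escalate")


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; field names are illustrative assumptions."""
    user_id: str
    cohort: str  # e.g. a signup-week label
    week: int    # whole weeks since the user's day 1 (0 = first week)
    kind: str    # "task_completed", "error", or one of FLOW_KINDS


def weekly_completion_rate(events, cohort_sizes, week):
    """Primary: share of each cohort completing at least one task that week."""
    completers = defaultdict(set)
    for e in events:
        if e.week == week and e.kind == "task_completed":
            completers[e.cohort].add(e.user_id)
    return {c: len(completers[c]) / n for c, n in cohort_sizes.items() if n > 0}


def error_rate(events):
    """Assumed definition: error events divided by attempted tasks."""
    errors = sum(1 for e in events if e.kind == "error")
    attempts = errors + sum(1 for e in events if e.kind == "task_completed")
    return errors / attempts if attempts else 0.0


def guardrail_holds(events, max_error_rate=0.03):
    """Guardrail: the error rate must stay below the cap."""
    return error_rate(events) < max_error_rate


def post_response_flow(events):
    """Diagnostic: share of post-response events that act, refine, abandon, escalate."""
    counts = Counter(e.kind for e in events if e.kind in FLOW_KINDS)
    total = sum(counts.values())
    return {k: counts[k] / total for k in FLOW_KINDS} if total else {}
```

Counting distinct users per cohort, rather than raw events, is what keeps the primary metric from rising just because a few heavy users complete many tasks.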
Evaluation. The most commonly flubbed step. Candidates treat it as a summary (“these metrics together give us a complete picture”). A real Evaluation stress-tests the metrics. Ask: could this metric rise while actual value falls? (False positive: users submit more prompts because the first answer was bad.) Could value rise while the metric is flat? (False negative: a user completes a task on the first try and never returns, which looks like churn.) Does the metric surface guardrail violations? (Aggregate uptime at 99.9% tells you nothing if the model is hallucinating on 30% of responses.) Close with a recommendation: which metric is primary, what threshold triggers a review, and what you do if the guardrail fires.
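The stress test and the closing recommendation can be sketched as two small functions. This is an illustration under stated assumptions: `value_delta` stands for some independent signal of delivered value (for example, sampled task-success reviews), and the function names, return labels, and thresholds are hypothetical.

```python
def stress_test(metric_delta, value_delta, tolerance=0.0):
    """Compare a metric's movement against an independent value signal.

    Both deltas are period-over-period changes; `tolerance` is the band
    treated as "flat". All names here are illustrative assumptions.
    """
    metric_up = metric_delta > tolerance
    value_up = value_delta > tolerance
    value_down = value_delta < -tolerance
    if metric_up and value_down:
        # e.g. more prompts submitted because the first answer was bad
        return "false_positive"
    if value_up and not metric_up:
        # e.g. first-try task completion that looks like churn
        return "false_negative"
    return "consistent"


def recommend(primary, review_threshold, guardrail, guardrail_limit):
    """Closing recommendation: the guardrail check takes priority over the primary."""
    if guardrail >= guardrail_limit:
        return "investigate_guardrail"  # stop optimizing the primary
    if primary < review_threshold:
        return "review_primary"
    return "continue"
```

Checking the guardrail first encodes the priority described above: a primary metric that looks healthy does not override a guardrail that has fired.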
AI products specifically
Traditional metrics are structurally misleading for AI products. A user who submits one prompt and acts on the result gets more value than one who submits ten because the first nine failed. DAU and session length tell you the user showed up, not whether the product did the job.
AI-native metrics now expected in strong answers: autonomy rate (percentage of tasks resolved without re-submission or human intervention), post-response user flow as a quality proxy without an expensive labeling pipeline, and token cost per completed task as the efficiency side of unit economics. Only 40% of product teams measure AI ROI through business outcomes; 60% use time-saved proxies. Interviewers at Anthropic, OpenAI, and Google DeepMind probe for this distinction.
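As a rough sketch of how two of these could be computed, the Python below derives autonomy rate and token cost per completed task from per-task records. The `Task` fields and the `usd_per_1k_tokens` price parameter are hypothetical, and charging the tokens spent on failed tasks to the completed ones is one possible accounting choice, not the only one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical per-task record; field names are illustrative assumptions."""
    task_id: str
    completed: bool
    resubmissions: int        # times the user re-submitted the prompt
    human_intervention: bool  # e.g. escalated to a person
    tokens: int               # total tokens consumed, including retries


def autonomy_rate(tasks):
    """Share of tasks resolved with no re-submission and no human intervention."""
    if not tasks:
        return 0.0
    autonomous = sum(
        1 for t in tasks
        if t.completed and t.resubmissions == 0 and not t.human_intervention
    )
    return autonomous / len(tasks)


def token_cost_per_completed_task(tasks, usd_per_1k_tokens):
    """Total token spend, failed tasks included, divided by completed tasks."""
    completed = sum(1 for t in tasks if t.completed)
    if completed == 0:
        return None  # undefined when nothing was completed
    total_tokens = sum(t.tokens for t in tasks)
    return total_tokens / 1000 * usd_per_1k_tokens / completed
```

Including failed-task tokens in the numerator means the cost figure worsens as retries and abandoned attempts grow, which is the behavior an efficiency metric should penalize.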
Strong vs. weak
strong
"My goal for Cursor is: developers complete tasks they couldn't easily do before, at an acceptance rate that sustains model cost. The real tension is lovable: are they returning because Cursor anticipated their workflow, or just because setup friction keeps them? Under Actions, the value event is an accepted suggestion that shipped without reversion, not a completion triggered. Primary metric: weekly accepted lines per developer cohort. Guardrail: bug rate on accepted lines below 2%, proxied by PR reverts within 24 hours. Diagnostic: post-response flow showing what percentage act, refine, or abandon. For Evaluation: if acceptance rate rises while lines per acceptance falls, suggestions are getting trivially short. That's a false positive. If bug rate crosses the guardrail while accepted lines grow, I stop optimizing the primary and investigate."
weak
"My goal is to increase engagement. Under Actions: users open the app, trigger a suggestion, accept it, share output, and return. For Metrics I'd track DAU, session length, and retention. For Evaluation, these metrics together give us a full picture of engagement health."
This fails at every step. The goal is undefined and untestable. The actions are outputs, not outcomes. The metrics are activity proxies that can all rise while the product delivers no value. The evaluation adds no stress-test and names no guardrails. No pass at Meta, Google, or any frontier AI company.
Use it, do not recite it
Speed-running the acronym is the failure mode. Interviewers can read the framework. What they want is a candidate who uses the structure to arrive at a specific, guardrail-paired metric for the product in front of them, then genuinely stress-tests it. Practice on unfamiliar products: apply GAME cold, run the false-positive and false-negative checks in the Evaluation step, and name a guardrail every time.
See also: north star metric, measure success for Instagram Stories, and two metrics in conflict.