HEART framework: Google UX metrics explained
Best for: "Measure success" and "define metrics" interview questions, especially for Google roles
HEART is Google’s user-centered metrics framework, published at CHI 2010 by researchers Kerry Rodden, Hilary Hutchinson, and Xin Fu. It was built because PULSE (Page views, Uptime, Latency, Seven-day active users, Earnings) tracked business and technical health but couldn’t show whether UX quality was improving. The philosophical move: HEART was designed to produce metrics measurable “on a large scale” using behavioral data, not just in lab studies. In interviews, it fails not when candidates don’t know the acronym but when they recite all five letters without making a single decision.
What HEART stands for
Each category pairs with what it actually measures in practice:
- Happiness: User attitude and satisfaction, typically survey-based. CSAT, NPS, app store rating, post-task rating. Lagging and manipulable: users often report satisfaction even when dissatisfied (courtesy bias), and in AI products, users rate incorrect answers highly because the model sounds confident.
- Engagement: Depth and frequency of use by active users. Sessions per user, features used per session, DAU/MAU ratio, average dashboards created per user per month. Engagement can be gamed by dark patterns: notifications that drag people back don’t make the product better.
- Adoption: New users, or existing users starting to use a specific feature. Percentage of eligible users who tried a feature within the first N days, percentage of new users (under 30 days) using a feature at least once. Not useful at the product level for mature, universally installed products like Google Maps, but highly useful as a cohort-specific feature signal.
- Retention: Returning users. 30-day return rate, churn, subscription renewal, percentage of existing users using a feature two or more times per month. Engagement without retention is vanity: a high session count from one-time visitors tells you nothing about whether the product is actually working.
- Task Success: Efficiency and error rate. Completion rate, time-on-task, mid-flow abandonment, error rate. This is the category that tells you whether users finished what they came to do.
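To make the behavioral categories concrete, here is a minimal Python sketch of how a few of the metrics above might be computed. The function names (`share`, `stickiness`, `completion_rate`) and all of the data are hypothetical, chosen only for illustration. Happiness is left out because it comes from surveys rather than event logs.

```python
def share(numerator: set[str], denominator: set[str]) -> float:
    """Fraction of users in `denominator` who also appear in `numerator`."""
    if not denominator:
        return 0.0
    return len(numerator & denominator) / len(denominator)


def stickiness(daily_active: list[set[str]], monthly_active: set[str]) -> float:
    """Average daily actives divided by monthly actives (the DAU/MAU ratio)."""
    if not daily_active or not monthly_active:
        return 0.0
    avg_dau = sum(len(day) for day in daily_active) / len(daily_active)
    return avg_dau / len(monthly_active)


def completion_rate(started: int, completed: int) -> float:
    """Share of started tasks that were finished."""
    if started == 0:
        return 0.0
    return completed / started


# Made-up user IDs and counts, for illustration only.
eligible = {"a", "b", "c", "d"}
tried_feature = {"a", "b", "z"}          # "z" is not eligible, so it is ignored
active_last_month = {"a", "b", "c", "d"}
active_this_month = {"a", "c"}

adoption = share(tried_feature, eligible)                        # 2 of 4 -> 0.5
retention = share(active_this_month, active_last_month)          # 2 of 4 -> 0.5
engagement = stickiness([{"a", "b"}, {"a"}], {"a", "b", "c"})    # 1.5 / 3 -> 0.5
task_success = completion_rate(started=40, completed=30)         # 0.75
```

Adoption and retention use the same set arithmetic here; what differs is who goes in the denominator (eligible users versus last period's users), which is usually the part worth stating out loud.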
The GSM process: the actual interview skill
Naming the five letters is table stakes. What separates strong from weak answers is running Goals-Signals-Metrics (GSM) before you touch a dashboard:
- Goals: What does success look like for this product or feature, in terms of user benefit? Not “increase engagement” but “help users reach destinations confidently without distraction.”
- Signals: What user behaviors or attitudes indicate progress toward that goal? Route completion without mid-trip abandonment; returning to navigate on the next trip.
- Metrics: How do you quantify those signals reliably at scale? Route completion rate, mid-route abandonment rate, percentage of users who navigate again within 14 days.
Run this chain out loud in an interview. It shows the interviewer you’re not filling in a template: you’re connecting what the product is supposed to do for users to what you’d actually put on a dashboard.
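One way to keep yourself honest about the chain is to write it down as a small data structure. The sketch below is an illustration only: the `GSMChain` class and its `is_complete` check are hypothetical, not part of HEART or any Google tooling.

```python
from dataclasses import dataclass, field


@dataclass
class GSMChain:
    """One Goals-Signals-Metrics chain for a single HEART category."""
    heart_category: str
    goal: str
    signals: list[str] = field(default_factory=list)
    metrics: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        # A metric with no goal or signal behind it is a recited letter.
        return bool(self.goal.strip()) and bool(self.signals) and bool(self.metrics)


maps_task_success = GSMChain(
    heart_category="Task Success",
    goal="Help users reach destinations confidently without distraction",
    signals=["Route completion without mid-trip abandonment"],
    metrics=["Route completion rate", "Mid-route abandonment rate"],
)

# The rote version: a metric with nothing connecting it to user benefit.
recited = GSMChain(heart_category="Engagement", goal="", metrics=["DAU"])

print(maps_task_success.is_complete())  # True
print(recited.is_complete())            # False
```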
Worked example: Google Maps navigation
Goal: Users reach their destination without needing to pull over, re-plan, or abandon the route.
Signals: Route completion without cancellation; return navigation within 14 days (indicating the user trusted the product enough to come back); post-trip satisfaction rating.
Metrics to track:
- Task Success: route completion rate; mid-route abandonment segmented by road type (a spike on a specific route type flags map data quality issues)
- Retention: percentage of users who open navigation again within 14 days
- Happiness: post-trip in-app rating (one tap, immediately after arrival, low friction)
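The two behavioral metrics above can be sketched from a trip log. This is a minimal illustration under stated assumptions: the `Trip` record, the sample data, and the 10-point `margin` used to flag a spike are all hypothetical, and a real pipeline would define "baseline" more carefully.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Trip:
    user_id: str
    day: date
    road_type: str
    completed: bool


# Made-up trips, for illustration only.
TRIPS = [
    Trip("u1", date(2026, 3, 1), "highway", True),
    Trip("u1", date(2026, 3, 9), "highway", True),
    Trip("u2", date(2026, 3, 2), "rural", False),
    Trip("u2", date(2026, 3, 20), "urban", True),
    Trip("u3", date(2026, 3, 3), "rural", False),
    Trip("u4", date(2026, 3, 4), "urban", True),
]


def route_completion_rate(trips):
    return sum(t.completed for t in trips) / len(trips) if trips else 0.0


def abandonment_by_road_type(trips):
    totals, abandoned = defaultdict(int), defaultdict(int)
    for t in trips:
        totals[t.road_type] += 1
        abandoned[t.road_type] += not t.completed
    return {road: abandoned[road] / totals[road] for road in totals}


def return_rate_within(trips, days=14):
    """Share of users with another trip within `days` of their first trip."""
    by_user = defaultdict(list)
    for t in trips:
        by_user[t.user_id].append(t.day)
    returned = 0
    for days_seen in by_user.values():
        days_seen.sort()
        first = days_seen[0]
        if any(0 < (d - first).days <= days for d in days_seen[1:]):
            returned += 1
    return returned / len(by_user) if by_user else 0.0


def flag_spikes(rates, baseline, margin=0.10):
    """Road types whose abandonment exceeds the baseline by more than `margin`."""
    return sorted(road for road, rate in rates.items() if rate > baseline + margin)


rates = abandonment_by_road_type(TRIPS)
overall_abandonment = 1 - route_completion_rate(TRIPS)
print(flag_spikes(rates, overall_abandonment))  # ['rural']
```

With this toy data, overall completion is 4 of 6 trips and only one of four users returns within 14 days; the segmented view is what points at rural roads specifically.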
Letters to drop: Adoption is less useful at the product level because Maps has near-universal install in most markets. Reframe it as feature adoption for a specific cohort: new users in a new market, or users who’ve never tried live traffic routing. Engagement (sessions per user) is not the core signal: a user who opens Maps once a week for a reliable commute is succeeding; optimizing session count could push toward frivolous re-opens.
Selecting three categories and explicitly dropping the other two is what makes the answer strong. Saying “I’d use all five” is the rote answer.
Which letters to prioritize by product type
| Product situation | Lead with |
|---|---|
| New feature, low awareness | Adoption + Task Success |
| Mature consumer product | Retention + Happiness |
| Enterprise workflow tool | Task Success + Engagement (depth) |
| AI assistant or agent | Task Success + Retention (skip Happiness unless you trust your survey design) |
| End-of-life or deprecated feature | None: don’t invest UX measurement in something you’re shutting down |
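The table can also be read as a simple lookup. The sketch below encodes it as a Python dictionary; the situation keys and the `split_categories` helper are hypothetical names for illustration, and the "deprioritized" list is only a starting point you would still justify for the specific product.

```python
HEART = ("Happiness", "Engagement", "Adoption", "Retention", "Task Success")

# Keys are hypothetical labels for the rows of the table above.
LEAD_CATEGORIES = {
    "new_feature_low_awareness": ("Adoption", "Task Success"),
    "mature_consumer_product": ("Retention", "Happiness"),
    "enterprise_workflow_tool": ("Task Success", "Engagement"),
    "ai_assistant_or_agent": ("Task Success", "Retention"),
    "end_of_life_feature": (),
}


def split_categories(situation: str) -> tuple[tuple[str, ...], tuple[str, ...]]:
    """Return (lead, deprioritized) HEART categories for a product situation."""
    lead = LEAD_CATEGORIES[situation]
    deprioritized = tuple(c for c in HEART if c not in lead)
    return lead, deprioritized


lead, deprioritized = split_categories("ai_assistant_or_agent")
print(lead)           # ('Task Success', 'Retention')
print(deprioritized)  # ('Happiness', 'Engagement', 'Adoption')
```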
Use it, do not recite it
The weak answer names all five letters, assigns one obvious metric to each (NPS, DAU, signups, 30-day return, completion rate), and treats the job as done. It fails because: it gives the same answer for any product; it skips GSM so there’s no logic connecting goals to metrics; it doesn’t explain which two or three letters matter most here; and it doesn’t name a metric you’d actually act on versus one that sits in a dashboard.
The skill is selecting the right two or three categories for this product at this stage, running GSM to arrive at specific metrics, and explicitly dropping the letters that don’t apply, with a sentence explaining why.
Strong answer
"Before I name any metric, let me run GSM. The goal for Google Maps navigation is to help users reach destinations confidently without distraction. The signal I care about is whether users complete routes without abandoning mid-trip, and whether they come back for the next one. That maps to Task Success (route completion rate, mid-route abandonment) and Retention (users navigating again within 14 days). I'd add Happiness via the post-trip in-app rating Google already surfaces, because it's low-friction and immediately post-task. I'm dropping Engagement as the lead metric because a user who opens Maps once a week for a reliable commute is succeeding; session count isn't the point. I'm dropping Adoption at the product level because Maps has near-universal install, though I'd reframe it as feature adoption for new users in a specific market. If route abandonment spikes above baseline on a specific road type, that's actionable: investigate map data quality. NPS alone isn't actionable at that resolution."
"One 2026 note: for AI products, I'd supplement Task Success with what I'd call a deflection-appropriateness signal: when the assistant says it can't help, is that the right call? HEART has no Trust letter, and that's a gap worth naming out loud. Task Success can score high while the agent solves the wrong problem entirely."
Weak answer
"For Happiness I'd use NPS, for Engagement I'd track DAU, for Adoption I'd look at new signups, for Retention I'd check 30-day return rate, and for Task Success I'd measure completion rate." This fails on every dimension: it treats HEART as a checklist, skips GSM entirely, gives the same answer for any product, doesn't say which two or three letters actually matter for Google Maps, and doesn't name a metric that would trigger a real decision. The interviewer is checking whether you'd put this on a dashboard and act on it, not whether you can name five categories.
The 2026 AI product blind spot
HEART was designed for web apps where users act intentionally. In 2026, AI and agent products act on behalf of users, which breaks two categories structurally.
Task Success becomes ambiguous: an agent can complete a task (high success rate) while solving the wrong problem. The user asked for a summary; the agent produced a summary of the wrong document. Task Success scores 100%; the user is worse off. HEART has no mechanism to detect this.
Happiness is confounded by sycophancy: users rate AI-generated answers highly even when the answers are incorrect, because the output sounds fluent and confident. NPS and CSAT become lagging and unreliable signals for model quality.
What strong 2026 PMs add alongside HEART: deflection-appropriateness rate (when the system refuses, was the refusal correct?), hallucination rate from eval harnesses, and trust calibration (do users act on AI output in situations where acting on incorrect output is costly?). These aren’t in HEART. Naming that gap in an interview, rather than pretending HEART covers everything, is the signal that you’re thinking about the actual product rather than filling in a template.
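As a rough illustration of the first two supplementary signals, the sketch below computes them from labeled evaluation records and sets them beside a plain task completion rate. Everything here is an assumption: the `EvalRecord` fields, the idea that each record carries human or eval-harness labels, and the sample data are hypothetical. Trust calibration is omitted because it needs behavioral data on what users do with the output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRecord:
    refused: bool               # the assistant declined to help
    should_refuse: bool         # label: declining was the right call
    hallucinated: bool          # label: output contained an unsupported claim
    task_completed: bool        # the assistant produced a finished output
    solved_right_problem: bool  # label: the output addressed what the user asked


def rate(flags: list[bool]) -> Optional[float]:
    """Share of True values, or None when there is nothing to measure."""
    return sum(flags) / len(flags) if flags else None


def deflection_appropriateness(records: list[EvalRecord]) -> Optional[float]:
    return rate([r.should_refuse for r in records if r.refused])


def hallucination_rate(records: list[EvalRecord]) -> Optional[float]:
    return rate([r.hallucinated for r in records if not r.refused])


def task_completion(records: list[EvalRecord]) -> Optional[float]:
    return rate([r.task_completed for r in records if not r.refused])


def right_problem_rate(records: list[EvalRecord]) -> Optional[float]:
    return rate([r.solved_right_problem for r in records if r.task_completed])


# Made-up labeled records, for illustration only.
RECORDS = [
    EvalRecord(True, True, False, False, False),   # correct refusal
    EvalRecord(True, False, False, False, False),  # refused but should have helped
    EvalRecord(False, False, False, True, True),
    EvalRecord(False, False, True, True, False),   # fluent but hallucinated
    EvalRecord(False, False, False, True, False),  # summarized the wrong document
    EvalRecord(False, False, False, True, True),
]

print(deflection_appropriateness(RECORDS))  # 0.5
print(hallucination_rate(RECORDS))          # 0.25
print(task_completion(RECORDS))             # 1.0
print(right_problem_rate(RECORDS))          # 0.5
```

In this toy data every answered request counts as completed, yet only half of the completed tasks addressed what the user asked: the gap between the last two numbers is the blind spot described above.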
The viable and lovable bar connects here: Engagement without Retention is vanity. Happiness without Task Success is decoration. A product is lovable when users return because it genuinely works, not because they said they were satisfied in a survey they answered to avoid conflict.
See also: North Star Metric, AARRR pirate metrics, measure success for Google Photos, and lovable, not just usable.