rca · hard

"DAU dropped 20%, find the root cause"

Daily active users dropped 20% week over week. Find the root cause.

Updated Jun 2026 · Calibrated to the strong-hire bar

This round tests diagnosis under pressure. The failure mode is anchoring on the first plausible cause (“a competitor launched”) and chasing it before ruling out anything else. Interviewers watch the order you move in, not just the hypotheses you name. RCA questions make up roughly 22% of PM interviews at senior levels, second only to product design, so this is high-frequency and high-stakes.

Two bisects before any segmentation

Is the drop real? Measurement errors are the most common cause of apparent drops in instrumented products, and almost no candidate checks this first. Did a logging pipeline change? Did the metric definition change? Did a new SDK version alter what gets counted? In AI-native products, the definition question is especially sharp: if “DAU” shifted from “any app open” to “sessions with at least one agentic action completed,” that change alone can produce an apparent 20% drop overnight with no real behavior change. Before forming a single hypothesis, confirm the number is trustworthy.
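A minimal sketch of the definition check, assuming you can replay one day's raw events under both definitions of "active". The event names (`app_open`, `agentic_action_completed`) and the sample data are hypothetical, chosen only to show how the same behavior yields two different DAU numbers.

```python
def dau(events, qualifies):
    """Count distinct users with at least one qualifying event."""
    return len({user for user, event in events if qualifies(event)})


# Hypothetical event log for one day: (user_id, event_name).
events = [
    ("u1", "app_open"), ("u1", "agentic_action_completed"),
    ("u2", "app_open"), ("u2", "agentic_action_completed"),
    ("u3", "app_open"), ("u3", "agentic_action_completed"),
    ("u4", "app_open"), ("u4", "agentic_action_completed"),
    ("u5", "app_open"),  # opened the app but completed no agentic action
]

# Old definition: any event counts. New definition: only completed actions.
old_dau = dau(events, lambda event: True)
new_dau = dau(events, lambda event: event == "agentic_action_completed")
apparent_drop = (old_dau - new_dau) / old_dau

print(old_dau, new_dau, apparent_drop)
```

If the two definitions disagree by about the size of the reported drop on a day before the change, the metric moved and the users did not.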

Step-change or gradual trend? Pull the daily curve. A vertical drop on a specific date points to a release, an outage, or a platform event. A curve bending downward over two to three weeks points to a retention or engagement problem. These two shapes have almost no causal overlap, so getting this right before segmenting saves the whole investigation. A 20% one-week drop is almost always a step-change story. If you start listing retention hypotheses for a step-change, you will not pass.
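The shape check can be sketched as a single ratio: how much of the total decline happened on the worst single day. This is an illustrative heuristic, not a standard method. The function name and the 0.6 cutoff are assumptions, and the sketch expects at least two days of data with weekly seasonality already removed.

```python
def classify_shape(daily_dau, step_share=0.6):
    """Label a decline as 'step-change' or 'gradual'.

    If the largest single-day drop accounts for at least `step_share`
    of the total decline, call it a step-change. The cutoff is an
    assumed value to tune against your own history.
    """
    total_decline = daily_dau[0] - daily_dau[-1]
    if total_decline <= 0:
        return "no decline"
    largest_drop = max(
        prev - curr for prev, curr in zip(daily_dau, daily_dau[1:])
    )
    if largest_drop / total_decline >= step_share:
        return "step-change"
    return "gradual"


step = [100, 101, 100, 99, 80, 81, 80]       # one vertical drop
gradual = [100, 97, 94, 91, 88, 85, 82, 80]  # steady bend downward

print(classify_shape(step))     # step-change
print(classify_shape(gradual))  # gradual
```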

The diagnostic sequence

Once you have confirmed the drop is real and identified it as a step-change, segment in this order:

  1. Platform and app version first. A bad iOS release hits iOS only. A bad Android release hits Android only. A server outage or backend change hits both. This single cut localizes cause faster than any other move.
  2. If one platform is affected: go straight to the release log. Look for a crash-rate spike, a broken auth flow, a bad permission prompt, or a missing dependency.
  3. If both platforms are affected: check infrastructure. Error rates, p99 latency, CDN logs, third-party auth providers. A backend incident is the most common cause of a symmetric drop.
  4. If infrastructure is clean: move to external causes. Did a competitor launch or ship a breakout feature? (Grok went from zero to 9.5M DAU in its first weeks, precisely the kind of entrant that can pull users overnight.) Did iOS 18 or Android 15 notification throttling suppress re-engagement pushes? These platform policy changes, introduced in 2025, suppressed re-engagement for consumer apps that relied on push, and almost no candidate names them.
  5. For AI-native products: add model-specific hypotheses before closing. A quantized model variant deployed to reduce inference cost can tank session quality without throwing an error. A model checkpoint rollback, a context-window reduction, and an eval regression are root causes unique to AI products. ChatGPT’s DAU/MAU ratio sits around 36%, roughly 1.7 times Gemini’s 21%: in AI products, session quality is a primary driver of the DAU number, alongside distribution and notifications.
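The first branch of the sequence above can be sketched as follows, assuming you already have prior-week and current-week DAU per platform and app version. The function names, the -10% threshold, and the sample numbers are hypothetical; the sketch also assumes every platform had nonzero DAU in the prior week.

```python
from collections import defaultdict


def platform_changes(segments):
    """Week-over-week change per platform.

    segments: {(platform, app_version): (prev_week_dau, this_week_dau)}
    """
    totals = defaultdict(lambda: [0, 0])
    for (platform, _version), (prev, curr) in segments.items():
        totals[platform][0] += prev
        totals[platform][1] += curr
    return {p: (curr - prev) / prev for p, (prev, curr) in totals.items()}


def next_step(segments, drop_threshold=-0.10):
    """Pick the next branch of the diagnostic sequence (assumed threshold)."""
    changes = platform_changes(segments)
    affected = sorted(p for p, c in changes.items() if c <= drop_threshold)
    if not affected:
        return "no platform-level drop: recheck measurement"
    if len(affected) == len(changes):
        return "all platforms affected: check infrastructure"
    return "localized to " + ", ".join(affected) + ": check the release log"


one_sided = {
    ("ios", "5.1"): (600, 590),
    ("ios", "5.2"): (400, 410),
    ("android", "8.0"): (1000, 600),
}
symmetric = {
    ("ios", "5.2"): (1000, 800),
    ("android", "8.0"): (1000, 790),
}

print(next_step(one_sided))
print(next_step(symmetric))
```

Cutting by app version inside the affected platform is the natural follow-up; the same aggregation works with the version as the grouping key.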

For a gradual trend instead of a step-change, the sequence is different: start with cohort analysis to find which user cohort is churning faster (new users vs. established, by geography or device), then look for product changes or competitor movements that happened several weeks before the bend started.
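For the gradual case, a minimal sketch of the cohort comparison: compute each cohort's retention rate before and after the bend and rank cohorts by how much the rate fell. The cohort names, counts, and data shape are hypothetical.

```python
def retention_change(cohorts):
    """Rank cohorts by change in retention rate, worst first.

    cohorts: {name: {"before": (retained, size), "after": (retained, size)}}
    """
    deltas = []
    for name, periods in cohorts.items():
        before = periods["before"][0] / periods["before"][1]
        after = periods["after"][0] / periods["after"][1]
        deltas.append((name, after - before))
    return sorted(deltas, key=lambda item: item[1])


cohorts = {
    "new users": {"before": (400, 1000), "after": (250, 1000)},
    "established": {"before": (800, 1000), "after": (790, 1000)},
}

worst_cohort, worst_delta = retention_change(cohorts)[0]
print(worst_cohort, round(worst_delta, 2))
```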

What each company listens for

The question looks the same at Meta, Google, and Amazon. The scoring is not.

Meta expects you to stay on diagnosis before touching solutions. Their incident process starts with a five-minute “is this real?” check using Scuba, their real-time data platform, before any hypothesis formation. Candidates who pivot to fixes before localizing the cause fail the Meta version of this question. Depth of diagnosis is what clears the bar.

Amazon scores on Dive Deep from the Leadership Principles. “It’s a platform bug” is not a pass. “It’s a crash on the payment confirmation screen in build 14.2.1 on Android 14 devices with under 3 GB RAM” is closer. Interviewers will push you two levels below your first hypothesis. You should also say at what point you would trigger a Tier 1 incident process.

Google wants statistical rigor. Name your confidence in the drop before hypothesizing. Is 20% outside the normal week-over-week variance? What is the sample size by platform? Ruling out noise explicitly, rather than just accepting the alert, signals statistical maturity.
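One way to make "is 20% outside normal variance?" concrete is a z-score of the latest week-over-week change against the history of such changes. This is a rough sketch: the weekly figures are invented, and a z-score assumes the historical changes are roughly stable and symmetric, which real DAU series often are not.

```python
from statistics import mean, stdev


def wow_changes(weekly_dau):
    """Fractional week-over-week change for each consecutive pair of weeks."""
    return [
        (curr - prev) / prev for prev, curr in zip(weekly_dau, weekly_dau[1:])
    ]


def z_score(history, current):
    """How many standard deviations `current` sits from the historical mean."""
    return (current - mean(history)) / stdev(history)


# Hypothetical weekly DAU; the last week is the reported drop.
weekly_dau = [1000, 1010, 990, 1005, 995, 1000, 800]
changes = wow_changes(weekly_dau)
history, latest = changes[:-1], changes[-1]

z = z_score(history, latest)
print(round(latest, 2), z)
```

A large negative score says the drop is far outside recent week-to-week noise. Running the same check per platform shows whether any single segment is too small for its swing to be trusted.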

strong

"Before I dig in: is this drop confirmed real? Logging changes, metric definition updates, or new SDK versions cause apparent drops more often than people expect. In an AI product I'd also ask whether the definition of 'active' changed, because shifting from 'any open' to 'agentic action completed' can drop the number 20% with no real behavior change. Assuming it's confirmed real, my first cut is shape: step-change or gradual trend? A vertical drop on a specific day is a release, outage, or platform event. A slow curve over two to three weeks is a retention problem. Those have almost no causal overlap, so I won't mix hypotheses from both buckets. A 20% one-week drop looks like a step-change.

I'd segment by platform and app version first: a bad iOS build hits iOS only, a backend outage hits both. If it's one platform, I go straight to the release log for crash rates and broken flows. If it's both, I check infrastructure: error rates, p99 latency, CDN, third-party auth. Only if those are clean do I move to external: competitor launch, or iOS 18 or Android 15 notification throttling suppressing re-engagement pushes. For an AI product I'd also check whether a model checkpoint changed. A quantized model can tank completion quality without throwing an error, and users stop returning before any alert fires.

Once I've localized the cause, I'd name two actions: an immediate one (rollback or hotfix if internal, outage comms if external) and a medium-term one to close the detection gap so a future release doesn't go 24 hours before we catch it."

weak

"A competitor probably launched something, or there's seasonality. I'd look at user feedback and run a marketing campaign to win users back." This fails three ways. First, it skips the instrumentation check entirely, which means you would miss the most common real cause of apparent drops. Second, it anchors on a single external hypothesis without eliminating internal causes, which is almost never how real investigations go. Third, it proposes a solution before identifying the cause: that is exactly the pressure-driven mistake this question is designed to surface, and experienced interviewers watch for it specifically.

How to close

End with a split recommendation keyed to cause, not a generic recovery plan. If internal: roll back the release or restore the model checkpoint, then instrument the gap that let this go 24 hours undetected. If external: assess whether the drop is structural (a platform policy that stays) or transient (a competitor feature launch that users will evaluate and partially return from), and calibrate the response accordingly.

In 2026, a DAU drop is almost never about what you couldn’t build. Feasibility is table stakes. The question is whether the product is still earning its place in the user’s day against a world full of agents that act on their behalf. Frame the recommendation at that altitude: is this a fixable execution failure, or a signal that the product no longer clears the viability threshold?

For metric definitions, see DAU/MAU. For related diagnostic patterns, see “Facebook Groups dropped” and “notifications up, time on site flat.”
