Model drift

Model drift is the umbrella term for AI performance degradation caused by three distinct problems: the input data shifting away from training data (data drift), the world changing such that the right answer changes (concept drift), or what counts as correct changing because annotation guidelines or business rules evolved (label drift). Treating all three as the same problem with the same fix is one of the most reliable ways to give a weak answer in an AI PM interview. The fix for data drift is retraining. The fix for concept drift might be changing the product behavior entirely, not the model. And in 2026, there is a fourth type that did not exist in classic ML: provider-side drift, where OpenAI, Anthropic, or Google silently updates the underlying weights behind an API call and your outputs change without any action on your side.

The taxonomy

Data drift (also called covariate shift): the distribution of inputs changes, but the relationship between inputs and outputs does not. A fraud detection model trained on checkout flows from before 2024 starts misfiring when AI purchasing agents begin making transactions autonomously. The inputs are structurally different from anything in the training set. Fix: retrain with representative new data.

Concept drift: the underlying relationship between inputs and outputs changes because the world moved. A churn-prediction model trained when users held one subscription now misfires because the same users hold five simultaneous AI tool subscriptions. The feature “number of logins” no longer predicts churn the way it did. Fix: may be retraining, or may be a product behavior change if the signal itself is no longer meaningful.

Label drift: what counts as a correct output changes. Annotation guidelines shift, the definition of a positive class expands, or a regulatory change redefines a category. Fix: update the ground truth labels and retrain against the revised definition.

Provider-side model drift (the 2026-specific type): your provider silently updates the underlying model. Your product changes without a deploy. This is not a data or concept problem; it is a contractual and eval problem. The PM job is version-locking API calls where behavioral stability matters and running regression evals on every provider model update.

Prompt drift (a related failure mode in LLM products): months of accumulated edits to a system prompt degrade performance on edge cases without any single triggering event. Common in products where many stakeholders touch the prompt.
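Provider-side drift and prompt drift share a defense: a fixed set of eval cases that is re-run whenever the model version or the prompt changes. What follows is a minimal sketch of that idea, not a real harness. The model identifier, the `call_model` callable, the substring pass criterion, and the 2-point tolerance are all assumptions made for illustration; a real suite would call your provider's SDK and use richer scoring.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: a dated snapshot identifier rather than a floating alias.
# The name below is hypothetical, not a real provider model.
PINNED_MODEL = "example-model-2026-01-15"


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    must_contain: str  # simplest possible pass criterion, for illustration only


def pass_rate(
    call_model: Callable[[str, str], str], model: str, cases: list[EvalCase]
) -> float:
    """Share of eval cases whose output contains the expected substring.

    `call_model(model, prompt)` is a hypothetical wrapper around whatever
    API client you use. Assumes `cases` is non-empty.
    """
    passed = sum(
        case.must_contain.lower() in call_model(model, case.prompt).lower()
        for case in cases
    )
    return passed / len(cases)


def safe_to_adopt(baseline: float, candidate: float, max_drop: float = 0.02) -> bool:
    """Go/no-go rule: hold the update if the pass rate falls by more than max_drop."""
    return candidate >= baseline - max_drop
```

The design point is that the tolerance (`max_drop` here) is a product decision written down in advance, so the ship-or-hold call on a provider update does not get improvised during an incident.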

The silent failure mode

Drift almost always surfaces in product metrics before it shows up in model accuracy dashboards. Engagement drops. Support escalations increase. Conversion goes flat. The Spotify COVID example is the canonical real-world case: commuting stopped in 2020, listening patterns shifted from commute sessions to home listening, and skip rates and session-length metrics flagged the drift before any accuracy metric moved. The lesson is that PMs who watch only model metrics catch drift late. Business metrics are the leading signal. If you are only reviewing model performance dashboards, you are monitoring the wrong thing first.

Detection: what to know without needing the formulas

Two statistical tests appear in production monitoring dashboards and sometimes in interviews. You do not need to compute them; you need to know what they detect and what a threshold means.

Population Stability Index (PSI): measures how much a feature's distribution has shifted relative to a baseline, usually the training data. Common rule of thumb: PSI below 0.1 is stable, 0.1 to 0.2 warrants investigation, above 0.2 is significant drift requiring action (some teams set the upper cutoff at 0.25 instead). This is the score a monitoring tool will surface. When you see it above threshold, you should know it means the inputs your model is receiving look materially different from what it trained on.
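You will not compute PSI by hand, but seeing the arithmetic makes the thresholds less abstract. PSI buckets a feature, compares the share of values in each bucket between the baseline and the current data, and sums (current share minus baseline share) times ln(current share / baseline share). The sketch below is one simple way to do that, assuming ten equal-width buckets and a small floor for empty buckets; the function names are illustrative, and monitoring tools may bucket differently.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `actual` relative to the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bucket_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped into the end buckets.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Assumption: a small floor avoids log(0) when a bucket is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def psi_status(value: float) -> str:
    """Apply the rule-of-thumb thresholds quoted above."""
    if value < 0.1:
        return "stable"
    if value <= 0.2:
        return "investigate"
    return "significant drift"
```

Usage: `psi_status(psi(training_values, last_week_values))`, where both arguments are lists of one numeric feature. Identical distributions score 0; the further the current data moves from the baseline, the higher the score.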

Kolmogorov-Smirnov (KS) test: a statistical test of whether two samples are drawn from the same underlying distribution. Used to compare current input data to training data on a feature-by-feature basis; the test statistic is the largest gap between the two samples' cumulative distributions.
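For intuition, the KS statistic is simply the largest vertical gap between the two samples' empirical cumulative distribution functions: 0 means the samples overlap perfectly, 1 means they do not overlap at all. The sketch below computes only that statistic in plain Python; the function name is illustrative, and in practice a library routine such as `scipy.stats.ks_2samp` would also return the p-value.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for x in set(ref) | set(cur):
        # Empirical CDF at x = share of the sample that is <= x.
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap
```

Usage: call `ks_statistic(training_values, current_values)` once per numeric feature. What counts as too large a gap is a threshold to agree with your data science team, not something this sketch decides.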

Monitoring tools PMs should know exist and be able to speak to: Evidently AI (open source, statistical drift tests across data and concept drift), Arize AI (LLM observability with trace-level drift), Fiddler AI (enterprise model monitoring with explainability), WhyLabs (data and model monitoring). You do not configure these. Your job is to ensure they are on the roadmap from day one, to define which alert thresholds trigger escalation, and to own the documented decision criteria for what happens when those thresholds are crossed.

The PM decision tree

When a drift signal appears, the right response depends on the type of drift and the recency of changes.

Did you or your provider recently update the model? If the change was yours, roll back to the previous version and run regression tests. If it was the provider's, pin the provider version and add evals to the update review process.
Did input distribution shift (PSI above threshold, business metrics diverging from model metrics)? Trigger the retraining pipeline with new representative data. This is data drift.
Did the world change such that the correct answer is different? This is concept drift. The fix may be retraining on post-shift data, but it may also be a product decision: if the signal is no longer valid, no amount of retraining recovers it. Redesigning the feature may be the right call.
Is the failure narrow but high-stakes, covering a specific class of inputs that degraded? Add an eval to catch that failure class. Add a guardrail or human-in-the-loop review gate for the affected decision type.
Is the world change permanent and the model’s previous behavior now actively wrong? Accept the drift and adjust. The product behavior needs to change, not just the model.
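The questions above can be written down as a small function, which is a useful exercise because it forces the criteria to be explicit. This is a sketch under stated assumptions: the `DriftSignal` fields and `recommend` function are hypothetical names, the permanent-change question is folded into the concept-drift branch, and the "keep monitoring" fallback is an addition for the case where no question is answered yes.

```python
from dataclasses import dataclass


@dataclass
class DriftSignal:
    model_recently_changed: bool = False      # your deploy or a provider update
    input_distribution_shifted: bool = False  # e.g. PSI above threshold
    correct_answer_changed: bool = False      # the world moved
    change_is_permanent: bool = False         # previous behavior is now wrong
    narrow_high_stakes_failure: bool = False  # one input class degraded


def recommend(signal: DriftSignal) -> list[str]:
    """Map observed drift signals to the responses described above."""
    actions: list[str] = []
    if signal.model_recently_changed:
        actions.append("roll back or pin the version, then run regression evals")
    if signal.input_distribution_shifted:
        actions.append("data drift: trigger retraining with representative new data")
    if signal.correct_answer_changed:
        if signal.change_is_permanent:
            actions.append("accept the drift: change the product behavior, not just the model")
        else:
            actions.append("concept drift: retrain on post-shift data or redesign the feature")
    if signal.narrow_high_stakes_failure:
        actions.append("add an eval plus a guardrail or human-in-the-loop gate")
    # Assumption: with no signal present, the default is to keep watching.
    return actions or ["no drift response triggered: keep monitoring"]
```

Note that the function can return several actions at once. Real incidents often combine signals, for example a provider update landing in the same week as an input shift, and the responses are not mutually exclusive.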

Reflexively saying “retrain” is the weak answer. Retraining is expensive and slow to iterate. Rolling back, adding evals, adding guardrails, or accepting and adjusting are often faster and cheaper depending on the type of drift.

What the PM owns, and what MLOps owns

MLOps owns execution: running retraining pipelines, configuring monitoring infrastructure, deploying model updates, managing the evaluation harness. The PM owns the criteria that drive those decisions.

Specifically, the PM is responsible for:

Getting monitoring dashboards and alert thresholds into the roadmap before launch, not retrofitting them after a degradation incident
Documenting the retraining trigger criteria: which metric, which threshold, which business condition
Defining the go/no-go criteria for accepting a new model version or a provider update
Running the eval suite on every provider model update (including silent ones) and owning the decision to ship or hold based on results
Deciding whether a degraded class of inputs gets a guardrail, a human-in-the-loop step, or triggers a full retraining
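One way to make "documented trigger criteria" concrete is to keep them as a small, reviewable record rather than tribal knowledge. The sketch below assumes a simple greater-than rule on a single metric; the class name, field names, and example values are hypothetical and only illustrate the "which metric, which threshold, which business condition" structure from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainTrigger:
    metric: str              # which metric
    threshold: float         # which threshold
    business_condition: str  # which business condition, in plain language
    owner: str               # who makes the call when the trigger fires

    def fires(self, observed: float) -> bool:
        """True when the observed metric exceeds the documented threshold."""
        return observed > self.threshold


# Hypothetical example values, not a recommendation.
amount_drift = RetrainTrigger(
    metric="psi:transaction_amount",
    threshold=0.2,
    business_condition="support escalations rising week over week",
    owner="PM",
)
```

Usage: `amount_drift.fires(0.25)` returns `True`, which is the cue to check the business condition and make the documented decision, not an automatic retrain.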

The 2026 senior AI PM bar is explicit about this: interviewers expect candidates to have shipped something with post-deployment monitoring, not just launched an MVP. Questions like “tell me about a time your model degraded in production” are now standard at Google, Anthropic, Microsoft, and Meta. A candidate who has no answer to that question is implicitly admitting they have only shipped to launch, not managed a live system.

Interview expectations

Strong answer

"Model drift is the umbrella term for performance degradation caused by shifts in data (data drift), shifts in the world (concept drift), or shifts in what counts as correct (label drift). In 2026, there is a fourth type that didn't exist in classic ML: provider-side version drift, where Anthropic or OpenAI updates the underlying model and your outputs change without any action on your side. The tricky part for a PM is that drift almost always surfaces in product metrics first: engagement drops, support escalations go up, conversion goes flat. I watch business metrics as my leading drift signal, not model accuracy dashboards. When I suspect drift, I ask: Did we or our provider recently change the model? If yes, roll back or regression test. Did input distribution shift? That's data drift; trigger retraining with new data. Did the world change such that the correct answer is different? That's concept drift; the fix might be a product behavior change, not just a retrain. Are we hitting a narrow but high-stakes failure class? Add an eval, add a guardrail, consider human-in-the-loop. What I own: the monitoring dashboard in the roadmap from day one, the retraining trigger criteria documented before launch, the eval suite that runs on every provider model update, and the decision of whether to retrain, roll back, or add a guardrail. MLOps owns execution. I own the criteria."

Weak answer

"Model drift is when the model stops working well over time. You detect it by monitoring performance and then retrain the model." This answer fails on five counts: it conflates data drift and concept drift as if they are the same problem with the same fix; it offers retrain as the reflexive solution when rolling back, adding evals, or adding guardrails are often faster and cheaper; it names no signals, no monitoring tools, no decision criteria; it does not distinguish what the PM owns from what MLOps owns; and it has no 2026 specificity. Interviewers flag this as textbook recall without operational depth.

For building the eval system that catches drift regressions before they reach users, see eval harness for PMs. For the full post-launch monitoring workflow, see day two playbook. For the decision of when AI failures require human intervention, see when the AI is wrong.