ai · standard
"RAG vs fine-tuning vs prompting: how do you choose?"
When would you use prompting, RAG, or fine-tuning for an AI feature?
This tests whether you can scope an AI solution to the actual problem instead of reaching for the most complex option. The failure mode is treating “fine-tune a model” as a reflex answer. What the interviewer is actually checking: do you optimize for the cheapest thing that passes an eval, or do you posture toward sophistication? In 2026, feasibility is rarely the constraint: you can build any of these architectures. The decision comes down to viability and lovability.
Structure a strong answer
Walk the escalation ladder from cheap to expensive. Choose by what the problem needs, and commit only when a simpler step fails a real eval.
The full ladder: prompt engineering → full-context prompting with caching → RAG → fine-tuning → distillation. Each step adds cost, latency, and maintenance overhead. Never skip a rung without evidence from an eval.
Step 1: prompt engineering. If the model already knows the domain and the task is stable, a well-crafted system prompt with few-shot examples often gets 80% of the way there. Don’t build infrastructure to solve what a prompt can solve.
Step 2: full-context prompting with caching. If the knowledge base is under roughly 200,000 tokens, loading it directly into context with prompt caching is often faster and cheaper than standing up a retrieval pipeline. Most PM candidates skip this rung entirely. Naming it signals real production awareness.
Step 3: RAG. When the answer depends on private, fresh, or large-scale knowledge (a help center that updates weekly, a user’s account history, a regulatory corpus), RAG gives grounded and citable responses, fast knowledge updates without retraining, and clear auditability when something goes wrong. Citability is a viability argument too: for support and compliance use cases, provenance is what builds user trust and satisfies audit requirements.
The main risk is not the retrieval algorithm. Roughly 40% of RAG production failures trace to data quality: stale chunks, bad metadata, and poor chunking strategy. A PM answer that names this is more credible than one that debates embedding models. Anthropic’s Contextual Retrieval work reported a 49% reduction in failed retrievals, rising to a 67% reduction when combined with reranking, a concrete data point worth citing if your interviewer asks what state-of-the-art looks like.
Step 4: fine-tuning. Use it only when you need the model to behave differently: consistent tone, a specific structured output format, reliable function-calling patterns. You need both labeled data and a stable target behavior. Without both, you’re spending GPU budget chasing a moving goalpost. Fine-tuning encodes behavior. It does not inject dynamic facts. Confusing those two is the most common failure in weak answers.
In 2026, the production pattern is usually both: RAG for volatile facts and citations, a LoRA or QLoRA adapter for stable behavior and output format. This is the composable adaptation stack. Don’t reach for it on day one, but know it exists and explain when you’d get there.
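The ladder above can be sketched as a small decision function. This is a minimal illustration, not a definitive rule set: the `Problem` fields and the `choose_approach` name are hypothetical, the inputs are assumed to come from eval results you have already run, and the 200,000-token cutoff is the rough figure from Step 2 rather than a hard limit.

```python
from dataclasses import dataclass

# Rough threshold from Step 2; treat it as a guide, not a hard limit.
FULL_CONTEXT_TOKEN_LIMIT = 200_000


@dataclass
class Problem:
    """Hypothetical summary of an AI feature, filled in from eval results."""
    prompt_passes_eval: bool        # does a system prompt with few-shot examples pass?
    needs_external_knowledge: bool  # private, fresh, or large-scale facts required?
    knowledge_tokens: int           # estimated size of the knowledge base in tokens
    behavior_fails_eval: bool       # format or tone failures that prompting can't fix?
    has_stable_labeled_data: bool   # labeled examples AND a stable target behavior?


def choose_approach(p: Problem) -> list[str]:
    """Return the cheapest set of techniques the evidence supports."""
    if p.prompt_passes_eval:
        return ["prompt engineering"]

    stack: list[str] = []
    if p.needs_external_knowledge:
        if p.knowledge_tokens <= FULL_CONTEXT_TOKEN_LIMIT:
            stack.append("full-context prompting with caching")
        else:
            stack.append("RAG")

    # Fine-tuning encodes behavior, so add it only with eval evidence and stable data.
    if p.behavior_fails_eval and p.has_stable_labeled_data:
        stack.append("fine-tuning (LoRA/QLoRA adapter)")

    # No evidence for escalation yet: keep iterating on the prompt.
    return stack or ["prompt engineering"]
```

The function returns a list because the composable stack (retrieval for facts, an adapter for behavior) is a valid outcome, but only when both conditions are backed by evidence.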
strong
"I treat this as an escalation ladder and choose the cheapest thing that could pass an eval, then move up only when there's evidence the simpler approach fails. First: can prompt engineering solve it? If the model knows the domain and the task is stable, a well-crafted system prompt with few-shot examples often gets 80% of the way there. If the knowledge base is under about 200,000 tokens, full-context prompting with caching is often faster and cheaper than building a retrieval pipeline. Second: if the answer depends on private or frequently updated knowledge, like a help center that changes weekly, I'd use RAG. RAG gives me grounded and citable responses, fast updates without retraining, and clear auditability. The risk I'd manage isn't the retrieval algorithm; it's data quality. Stale docs, bad chunking, and missing metadata cause around 40% of RAG failures in production. I'd build an eval that measures retrieval recall before calling it production-ready. Third: fine-tuning only when I need consistent behavior at scale and have enough labeled examples with a stable target. For a support agent, I'd start with RAG, run an eval on retrieval recall and answer groundedness, and revisit fine-tuning only if I see format or tone failures at scale that RAG can't fix. In 2026 the production pattern is often both: RAG for volatile facts, a LoRA adapter for behavior. But I wouldn't build that from day one."
weak
"Fine-tuning gives the best quality, so I'd fine-tune a model on our data." This fails on four counts: it confuses knowledge injection with behavior training; it ignores that fine-tuning on frequently-changing facts means constant retraining at GPU cost; it skips the eval question entirely; and it reads as technical posturing rather than the cost and viability judgment senior PMs are expected to demonstrate.
The PM judgment
Viability here means: what is the cost-to-serve per query, how fast can you iterate when the knowledge or behavior target changes, and what is the maintenance burden on your team? Lovability means: will responses be grounded and citable (user trust and compliance), consistent in format (reliability), and fast enough that users don’t abandon the interaction?
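Cost-to-serve per query is simple arithmetic once you know token counts and prices. The sketch below is illustrative only: the function name, the prices, the token counts, and the cache discount are all hypothetical placeholders, and it ignores cache-write premiums and retrieval infrastructure costs. Substitute your provider's actual pricing.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float,
                   cached_fraction: float = 0.0,
                   cache_discount: float = 0.0) -> float:
    """Estimated USD cost of one query.

    cached_fraction: share of input tokens read from a prompt cache (0 to 1).
    cache_discount: fractional price reduction on cached tokens (0.9 = 90% cheaper).
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    billable_input = fresh + cached * (1 - cache_discount)
    input_cost = billable_input * usd_per_m_input / 1_000_000
    output_cost = output_tokens * usd_per_m_output / 1_000_000
    return input_cost + output_cost


# Hypothetical numbers: a 150,000-token knowledge base loaded into context,
# 500 output tokens, $3 per million input tokens, $15 per million output tokens.
uncached = cost_per_query(150_000, 500, 3.0, 15.0)
cached = cost_per_query(150_000, 500, 3.0, 15.0,
                        cached_fraction=0.95, cache_discount=0.9)
print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
```

Running the same function over each rung of the ladder, multiplied by expected query volume, gives a first-pass viability comparison before any architecture is built.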
RAG wins when knowledge is volatile and citations build trust. Fine-tuning wins when you need consistent behavior at scale and have stable labeled data. Prompting wins when you haven’t yet proven either is necessary. The PM judgment is: don’t build the expensive thing until a cheaper eval proves you need it. Interviewers at OpenAI and Anthropic are specifically checking whether you reach for the cheapest-thing-that-passes-eval or reflexively escalate to complexity.
Follow-up questions to prepare for
Interviewers at AI-forward companies will probe after your opening answer. Have these ready.
“How many labeled examples do you need to fine-tune?” It depends on the task and the base model, but even a few hundred high-quality examples can shift format behavior meaningfully. The stronger PM answer: define what “enough” looks like in the eval before collecting labels, because label cost compounds fast and you don’t want to discover your target behavior was underspecified after spending on annotation.
“How do you eval a RAG pipeline?” At minimum: retrieval recall (does the right chunk surface for a given query?), answer groundedness (does the generated answer stay within the retrieved context?), and end-to-end answer quality rated by a judge or a held-out human set. Retrieval recall is the first gate. If it’s low, tuning the generator won’t help.
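Retrieval recall is straightforward to compute once you have a labeled eval set. The sketch below assumes each eval example records the chunk IDs the retriever returned (in rank order) and the chunk IDs a human marked as relevant; the field names and function names are hypothetical. Groundedness and end-to-end quality need a judge model or human raters, so they are not shown here.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("each eval query needs at least one labeled relevant chunk")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(eval_set: list[dict], k: int) -> float:
    """Average recall@k across all eval queries."""
    scores = [recall_at_k(ex["retrieved"], set(ex["relevant"]), k) for ex in eval_set]
    return sum(scores) / len(scores)


# Hypothetical eval set with two labeled queries.
eval_set = [
    {"query": "How do I reset my password?",
     "retrieved": ["c7", "c2", "c9"], "relevant": ["c2"]},
    {"query": "What is the refund window?",
     "retrieved": ["c1", "c4", "c5"], "relevant": ["c3", "c4"]},
]
print(mean_recall_at_k(eval_set, k=3))  # 0.75: first query 1/1, second query 1/2
```

Treating this number as the first gate means setting a threshold before looking at generated answers, since a generator cannot ground itself in a chunk it never received.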
“What would make you escalate from RAG to fine-tuning?” Persistent format failures at scale that prompt engineering can’t fix, or a latency and cost profile where the RAG round-trip is too expensive per query. Both require evidence from a running eval, not intuition.
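“Persistent format failures” only counts as evidence if it is measured. One way to sketch that measurement, assuming the feature is supposed to return a JSON object with known keys (the function name, the keys, and the sample outputs are hypothetical):

```python
import json


def format_failure_rate(outputs: list[str], required_keys: set[str]) -> float:
    """Share of model outputs that are not a JSON object containing every required key."""
    if not outputs:
        raise ValueError("need at least one output to compute a rate")
    failures = 0
    for raw in outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            failures += 1  # not valid JSON at all
            continue
        if not isinstance(parsed, dict) or not required_keys.issubset(parsed.keys()):
            failures += 1  # valid JSON, but wrong shape or missing keys
    return failures / len(outputs)


# Hypothetical sample of logged outputs from a support agent.
sample = [
    '{"intent": "refund", "reply": "Your refund is on its way."}',
    'Sure! Here is the answer you asked for.',
    '{"intent": "billing"}',
    '[1, 2]',
]
print(format_failure_rate(sample, {"intent", "reply"}))  # 0.75
```

Tracking this rate across prompt iterations shows whether prompting has plateaued, which is the point at which fine-tuning becomes a defensible next step.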
“What’s the data quality risk with RAG?” Stale documents that didn’t get re-indexed, chunks that split at the wrong boundary (splitting a table header from its rows, for instance), and missing or wrong metadata that causes the retriever to surface irrelevant context. A PM who names these is the one who has shipped this, or studied it carefully enough to sound like they have.
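Two of these risks, stale documents and missing metadata, can be caught with a routine audit of the index. The following is a minimal sketch under assumed conventions: each chunk record carries `source_updated_at` and `indexed_at` timestamps plus a `metadata` dict, and the required fields are hypothetical. Bad chunk boundaries are harder to detect automatically and usually need spot checks.

```python
from datetime import datetime

# Hypothetical required metadata fields; adjust to your own schema.
REQUIRED_METADATA = ("source_url", "title")


def audit_chunks(chunks: list[dict]) -> dict[str, list[str]]:
    """Flag chunk IDs that are stale or lack required metadata."""
    issues: dict[str, list[str]] = {"stale": [], "missing_metadata": []}
    for chunk in chunks:
        # Stale: the source document changed after this chunk was last indexed.
        if chunk["source_updated_at"] > chunk["indexed_at"]:
            issues["stale"].append(chunk["id"])
        # Missing metadata: a required field is absent or empty.
        meta = chunk.get("metadata", {})
        if any(not meta.get(field) for field in REQUIRED_METADATA):
            issues["missing_metadata"].append(chunk["id"])
    return issues


# Hypothetical index records.
chunks = [
    {"id": "c1",
     "source_updated_at": datetime(2026, 1, 10),
     "indexed_at": datetime(2026, 1, 12),
     "metadata": {"source_url": "https://example.com/a", "title": "Refunds"}},
    {"id": "c2",
     "source_updated_at": datetime(2026, 2, 1),
     "indexed_at": datetime(2026, 1, 12),
     "metadata": {"source_url": "https://example.com/b", "title": ""}},
]
print(audit_chunks(chunks))
```

Running an audit like this on a schedule turns the data quality risk into a number the team can track alongside retrieval recall.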