RAG vs fine-tuning vs prompting: how to choose
The right answer is not a table. It is a decision ladder with specific escalation signals, and interviewers at AI-native companies use this question to separate candidates who understand production constraints from those who memorized a comparison chart.
The ladder: cheapest-first, escalate on failure
Start with prompting. Escalate to RAG when prompting cannot solve the knowledge gap. Escalate to fine-tuning when RAG cannot solve the behavior gap. That order is not a style preference. Deviating from it without a specific justification is a red flag to interviewers, because each step up adds cost, operational complexity, and a new failure mode.
Prompting covers the case where the model already has the knowledge and you need better output shape, consistent persona, or structured reasoning. A well-structured prompt with a few examples gets there in hours with zero infrastructure. In 2026, context windows of 200K to 1M tokens (Claude, Gemini) mean many knowledge retrieval problems that previously required RAG can now be solved with long-context prompting instead. That is the first question to ask: can I just put the documents in the prompt?
RAG covers the knowledge gap: proprietary data the model was not trained on, information that changes frequently, or knowledge bases too large even for a long context. The decisive escalation signal is freshness. If the underlying data updates and the model output must reflect that without retraining, RAG is the answer, because it lets you update the knowledge base without touching the model. A concrete step before reaching for fine-tuning: add a reranker (Cohere Rerank, BGE-Reranker), which can deliver a 15 to 35% quality lift on RAG retrieval at a fraction of the cost of training anything.
Fine-tuning covers the behavior gap: consistent output format the model will not hold through prompting alone, a specialized reasoning pattern, a brand voice or scoring rubric that must hold across millions of calls. The escalation signal is that prompting gets you 70% of the behavior you need and evals show it regresses on every model update. Fine-tuning is also the margin lever at scale: at 200K or more queries per month, a fine-tuned smaller model can cut inference cost 70 to 90% versus frontier API calls. But it requires a minimum of roughly 500 high-quality labeled examples and ongoing ML engineering capacity. Those are real costs.
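The ladder can be written down as a small decision function. The following is a minimal sketch, not a definitive implementation: the `Situation` fields and the `choose_techniques` name are hypothetical, and the only number taken from the text is the rough floor of 500 labeled examples.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical inputs a PM would gather before picking a technique."""
    model_has_knowledge: bool           # base model already knows the domain
    corpus_fits_in_context: bool        # documents fit in a long-context prompt
    data_changes_often: bool            # output must reflect fresh data
    behavior_gap_after_prompting: bool  # evals show prompting cannot hold the behavior
    labeled_examples: int               # high-quality labeled examples available
    has_ml_capacity: bool               # ongoing ML engineering support


MIN_LABELED_EXAMPLES = 500  # rough floor quoted in the text


def choose_techniques(s: Situation) -> list[str]:
    """Walk the ladder cheapest-first and return the techniques to use."""
    plan = ["prompting"]  # always the first rung

    # Knowledge gap: try long context first, escalate to RAG on size or freshness.
    if not s.model_has_knowledge:
        if s.corpus_fits_in_context and not s.data_changes_often:
            plan.append("long-context prompting")
        else:
            plan += ["rag", "reranker"]  # reranker is the cheap step before training

    # Behavior gap: fine-tune only when the prerequisites are actually met.
    ready = s.labeled_examples >= MIN_LABELED_EXAMPLES and s.has_ml_capacity
    if s.behavior_gap_after_prompting and ready:
        plan.append("fine-tuning")

    return plan
```

A fast-changing proprietary corpus with no behavior gap yields `["prompting", "rag", "reranker"]`. A behavior gap without enough labeled examples stays off the fine-tuning rung, which is the point of the ordering: each escalation needs a named signal.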
Four decision axes
- Knowledge volatility. Data changes frequently or is proprietary: RAG. Data is stable and bounded: long-context prompt or fine-tuning on a snapshot.
- Behavior vs. knowledge gap. The crispest heuristic: RAG fixes what the model knows; fine-tuning fixes how it consistently behaves.
- Cost and operational complexity. Production data from 800+ systems puts the median implementation cost at $28K for RAG alone, $35K for fine-tuning alone, and $55K for the combined approach. At low-to-medium volume, inference on a fine-tuned model can cost 6x more per query than base model API calls. That flips at scale, where a fine-tuned smaller model undercuts frontier API pricing.
- Latency and reliability. RAG adds a retrieval step that can fail silently when the vector DB is stale or the retrieval pulls the wrong documents. Fine-tuning fails silently when training distribution drifts from production queries. Both failure modes require monitoring. The PM who names the monitoring requirement earns the hire signal.
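The cost flip on the third axis comes down to a fixed hosting cost being spread over more queries. Here is a rough sketch of that arithmetic. Every price in it is a made-up placeholder, not a figure from the production data cited above, and the function names are hypothetical.

```python
import math


def frontier_monthly_cost(queries: int, api_price_per_query: float) -> float:
    """Pay-per-call frontier API: cost scales linearly with volume."""
    return queries * api_price_per_query


def tuned_monthly_cost(queries: int, hosting_per_month: float,
                       marginal_price_per_query: float) -> float:
    """Self-hosted fine-tuned model: fixed hosting plus a small marginal cost."""
    return hosting_per_month + queries * marginal_price_per_query


def break_even_volume(hosting_per_month: float, api_price_per_query: float,
                      marginal_price_per_query: float) -> float:
    """Monthly query volume above which the fine-tuned model is cheaper."""
    saving = api_price_per_query - marginal_price_per_query
    if saving <= 0:
        return math.inf  # the fine-tuned model never becomes cheaper
    return hosting_per_month / saving


if __name__ == "__main__":
    # Placeholder prices for illustration only.
    API, HOSTING, MARGINAL = 0.02, 3000.0, 0.002
    for q in (20_000, 200_000, 1_000_000):
        f = frontier_monthly_cost(q, API)
        t = tuned_monthly_cost(q, HOSTING, MARGINAL)
        print(f"{q:>9} queries/month: frontier ${f:,.0f}, fine-tuned ${t:,.0f}")
    print(f"break-even: {break_even_volume(HOSTING, API, MARGINAL):,.0f} queries/month")
```

With these placeholder prices, the break-even sits near 167K queries per month. Below it the fixed hosting cost dominates and the fine-tuned model is the more expensive option; above it the gap widens in the other direction. Real numbers will move the threshold, but the shape of the calculation is what an interviewer is listening for.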
Hybrid patterns worth naming
Roughly 60% of 2026 production AI deployments use a combined approach. Two patterns to know:
RAFT (Retrieval-Augmented Fine-Tuning): fine-tune the model to reason over retrieved documents rather than just completing prompts. A legal system using RAFT reduced irrelevant citations from 18% to 4%. The knowledge base stays fresh through retrieval; the reasoning behavior is baked into the weights.
Routing architecture: a lightweight classifier sends routine queries to a fine-tuned small model such as Llama 3.1 8B; edge cases go to a frontier model. Both use RAG layers. The result is that you pay frontier prices only for the queries that require frontier reasoning. This is also a PM design decision, not just an engineering one: you have to specify what “routine” means and build evals to catch misrouting.
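The routing pattern can be sketched in a few lines. This is an illustrative skeleton under stated assumptions: `routine_score`, the 0.8 threshold, and the two model callables are hypothetical stand-ins, and a real system would put a trained classifier and actual model clients behind them.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Router:
    routine_score: Callable[[str], float]  # hypothetical classifier, 0.0 to 1.0
    small_model: Callable[[str], str]      # stand-in for the fine-tuned small model
    frontier_model: Callable[[str], str]   # stand-in for the frontier model
    threshold: float = 0.8                 # product decision: what counts as "routine"
    decisions: list[tuple[str, str, float]] = field(default_factory=list)

    def answer(self, query: str) -> str:
        score = self.routine_score(query)
        route = "small" if score >= self.threshold else "frontier"
        # Log every decision so misrouting can be measured in evals later.
        self.decisions.append((query, route, score))
        model = self.small_model if route == "small" else self.frontier_model
        return model(query)


# Toy stubs for demonstration only: short queries are treated as routine.
router = Router(
    routine_score=lambda q: 0.9 if len(q.split()) < 12 else 0.3,
    small_model=lambda q: f"[small] {q}",
    frontier_model=lambda q: f"[frontier] {q}",
)
```

The `decisions` log is the part that matters for the PM: sampling it against human labels is one way to estimate how often a hard query was sent to the small model.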
What strong and weak answers sound like
strong
"I start with prompting. If the model already has the knowledge and I just need better output shape or persona, a well-structured prompt with examples gets there in hours with zero infrastructure. I also check whether a long-context window can handle the knowledge base before adding retrieval at all. I escalate to RAG when the gap is about knowledge: proprietary data, data that changes frequently, or a corpus too large to fit in context. Before I reach for fine-tuning, I try a reranker on the RAG pipeline since that is a 15 to 35% quality lift at much lower cost. I escalate to fine-tuning only when I have a behavior gap prompting cannot fix, 500-plus labeled examples, and ML engineering capacity to maintain it. On cost: fine-tuning looks expensive upfront, but at 200K-plus queries per month, the per-query inference cost can drop 70 to 90% versus frontier APIs. So fine-tuning is the margin lever at scale. A hybrid worth naming: RAFT, which reduced irrelevant citations from 18% to 4% in a legal system. My failure-mode concern with RAG is silent retrieval errors when the vector DB goes stale. For fine-tuning it is distribution drift. Both require ongoing evals, not just shipping."
weak
"It depends on the use case. Fine-tuning gives the best quality but is expensive and slow. RAG is a good middle ground for proprietary data. Prompting is fastest but limited." This recites the cost-benefit table without giving the interviewer any signal about when to actually move from one tier to the next, what evidence would push the decision, or how you would handle failure in production. Interviewers hear this constantly. It reads as a candidate who skimmed a comparison article, not someone who has reasoned through a live architectural choice.
The 2026 frame
In 2026, feasibility is not the question. Any of the three techniques can technically work. The real question is viability: what does it cost to build and operate at your volume, and does the quality lift justify it? And lovability: does the approach handle failure gracefully enough that users trust it when retrieval pulls a stale document or the fine-tuned model hits an out-of-distribution query?
RAG fails silently when retrieval goes wrong. Fine-tuning fails silently when production queries drift from the training distribution. A PM who can describe these failure modes, name the monitoring required, and connect the architectural choice to the unit economics is showing judgment. One who recites the capability ladder is showing vocabulary.
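To make the monitoring requirement concrete, here is a minimal sketch of a retrieval health check that turns silent RAG failures into visible alerts. The `RetrievedDoc` shape, the 30-day age limit, and the 0.5 score floor are all assumptions chosen for illustration. A comparable check for fine-tuning would track how far production queries sit from the training set.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RetrievedDoc:
    doc_id: str
    score: float            # retriever or reranker relevance score (assumed 0 to 1)
    last_updated: datetime  # when the source document was last refreshed


def retrieval_alerts(docs: list[RetrievedDoc], now: datetime,
                     max_age: timedelta = timedelta(days=30),
                     min_score: float = 0.5) -> list[str]:
    """Return human-readable alerts for one retrieval call; empty means healthy."""
    if not docs:
        return ["no documents retrieved"]

    alerts = []
    # Wrong-document signal: even the best hit is only weakly relevant.
    if max(d.score for d in docs) < min_score:
        alerts.append("low relevance: best score below threshold")

    # Stale-index signal: sources have not been refreshed recently.
    stale = [d.doc_id for d in docs if now - d.last_updated > max_age]
    if stale:
        alerts.append("stale documents: " + ", ".join(stale))
    return alerts
```

Aggregating these alerts over time gives a rate that can sit on a dashboard next to answer-quality evals, so a stale index shows up as a trend before it shows up as a user complaint.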
For the unit economics behind the cost-flip at scale, see LLM unit economics. For the broader argument that feasibility is no longer the constraint, see feasibility is free. For the earlier gate of whether a model belongs in the feature at all, see should a model even be here.