RAG (Retrieval-Augmented Generation)

RAG, retrieval-augmented generation, means fetching relevant documents at query time and injecting them into the model’s context window so the answer is grounded in specific, current, or private knowledge rather than only what the model learned during training. One sentence for the room: you give the model the right pages before it answers, rather than teaching it everything upfront.

In 2026, feasibility is not the question. The PM question is whether this is the right knowledge problem to solve (viable) and whether retrieval surfaces the right context for this specific user rather than generic, semantically-close chunks (lovable). A system that returns the three most similar paragraphs is feasible. One that understands the user’s role and account state to surface the right section is lovable. Retrieval quality is a product metric, not a backend concern.

How it works (end-to-end)

Four steps: (1) your content is split into chunks and converted into vector embeddings; (2) a user query is also embedded and compared against stored vectors to find the most relevant chunks; (3) those chunks are injected into the model’s prompt as context; (4) the model generates a response grounded in the retrieved material.
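
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the index is a plain list rather than a vector database, and every function name here is hypothetical. Step 4 is left out because it depends on whichever model API you call; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk(text: str, size: int = 12) -> list[str]:
    """Step 1a: split content into fixed-size word chunks (no overlap here)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Step 1b: embed every chunk and store it next to its text."""
    return [(c, embed(c)) for d in docs for c in chunk(d)]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Step 2: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Step 3: inject the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are issued within 14 days of a return request.",
    "Enterprise plans include single sign-on and audit logs.",
]
index = build_index(docs)
question = "How long do refunds take?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)  # Step 4 would send this prompt to the model
```

Note that the query and the refund document share only one word. A real embedding model matches on meaning rather than exact tokens, which is why the choice of embedding model and chunking strategy matters so much more than this toy suggests.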

The seam between steps two and three is where nearly everything goes wrong. Industry analysis in 2026 attributes underperforming RAG to retrieval, not generation, 73% of the time. And 80% of those retrieval failures trace back to ingestion and chunking decisions made before any embedding model or LLM is involved.

Chunk size is the highest-leverage decision and among the hardest to reverse. Moving from 1500-token chunks to 600-token chunks with a 50-token overlap is frequently the biggest quality improvement, but the right value is corpus-dependent and must be measured. Changing strategy later requires re-chunking and re-embedding the entire corpus, so flag this early. Before committing, build a labeled test set of 50 to 100 query-answer pairs to rank strategies empirically.
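
To make the chunking trade-off concrete, here is a rough sketch of overlapping chunking plus a simple scoring function for a labeled test set. It assumes the text is already tokenized into a list, and it scores retrieval as a hit rate: the fraction of test queries whose top-k results contain at least one chunk a human labeled as relevant. The names `chunk_tokens` and `hit_rate_at_k` are illustrative, not from any library.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into chunks of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries whose top-k retrieved chunk IDs include a labeled-relevant one."""
    hits = sum(1 for got, want in zip(retrieved, relevant) if want & set(got[:k]))
    return hits / len(relevant)

# A 3000-token document under the two strategies from the text.
tokens = [str(i) for i in range(3000)]
large = chunk_tokens(tokens, size=1500, overlap=0)   # 2 chunks
small = chunk_tokens(tokens, size=600, overlap=50)   # 6 chunks, the last one shorter

# Hypothetical labeled test set: retrieved chunk IDs per query vs. human-labeled relevant IDs.
retrieved = [["c1", "c7"], ["c3", "c4"], ["c9", "c2"]]
relevant = [{"c7"}, {"c5"}, {"c9"}]
score = hit_rate_at_k(retrieved, relevant, k=2)  # 2 of 3 queries hit
```

Running the same test set against each candidate strategy gives a number to compare, which is the empirical ranking the paragraph above asks for.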

When to use RAG (and when not to)

The PM decision ladder: prompting first, then RAG when you need dynamic or proprietary knowledge, then fine-tuning only when the problem is about behavior or tone rather than knowledge. Fine-tuning changes model weights; RAG changes what information the model sees at query time.

RAG is right for unstructured content: docs, emails, PDFs, transcripts. It is wrong for structured data like CRM records or inventory, where API tool-calling is cleaner. RAG also adds per-query cost: retrieved chunks are added to the prompt, so every request carries more input tokens. That trade-off belongs in the PM’s model from day one.
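
The per-query cost is simple arithmetic, sketched below. Every number is a placeholder assumption: the price per million input tokens, the chunk count, and the monthly query volume are made up for illustration and should be replaced with your provider's pricing and your own traffic. The sketch also ignores the cost of the embedding and vector search calls themselves.

```python
def added_cost_per_query(k: int, chunk_tokens: int, price_per_million_input_tokens: float) -> float:
    """Extra input-token cost from injecting k retrieved chunks into one prompt."""
    added_tokens = k * chunk_tokens
    return added_tokens * price_per_million_input_tokens / 1_000_000

# Hypothetical inputs: 5 chunks of 600 tokens, $3.00 per million input tokens.
per_query = added_cost_per_query(k=5, chunk_tokens=600, price_per_million_input_tokens=3.00)

# Hypothetical volume: 200,000 queries per month.
monthly = per_query * 200_000
```

Under these assumptions, retrieval adds 3,000 input tokens to each request, which is a little under one cent per query and on the order of $1,800 per month.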

The production reality PMs own

Permissions are the hardest PM problem in enterprise RAG. Without chunk-level access control, any user's query can retrieve any document in the index. Most implementations secure the application layer but leave retrieval open, which means a junior rep can surface a document their role should not access. This is the detail that separates candidates who have shipped from those who have only read.
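
One way to picture chunk-level access control is a filter applied inside retrieval, before ranking, so that a restricted chunk never reaches the prompt. The sketch below assumes each chunk carries the roles allowed to read it and a precomputed similarity score; the `Chunk` class and `retrieve_for_user` function are hypothetical names, and real vector stores usually expose this as a metadata filter on the search call.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_roles: frozenset[str]  # roles permitted to read this chunk
    score: float                   # similarity to the query, assumed precomputed

def retrieve_for_user(candidates: list[Chunk], user_roles: set[str], k: int) -> list[Chunk]:
    """Drop chunks the user may not read, then return the top k of what remains."""
    visible = [c for c in candidates if c.allowed_roles & user_roles]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]

candidates = [
    Chunk("Q3 compensation bands by level", frozenset({"hr", "finance"}), score=0.91),
    Chunk("Discount approval policy for reps", frozenset({"sales", "finance"}), score=0.74),
    Chunk("Public pricing page copy", frozenset({"sales", "support", "hr", "finance"}), score=0.52),
]

# A junior sales rep never sees the compensation chunk, even though it scores highest.
rep_results = retrieve_for_user(candidates, user_roles={"sales"}, k=2)
```

Filtering before ranking matters: if the filter ran only in the application layer after generation, the restricted text would already have been in the model's context.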

Silent failure is the key risk no one monitors. A RAG system does not crash when retrieval quality degrades. It keeps returning plausible-looking answers that grow stale. A v1 system that audits clean at launch can lose 10 to 20 points of retrieval recall within a year with no code changes, because the corpus drifted. Track retrieval recall, answer faithfulness, and latency p95. Without these, hallucination rates drift upward without any alert.
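
A minimal monitoring sketch for two of these metrics follows. It assumes you re-run a labeled test set on a schedule to get a current recall figure and that you log per-request latencies. The 5-point alert threshold is an arbitrary placeholder, and answer faithfulness is left out because it needs human labels or a judge model.

```python
import math

def p95(latencies_ms: list[float]) -> float:
    """95th percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]

def recall_alert(baseline: float, current: float, max_drop_points: float = 5.0) -> bool:
    """True when recall has fallen more than max_drop_points percentage points below baseline."""
    return (baseline - current) * 100 > max_drop_points

# Hypothetical numbers: recall measured at launch vs. today on the same test set.
launch_recall = 0.85
todays_recall = 0.72
should_alert = recall_alert(launch_recall, todays_recall)  # a 13-point drop exceeds the threshold

latency_p95 = p95([float(ms) for ms in range(1, 101)])  # samples of 1..100 ms
```

The point of the alert is that nothing else will fire: the system keeps answering, so a scheduled comparison against the launch baseline is the only signal that the corpus has drifted.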

Agentic RAG is the dominant enterprise pattern in 2026. Agents handle query decomposition, parallel retrieval, and synthesis. That means multiple failure surfaces rather than one pipeline.
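
The shape of that pattern can be sketched with stubs. Everything below is a hypothetical stand-in: in a real agent, an LLM would decompose the query and synthesize the final answer, and `retrieve` would call a vector store. The sketch only shows sub-queries being retrieved in parallel and each one being able to fail independently, which is where the extra failure surfaces come from.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical knowledge store keyed by sub-query, standing in for a vector search.
KNOWLEDGE = {
    "refund policy": "Refunds are issued within 14 days.",
    "sso setup": "SSO is configured in the admin console.",
}

def decompose(query: str) -> list[str]:
    """Stand-in for LLM query decomposition: split on ' and '."""
    return [part.strip() for part in query.split(" and ") if part.strip()]

def retrieve(sub_query: str) -> str:
    """Stand-in for retrieval; raises KeyError when nothing matches."""
    return KNOWLEDGE[sub_query]

def gather_context(query: str) -> dict:
    """Decompose, retrieve each sub-query in parallel, and record which ones failed."""
    subs = decompose(query)
    with ThreadPoolExecutor() as pool:
        futures = {s: pool.submit(retrieve, s) for s in subs}
    found, failed = {}, []
    for sub, future in futures.items():
        try:
            found[sub] = future.result()
        except KeyError:
            failed.append(sub)
    # A synthesis step would turn `found` into an answer and decide what to do about `failed`.
    return {"context": found, "failed_sub_queries": failed}

result = gather_context("refund policy and pricing tiers")
```

Here one sub-query succeeds and one fails. What the product does next (answer partially, ask a follow-up, or refuse) is a product decision rather than an engineering default.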

Interview expectations

weak

"RAG retrieves documents from a database and gives them to the model so it can answer questions." Describes the mechanism without PM judgment. No trade-offs, no decision criteria, no production awareness.

strong

"RAG solves a specific problem: training data is stale or private knowledge is out of scope. Mechanism: fetch relevant chunks at query time, inject as context. Trade-offs: per-query latency and token cost go up; 73% of RAG failures happen at retrieval, not generation; the hardest production problem is permissions, chunk-level access control is not optional in enterprise. I start with prompting, add RAG for dynamic knowledge, reach for fine-tuning only when the problem is behavior or tone. Metrics: retrieval recall, answer faithfulness, latency p95, and I watch for silent failure where quality degrades without any alert."

For the full decision framework, see RAG vs fine-tuning and choosing between RAG, fine-tuning, and prompting. For measuring whether your RAG system is working, see eval design for PMs.
