"How would you optimize an LLM for retrieval on a given dataset?"

Updated Jun 2026 · Calibrated to the strong-hire bar

This is a real question in Meta’s AI PM and ML PM interview loops, and most candidates fail it by answering the wrong question. They hear “optimize an LLM” and jump to a technique: “I’d use RAG” or “I’d fine-tune on the dataset.” The interviewer stops listening. What they are testing is whether you can diagnose a system before you prescribe a solution, and whether you understand that in 2026 the dominant failure in retrieval products is not model intelligence. It is data quality, retrieval pipeline quality, and the absence of evals that would tell you which problem you actually have.

Structure a strong answer

Start by scoping. Any retrieval question is two separate questions: what is the dataset, and what is the query distribution? The answer follows from those.

strong

"Before I recommend anything, I want to understand the retrieval task. What kind of data: structured records, long-form documents, conversation history, something else? How frequently does it change? What does a good answer look like, and what is the cost of a wrong one?

With that framing, here is how I sequence the decision. First, prompt engineering and a small number of high-quality few-shot examples. It is free to iterate, requires no infrastructure, and is often enough for well-defined query types where the dataset fits in a context window.

If the dataset is too large or changes too frequently, I move to RAG. For retrieval, I would specifically ask for hybrid search: BM25 keyword retrieval combined with dense vector semantic retrieval. Either one alone underperforms the combination, and keyword matching handles exact-term queries that embedding models frequently miss. Then a re-ranking pass with a cross-encoder over the initial top-k results before passing them to the generator. This produces large quality jumps and is standard in production pipelines.

For chunking, I would use semantic or paragraph-level chunking rather than fixed character splits, because fixed splits break mid-concept and degrade retrieval precision. Chunk size is a real tradeoff: smaller chunks give higher retrieval precision but lose surrounding context. The right choice depends on the document structure.

I would only consider fine-tuning if the problem is behavioral: the model needs to output in a specific format, adopt a particular tone, or reason about domain-specific relationships that prompting and retrieval cannot carry. Fine-tuning does not reliably add factual knowledge. It changes how the model thinks, not what it knows. Saying 'fine-tune to add knowledge' is the single most common wrong answer to this question.

On evaluation: I would instrument Recall@5 and answer faithfulness from day one. If the model generates fluent wrong answers, that is almost always a retrieval failure, not a model failure. The retrieved documents were wrong or irrelevant, and the LLM generated a confident response from bad context. That distinction tells me whether to fix the pipeline or the model, which are completely different projects.

At Meta scale, I also size the cost tradeoff: retrieval adds 100 to 500 ms of latency plus embedding compute cost. A re-ranker adds another 50 to 100 ms and is almost always worth it. A dedicated fine-tuned endpoint costs significantly more and is only justified if the behavioral changes are not achievable through the pipeline."
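The hybrid search step in the answer above needs some way to merge the keyword ranking and the dense ranking into one list. The answer does not say how; one commonly used option, assumed here, is reciprocal rank fusion (RRF). The sketch below is a minimal illustration: the document ids are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results for one query from each retriever.
bm25_hits = ["doc_sku_123", "doc_pricing", "doc_faq"]
dense_hits = ["doc_pricing", "doc_returns", "doc_sku_123"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, which is the behavior the answer is asking for. In a full pipeline the fused list would then go to the cross-encoder re-ranker, which is not shown here.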

weak

"I would fine-tune the model on our data so it knows our domain." This fails three ways: fine-tuning is expensive and slow to iterate; it requires labeled training data that most teams do not have; and it does not reliably add factual knowledge. The model learns patterns and behaviors, not facts. The fine-tuned knowledge also decays as the dataset changes, so you are on a retraining treadmill.

A variant: "RAG is always better because you don't have to retrain." This is not a reasoned answer; it is a slogan. It ignores the cases where retrieval quality is fundamentally limited by corpus quality and chunk structure. It says nothing about how you would know whether it is working.

A third variant: a framework-dump that lists RAG components (chunking, embedding, vector DB, re-ranking) without connecting any decision to the specific dataset or query distribution. Interviewers hear this from engineers who read the RAG paper. It fails because it skips diagnosis, skips eval, and skips the cost tradeoff.

The decision sequence, not the decision

The question is a diagnostic sequence, not a selection between three options. Treating it as a pick-one question is the failure mode that eliminates most candidates.

The escalation hierarchy: (1) prompt engineering and few-shot examples first, because iteration is free and the result is often sufficient; (2) RAG when the dataset is too large or too dynamic for context injection; (3) fine-tuning only when the problem is behavioral, not factual; (4) combined RAG and fine-tuning when both problems exist independently. Each step requires a reason to escalate. “More is better” is not a reason.
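The hierarchy above can be written down as a small decision function. This is only a sketch of the reasoning, not a tool anyone would ship: the `Diagnosis` fields and the returned labels are hypothetical names introduced for illustration, and each flag stands for a judgment made after scoping the dataset and query distribution.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    fits_in_context: bool     # dataset is small enough to inject into the prompt
    changes_frequently: bool  # dataset is too dynamic for static context injection
    behavioral_gap: bool      # format, tone, or domain reasoning that prompting cannot carry


def choose_approach(d: Diagnosis) -> str:
    """Escalate only when there is a stated reason to do so."""
    needs_retrieval = (not d.fits_in_context) or d.changes_frequently
    if needs_retrieval and d.behavioral_gap:
        return "rag + fine-tuning"      # step 4: both problems exist independently
    if needs_retrieval:
        return "rag"                    # step 2: too large or too dynamic
    if d.behavioral_gap:
        return "fine-tuning"            # step 3: behavioral, not factual
    return "prompting + few-shot"       # step 1: the default starting point


print(choose_approach(Diagnosis(fits_in_context=False,
                                changes_frequently=True,
                                behavioral_gap=False)))
```

Note that no input asks whether the model "needs to know more facts": in this framing, missing facts are a retrieval problem, so they never route to fine-tuning on their own.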

Why Meta asks this

Meta’s products run retrieval pipelines at enormous scale with adversarial query distributions: the Meta AI assistant; search across Facebook, Instagram, and WhatsApp; and internal enterprise search. They are not asking a theoretical question. They are asking whether you understand the boundary between a model problem and a data or pipeline problem, and whether you would catch that distinction before shipping a broken system to a billion users.

The 2026 reframe: it is not a model intelligence problem

Before 2025, “optimize an LLM for retrieval” was partly a model selection question: which model is capable enough to handle my domain? In 2026, frontier models are capable enough for almost any domain. The question has shifted entirely. It is now a data quality and pipeline viability question: is your retrieval corpus accurate, current, and well-structured enough to generate trustworthy answers? And do you have the eval infrastructure to know when it is failing?

The candidates who fail this question in 2026 are still answering the 2023 version of it. They are reaching for model upgrades when the problem is chunking strategy, or recommending fine-tuning when the problem is that the retrieval index is stale.

The silent failure mode

RAG has a failure mode that kills candidates who describe it in the abstract but have never operated it. Retrieval can surface confidently wrong documents. The LLM then generates a fluent, confident, wrong answer. The user has no signal that anything went wrong. This is not a hallucination in the classic sense; the model is faithfully grounding its response in what it retrieved. The retrieval failed, not the model.

A PM must design for this: evals that measure retrieval quality (Recall@k, Mean Reciprocal Rank) separately from generation quality (faithfulness, answer relevance), human-in-the-loop thresholds for low-confidence retrievals, and fallback logic when the retrieved context is empty or falls below a relevance score threshold. Naming this failure mode and the PM-level response to it is what separates a passing answer from a strong-hire answer.
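The threshold and fallback logic described above might look like the following gate, which runs between retrieval and generation. It is a minimal sketch under stated assumptions: the `Hit` type, the three route names, the 0 to 1 score scale, and both threshold values are hypothetical and would need to be calibrated against labeled queries.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    doc_id: str
    score: float  # relevance score; a 0-to-1 scale is assumed here


def route(hits, answer_threshold=0.5, review_threshold=0.3):
    """Decide what to do with retrieved context before calling the generator."""
    if not hits:
        return "fallback"          # empty context: do not let the model improvise
    best = max(h.score for h in hits)
    if best >= answer_threshold:
        return "answer"            # confident enough to generate from this context
    if best >= review_threshold:
        return "human_review"      # low-confidence retrieval: human in the loop
    return "fallback"              # below the relevance threshold


print(route([Hit("doc_a", 0.42), Hit("doc_b", 0.18)]))
```

The point of the gate is that the user gets an explicit signal (a fallback message or a delayed, reviewed answer) in exactly the cases where the generator would otherwise produce a fluent answer from bad context.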

Technical vocabulary that matters

You do not need to implement these. You need to use them correctly.

  • Hybrid search: BM25 keyword retrieval combined with dense vector semantic retrieval. Default to recommending hybrid, not “just add embeddings.”
  • Re-ranking: a cross-encoder model that re-scores the initial top-k retrieved results before passing them to the generator. Standard in production. A PM who stops at retrieval and skips re-ranking is leaving measurable quality on the table.
  • Chunking strategy: the unit of retrieval. Smaller chunks improve retrieval precision; larger chunks preserve context. Semantic or adaptive chunking outperforms fixed-size splits, at a higher ingestion cost. This is a PM-level decision because the cost-quality tradeoff applies.
  • Recall@k: does the correct document appear in the top k retrieved results? This measures the retrieval step.
  • Faithfulness: does the generated answer stay grounded in the retrieved context, rather than drifting into model priors? This measures the generation step.
  • Answer relevance: does the answer address the question? Distinct from faithfulness; a faithful but irrelevant answer fails the user just as completely.
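To make the retrieval-side metrics concrete, here is a minimal sketch of Recall@k as defined in the list above (did a correct document appear in the top k?) together with Mean Reciprocal Rank. The function names and the tiny labeled set are hypothetical; a real eval set would hold many more queries, each labeled with its correct documents.

```python
def recall_at_k(retrieved, relevant, k):
    """1.0 if any correct document appears in the top k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first correct document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """Average both metrics over (retrieved_ids, relevant_ids) pairs."""
    n = len(runs)
    recall = sum(recall_at_k(r, rel, k) for r, rel in runs) / n
    mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return recall, mrr


# Hypothetical labeled queries: (ranked retrieval output, set of correct ids).
runs = [
    (["d1", "d2", "d3"], {"d2"}),        # correct doc at rank 2
    (["d4", "d5", "d6"], {"d9"}),        # correct doc never retrieved
    (["d7", "d8"], {"d7"}),              # correct doc at rank 1
    (["a", "b", "c", "d"], {"d"}),       # correct doc at rank 4, outside top 3
]

recall, mrr = evaluate(runs, k=3)
print(f"Recall@3 = {recall}, MRR = {mrr}")
```

Both numbers are computed without ever calling the generator, which is what lets a team tell a retrieval failure apart from a generation failure. Faithfulness and answer relevance need a separate generation-side eval, which is not sketched here.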

For a deeper look at running evals as a product discipline, see the eval harness for PMs. For the full decision on when RAG, fine-tuning, or prompting is the right call, see RAG, fine-tuning, or prompt: how to choose.
