Embeddings
Numerical representations of meaning that allow retrieval by semantic similarity rather than exact keyword match.
Embeddings encode meaning as coordinates. When a user types “I want my money back” and your help center article is titled “Refund Policy,” keyword search returns nothing because the words do not overlap. Embedding-based search returns the right article because both phrases map to nearly the same point in vector space. That is the actual gain: retrieval by what users mean, not what they typed.
In 2026, the PM question is not “can we do semantic search?” Spinning up a vector database and calling an embedding API takes an afternoon. The question is whether embedding-based retrieval is the right architecture for this product, and whether the quality gain justifies the cost over a simpler keyword system. That level of decision-making is what AI lab interviewers at OpenAI, Anthropic, and Google are actually testing.
How embeddings work (in PM terms)
A model converts text, images, or other inputs into a list of numbers (a vector) where position in high-dimensional space encodes meaning. Similar meaning equals nearby coordinates. The canonical failure case to know for interviews: the word “bank” (financial institution vs. riverbank) can land in ambiguous territory because a single word carries both meanings. Embeddings work on context, so full sentences are far more stable than single tokens.
Vector dimensions in production range from 1,536 (OpenAI text-embedding-3-small) to 3,072 (top commercial models). You pay storage, memory, and compute for every dimension. As a rule of thumb, similarity scores for useful retrieval land between 0.6 and 0.9 cosine similarity, and anything below 0.6 is usually noise. The exact cutoff varies by model, so calibrate it on your own data. Set product alerts at your threshold, not just infra alerts.
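To make the similarity score concrete, here is a minimal sketch of how cosine similarity is computed and compared against a cutoff. The three-dimensional vectors are hypothetical stand-ins for real embeddings (which have thousands of dimensions), and the 0.6 threshold is the rule of thumb from the paragraph above, not a universal constant.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; a real model would return these.
refund_query = [0.9, 0.1, 0.2]       # "I want my money back"
refund_article = [0.8, 0.2, 0.1]     # "Refund Policy"
shipping_article = [0.1, 0.9, 0.3]   # "Shipping Times"

SIMILARITY_THRESHOLD = 0.6  # assumption: calibrate per model and dataset


def is_relevant(query: list[float], doc: list[float],
                threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Treat a document as a useful match only above the threshold."""
    return cosine_similarity(query, doc) >= threshold


print(is_relevant(refund_query, refund_article))    # True
print(is_relevant(refund_query, shipping_article))  # False
```

The threshold is the number a product alert would watch: if the share of queries whose best match falls below it starts climbing, users are getting noise.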
The model landscape PMs need to know
Four reference points for 2026, framed as product decisions:
- OpenAI text-embedding-3-large: 64.6 MTEB score, the most widely deployed commercial model. The safest default when your team is already on the OpenAI stack.
- NV-Embed-v2: 72.31 MTEB, best English-only performance in 2026. Worth evaluating if retrieval quality is the primary constraint.
- Qwen3-Embedding-8B: 70.6 MTEB, multilingual leader. The right choice when your product serves multiple languages.
- Google text-embedding-005: $0.006 per million tokens, roughly 30x cheaper than top-tier competitors. If you are embedding millions of documents, this cost delta belongs in your product decision, not just in the infra ticket.
MTEB scores average embedding quality across a suite of benchmark tasks, retrieval among them. Higher is better, but the gap between 64 and 72 matters less than whether retrieval quality maps to task completion for your specific users. Run evals on your own data before committing.
Indexing: the cost/quality tradeoff PMs own
Two indexing strategies define the cost curve at scale:
HNSW achieves 95-99% recall at sub-millisecond latency for tens of millions of vectors, at 1.5-2x the storage cost of raw vectors. Use it when latency and recall are both critical.
IVF-PQ reduces memory by 10-50x for 100M+ vector datasets at 90-95% recall. Use it when you are operating at scale and can accept slightly lower recall. What retrieval miss rate is acceptable, and what does it cost the user when retrieval fails? That is a PM decision, not an engineering default.
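A back-of-envelope sketch of what those multipliers mean in gigabytes. It assumes vectors are stored as 32-bit floats, reads the HNSW figure as total size relative to raw vectors, and reads the IVF-PQ figure as a reduction relative to raw vectors. The corpus size and dimension count are hypothetical; real index sizes depend on the library and its parameters.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats


def raw_vector_bytes(num_vectors: int, dimensions: int) -> int:
    """Size of the raw vectors alone, before any index structure."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


def hnsw_bytes_range(raw_bytes: int) -> tuple[float, float]:
    """Apply the 1.5-2x storage multiplier quoted above for HNSW."""
    return raw_bytes * 1.5, raw_bytes * 2.0


def ivf_pq_bytes_range(raw_bytes: int) -> tuple[float, float]:
    """Apply the 10-50x memory reduction quoted above for IVF-PQ."""
    return raw_bytes / 50, raw_bytes / 10


def to_gb(num_bytes: float) -> float:
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


# Hypothetical corpus: 100M vectors at 1,536 dimensions.
raw = raw_vector_bytes(100_000_000, 1536)
hnsw_low, hnsw_high = hnsw_bytes_range(raw)
pq_low, pq_high = ivf_pq_bytes_range(raw)

print(f"raw vectors: {to_gb(raw):.1f} GB")
print(f"HNSW:        {to_gb(hnsw_low):.1f}-{to_gb(hnsw_high):.1f} GB")
print(f"IVF-PQ:      {to_gb(pq_low):.1f}-{to_gb(pq_high):.1f} GB")
```

Under these assumptions the raw vectors alone come to about 614 GB, which is the scale where the recall given up by IVF-PQ starts buying a meaningfully smaller memory bill.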
When NOT to use embeddings
This is what separates PM thinking from textbook recall.
Embeddings are wrong for: exact identifiers (error codes, SKUs, account numbers), structured lookups where boolean logic is cleaner, small datasets where keyword search is fast and accurate enough, and any case where a single missed exact match is worse than a semantically close result. An ambiguous word like “bank” is the canonical case where embedding context helps; an exact product code is the opposite case, where only a literal match is correct.
The cost dimension is real. Embedding a million documents at $0.006 per million tokens is cheap, even at several hundred tokens per document. Reranking adds latency. Storing 3,072-dimension vectors for 100M documents (about 1.2 TB as raw 32-bit floats, roughly 1.8 TB at 1.5x storage overhead) is an infrastructure cost that needs a PM’s budget read before it becomes a sunk cost.
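The embedding bill depends on tokens, not documents, so the missing variable is average document length. A minimal sketch of the arithmetic, assuming a hypothetical 500 tokens per document and a hypothetical comparison price set at 30x the quoted $0.006 rate:

```python
def embedding_cost_usd(num_documents: int,
                       avg_tokens_per_document: int,
                       price_per_million_tokens: float) -> float:
    """One-time cost to embed a corpus at a per-million-token price."""
    total_tokens = num_documents * avg_tokens_per_document
    return total_tokens / 1_000_000 * price_per_million_tokens


NUM_DOCUMENTS = 1_000_000
AVG_TOKENS_PER_DOCUMENT = 500        # hypothetical; measure on your corpus
LOW_PRICE = 0.006                    # USD per million tokens, quoted above
HIGH_PRICE = LOW_PRICE * 30          # hypothetical "30x" comparison price

low_cost = embedding_cost_usd(NUM_DOCUMENTS, AVG_TOKENS_PER_DOCUMENT, LOW_PRICE)
high_cost = embedding_cost_usd(NUM_DOCUMENTS, AVG_TOKENS_PER_DOCUMENT, HIGH_PRICE)

print(f"cheaper model:  ${low_cost:,.2f}")
print(f"pricier model:  ${high_cost:,.2f}")
```

At this scale both totals are small next to storage and serving costs. The gap only becomes a budget line when the corpus is re-embedded often or grows by orders of magnitude.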
Hybrid search as the 2026 production default
Pure vector search is not the standard for serious search products. Hybrid search, combining BM25 keyword similarity with vector similarity and a reranking layer, improves retrieval quality by 10-20% over pure vector search. The reason: BM25 wins on exact terms (product names, codes, rare proper nouns); vector search wins on natural language and paraphrase. The reranker reconciles both signals.
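One common way to merge the keyword and vector result lists before reranking is reciprocal rank fusion (RRF). The sketch below is an illustration of that fusion step only, not the reranker itself and not a claim about any particular product's pipeline. The document IDs are hypothetical, and `k = 60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores 1 / (k + rank) per list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from each retriever.
bm25_ranking = ["sku-4417-manual", "refund-policy", "shipping-faq"]
vector_ranking = ["refund-policy", "returns-howto", "sku-4417-manual"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still survives as a candidate. A reranking model would then reorder this fused shortlist.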
Standard RAG chunk size is 300-500 tokens with 10-15% overlap. This affects what gets embedded and what gets retrieved. Changing it later means re-embedding your corpus, so flag it as a semi-permanent architectural decision before the pipeline ships.
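A minimal sketch of fixed-size chunking with overlap. It assumes the text is already split into tokens; the stand-in token list below is hypothetical, and a real pipeline would use the embedding model's own tokenizer. The defaults (400 tokens, 50 overlap, so 12.5%) sit inside the ranges above but are illustrative.

```python
def chunk_tokens(tokens: list[str],
                 chunk_size: int = 400,
                 overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Hypothetical 1,000-token document.
document_tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(document_tokens)

print(len(chunks))                 # 3
print([len(c) for c in chunks])    # [400, 400, 300]
```

Every stored vector is a function of `chunk_size` and `overlap`, which is why changing either value later means re-embedding the whole corpus.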
Agentic RAG is the dominant production pattern in 2026: the LLM dynamically decides when and how to retrieve, handles query decomposition, and runs parallel retrieval. More failure surfaces, higher impact when it works.
Interview answer
strong
"Embeddings encode meaning as coordinates. When a user types 'I want my money back' and your article is titled 'Refund Policy,' keyword search fails because the words do not overlap. Embedding-based search returns the right result because both phrases land at nearly the same point in vector space. That is the advantage: retrieval by meaning, not exact match. In production this matters for search, recommendations, RAG pipelines, and deduplication. The key PM decision is when to use it: excellent for open-ended natural language queries, wrong for exact identifiers like error codes or SKUs where BM25 wins. In 2026, best production systems use hybrid search, vector similarity plus BM25 plus a reranking layer, which improves quality by 10-20% over pure vector search. To evaluate whether it is working, I track retrieval recall@k, map that to task completion rate, then to downstream business metrics like support deflection or conversion. The follow-on I always ask my team: what is the cost per query at expected volume, and does the quality gain justify it over a simpler keyword system?"
weak
"Embeddings turn words into numbers so computers can understand them." This stops at the mechanism and never connects to a product outcome, a tradeoff, or a decision. It signals the candidate read the Wikipedia entry but has never made a build decision with this technology. Saying "vectors in high-dimensional space" without grounding it in a user problem fails the same way from the other direction: textbook recitation, not PM thinking.
What the follow-up tests
Interviewers who ask “explain embeddings” follow up with: “How would you evaluate whether semantic search is working well for users?” This tests whether you can chain retrieval metrics to user outcomes to business impact. The answer runs: recall@k (is the system finding the right documents?) maps to task completion rate (are users finding what they need?) maps to support ticket reduction or conversion (does retrieval quality produce business value?). A PM who gives that full chain signals they have shipped this, not just studied it.
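The first link in that chain can be sketched in a few lines. This assumes a small labeled evaluation set where each query has a known set of relevant document IDs; the queries and IDs below are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(eval_set: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one per query."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    return sum(recall_at_k(r, rel, k) for r, rel in eval_set) / len(eval_set)


# Hypothetical labeled queries: (system's ranked results, known-relevant IDs).
eval_set = [
    (["refund-policy", "shipping-faq", "returns-howto"],
     {"refund-policy", "returns-howto"}),
    (["billing-faq", "refund-policy"],
     {"billing-faq"}),
]

print(mean_recall_at_k(eval_set, k=2))  # 0.75
```

Tracking this number per release, next to task completion rate, shows whether a retrieval change actually moved the user outcome.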
For the retrieval architecture decision, see “RAG vs fine-tuning” and “feasibility is free.” For the full RAG pipeline, see the RAG glossary entry.