Build an eval portfolio project (the 48-hour version)

Updated Jun 2026 · Calibrated to the strong-hire bar

Almost no PM candidate in 2026 has actually built an eval harness. Interviewers at AI-first companies call it the single highest-signal portfolio artifact they see, and also the rarest. The OpenAI CPO said in 2025: “the most important thing a product manager can learn to do is write evals.” This page gives you the specific, scoped build path to produce one for real.

The 2026 frame first: feasibility is effectively free. Anyone can vibe-code a working AI feature before lunch. The interview question is now whether the output quality justifies what users pay and whether they will come back. An eval project answers that in artifact form. Viability lives in your pass rate thresholds and cost-per-query math. Lovability lives in the edge cases and failure modes you chose to test. A strong eval project proves you think like a PM who ships with confidence, not just speed.

The scoping decision (where most people fail)

Pick one public AI feature that already exists and that you can reach as a user. Do not build something new. Good targets: Perplexity’s answer engine, ChatGPT’s code interpreter, Notion AI’s summarize, GitHub Copilot’s autocomplete, or any AI support agent you can reach via chat widget. The goal is to evaluate an existing feature against a standard you define, not to build novelty.

A strong project scope answers three questions before you write a single golden-set row:

  • What task does the feature claim to do? (The product’s own framing)
  • What does “correct” look like for a real user completing that task? (Behavior definition)
  • What failure mode, if common, would cause a user to stop paying for it? (The viability bar)

That third question is where product judgment shows. Naming the failure mode that breaks retention rather than the one that breaks accuracy is the signal interviewers are looking for.

Build the golden dataset

Fifty rows is the right starting count. Go much above 50 and you will not finish in 48 hours. Go much below and the results are mostly noise.

Each row needs: an input prompt, your labeled correct or ideal output, a notes field for why you chose it, and columns for the actual model output and score. That is the full schema.
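As a minimal sketch of that schema, assuming you keep the set in a CSV you can open in a spreadsheet: the field names below are illustrative, not a standard, and the sample row is a made-up example.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class GoldenRow:
    """One row of the golden set. Field names are illustrative assumptions."""
    input_prompt: str
    ideal_output: str
    notes: str               # why this row earned a place in the set
    actual_output: str = ""  # filled in after the run
    score: str = ""          # filled in after scoring


def to_csv(rows):
    """Serialize rows to CSV text with one column per schema field."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(GoldenRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()


# Hypothetical example row for a summarization feature.
rows = [
    GoldenRow(
        input_prompt="Summarize this 3-paragraph meeting note.",
        ideal_output="Two sentences naming the decision and the owner.",
        notes="Core scenario: the most common summarize request.",
    )
]
csv_text = to_csv(rows)
print(csv_text)
```

Leaving `actual_output` and `score` empty until the run keeps your labels independent of what the model says.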

Weight the set deliberately:

  • 60% core scenarios: the bread-and-butter task the feature handles most of the time
  • 25% edge cases: uncommon or ambiguous inputs that reveal where the model gets brittle
  • 15% adversarial inputs: prompts designed to trigger hallucination, refusal, or off-topic output
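One practical wrinkle: 25% and 15% of 50 are 12.5 and 7.5, so the split has to be rounded. The sketch below uses largest-remainder rounding and, as an arbitrary tie-break, hands the spare row to whichever category is listed first. The function and category names are assumptions for illustration.

```python
def allocate(total, weights_pct):
    """Split `total` rows by integer percentages using largest-remainder rounding."""
    assert sum(weights_pct.values()) == 100
    # Integer arithmetic avoids float rounding surprises.
    counts = {k: (total * pct) // 100 for k, pct in weights_pct.items()}
    leftover = total - sum(counts.values())
    # Give leftover rows to the categories with the largest fractional part.
    # sorted() is stable, so ties go to the category listed first.
    by_remainder = sorted(
        weights_pct, key=lambda k: (total * weights_pct[k]) % 100, reverse=True
    )
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


WEIGHTS = {"core": 60, "edge": 25, "adversarial": 15}
counts = allocate(50, WEIGHTS)
print(counts)
```

For 50 rows this yields 30 core, 13 edge, and 7 adversarial. Moving the spare row to adversarial (30/12/8) is just as defensible; what matters is that you can say why.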

Most candidates build their sets too easy. The tail is where products actually fail users, and the tail is what interviewers care about. Your curation choices are the PM signal. Choosing the edge case that exposes a real product risk shows judgment. Choosing 50 variations of the happy path shows you do not understand what evals are for.

Golden dataset curation is your primary artifact. It is more important than the model outputs or the scores.

Run it across 2-3 models

Run the same 50 inputs through at least two models (for example GPT-4o and Claude Sonnet, or the product’s own API versus a cheaper model). A cross-model comparison is far more interesting than a single-model result. Cost at this scale is $10 to $50 total using standard API pricing.
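The run loop itself can be very small. This is a sketch, assuming each model is wrapped in a plain function that takes a prompt and returns text. `model_a` and `model_b` are placeholders that make no network calls; swap in your real API clients.

```python
from typing import Callable, Dict, List


def run_models(prompts: List[str], models: Dict[str, Callable[[str], str]]):
    """Send every prompt to every model; return one record per (model, prompt)."""
    results = []
    for name, call in models.items():
        for i, prompt in enumerate(prompts):
            try:
                output, error = call(prompt), ""
            except Exception as exc:
                # Keep going: a failed call is itself a data point worth logging.
                output, error = "", repr(exc)
            results.append(
                {"model": name, "row": i, "prompt": prompt,
                 "output": output, "error": error}
            )
    return results


# Placeholder "models" for illustration only. Replace with real API calls.
def model_a(prompt: str) -> str:
    return f"[A] {prompt.upper()}"


def model_b(prompt: str) -> str:
    return f"[B] {prompt[::-1]}"


results = run_models(
    ["what is 2+2?", "cite a source"],
    {"model_a": model_a, "model_b": model_b},
)
for record in results:
    print(record["model"], record["row"], record["output"])
```

Storing one flat record per model and row makes the later steps (judging, spot-checking, the comparison table) simple filters over the same list.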

For open-ended outputs, use an LLM-as-judge with a written rubric. Write one to two plain-English sentences per quality dimension describing what “good” looks like, then score each output 1-5 per dimension.
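A rubric like that can live in a small dictionary that feeds the judge prompt. The sketch below is one possible shape, not a standard format: the dimension names, prompt wording, and `name: score` reply format are all assumptions. It builds the prompt and parses the reply; the actual call to a judge model is left out.

```python
import re

# Illustrative rubric: one plain-English sentence per dimension.
RUBRIC = {
    "factual_accuracy": "Every claim matches the reference answer or is verifiably true.",
    "uncertainty": "The answer flags doubt on ambiguous questions instead of guessing.",
}


def build_judge_prompt(question, answer, reference):
    """Assemble the text you would send to the judge model."""
    lines = ["Score the ANSWER from 1 (poor) to 5 (excellent) on each dimension.", ""]
    for name, description in RUBRIC.items():
        lines.append(f"- {name}: {description}")
    lines += [
        "",
        f"QUESTION: {question}",
        f"REFERENCE: {reference}",
        f"ANSWER: {answer}",
        "",
        "Reply with one line per dimension, formatted as `name: score`.",
    ]
    return "\n".join(lines)


def parse_scores(reply):
    """Extract `name: score` pairs; fail loudly on missing or out-of-range scores."""
    scores = {}
    for name, value in re.findall(r"^\s*(\w+)\s*:\s*(\d+)\s*$", reply, flags=re.MULTILINE):
        if name in RUBRIC and 1 <= int(value) <= 5:
            scores[name] = int(value)
    missing = set(RUBRIC) - set(scores)
    if missing:
        raise ValueError(f"judge reply missing or invalid for: {sorted(missing)}")
    return scores


print(build_judge_prompt("Who wrote Dune?", "Frank Herbert.", "Frank Herbert"))
print(parse_scores("factual_accuracy: 5\nuncertainty: 4"))
```

Failing loudly on a malformed judge reply is deliberate: a silently dropped score skews your averages in a way you will not notice.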

Two concepts to understand and be able to explain:

LLM-as-judge bias. When you use one LLM to score another’s outputs, the judge model tends to favor outputs that match its own style and phrasing. This measurably inflates scores for outputs from the same model family as the judge. The mitigation: use a different model family as your judge than the one being evaluated, and manually spot-check at least 10% of cases where the judge scored 4 or 5. Document that you did this. The spot-check step is what separates a candidate who understands evals from one who has read a blog post about them.
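Both mitigations are easy to make mechanical. The sketch below assumes each scored row is a dictionary with a `judge_score` key and that you maintain your own model-to-family mapping; both are illustrative choices rather than any library's API.

```python
import math
import random

# Model-to-family mapping you maintain yourself; the keys are illustrative labels.
FAMILY = {"gpt-4o": "openai", "claude-sonnet": "anthropic"}


def pick_judge(candidate, judges):
    """Return the first judge whose family differs from the candidate's."""
    for judge in judges:
        if FAMILY[judge] != FAMILY[candidate]:
            return judge
    raise ValueError(f"no cross-family judge available for {candidate}")


def spot_check_sample(scored_rows, fraction=0.10, seed=0):
    """Pick rows for manual review from those the judge scored 4 or 5."""
    high = [r for r in scored_rows if r["judge_score"] >= 4]
    if not high:
        return []
    # Round up so the sample never drops below the stated fraction.
    k = max(1, math.ceil(fraction * len(high)))
    # A fixed seed makes the sample reproducible, so you can document it.
    return random.Random(seed).sample(high, k)


# Synthetic scores for illustration: 50 rows cycling through 1..5.
scored = [{"row": i, "judge_score": 1 + i % 5} for i in range(50)]
to_review = spot_check_sample(scored)
print(pick_judge("gpt-4o", ["gpt-4o", "claude-sonnet"]))
print([r["row"] for r in to_review])
```

Recording the seed and the row numbers you reviewed is what lets you claim the spot-check happened.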

Pass@k. The probability that at least one of k independent model attempts on the same input is correct, usually reported as the fraction of inputs where at least one of k attempts passes. A PM should know this because it drives product decisions about retries, latency, and cost. If pass@1 is 60% but pass@3 is 92%, you have a real product choice: run three attempts in parallel (higher cost, lower latency) or in sequence (lower cost, higher latency). That tradeoff is a PM decision, not an engineering one.
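To put numbers on pass@k, meaning the chance that at least one of k attempts on an input is correct, here is a small sketch. The first function is the commonly used unbiased estimator from n sampled attempts with c correct. The other two assume every attempt succeeds independently with the same probability p, which is a simplification.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one input from n sampled attempts, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so one draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_independent(p, k):
    """pass@k if every attempt succeeds independently with probability p."""
    return 1.0 - (1.0 - p) ** k


def expected_sequential_calls(p, k):
    """Expected calls when retrying one at a time, stopping at first success."""
    return sum((1.0 - p) ** i for i in range(k))


p, k = 0.60, 3
print(f"pass@{k} under independence: {pass_at_k_independent(p, k):.3f}")
print(f"parallel calls per query:   {k}")
print(f"sequential calls per query: {expected_sequential_calls(p, k):.2f}")
```

With a 60% single-attempt pass rate, independence implies pass@3 of 93.6%. A measured figure a little lower, such as 92%, is what you would expect when some inputs are simply harder than others. Parallel always pays for three calls, while sequential retries average about 1.56 calls per query but make failing queries wait longer.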

Document failure modes, not just wins

This section separates a portfolio project from a homework assignment.

For each failure mode you find, document four things:

  • The failure category (hallucination, refusal, off-topic output, formatting error, factual error)
  • A specific example input that triggered it
  • Your hypothesis for why the model failed
  • What a PM would do about it: prompt change, retrieval layer, model swap, user-facing guardrail, or accept and monitor
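If you want the failure log to stay consistent as it grows, the four fields above map onto a small record type. This is a sketch: the category and action lists mirror the ones named on this page, the class and field names are assumptions, and both entries are hypothetical examples rather than real findings.

```python
from collections import Counter
from dataclasses import dataclass

CATEGORIES = {"hallucination", "refusal", "off-topic output",
              "formatting error", "factual error"}
ACTIONS = {"prompt change", "retrieval layer", "model swap",
           "user-facing guardrail", "accept and monitor"}


@dataclass(frozen=True)
class FailureMode:
    category: str
    example_input: str
    hypothesis: str
    pm_action: str

    def __post_init__(self):
        # Reject labels outside the agreed lists so the log stays comparable.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.pm_action not in ACTIONS:
            raise ValueError(f"unknown action: {self.pm_action}")


# Hypothetical entries for illustration only.
log = [
    FailureMode("hallucination", "Cite the source for this statistic.",
                "No retrieval hit, so the model invented a URL.", "retrieval layer"),
    FailureMode("refusal", "Summarize this medical leaflet.",
                "Safety filter fired on health vocabulary.", "prompt change"),
]
by_category = Counter(f.category for f in log)
print(by_category.most_common())
```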

A portfolio showing a hallucination rate that fell from 18% to 4% after a system prompt change signals system ownership. A portfolio showing only the final 4% without the before is just a number. Show the before, the intervention, and the after. Show a failure mode you investigated but could not fix, and name why. Honest failure documentation is a stronger signal than a clean scorecard.

Failure mode categories worth documenting at PM depth (not engineering depth): false confidence, scope creep (the model answers a question it was not asked), inconsistency across near-identical inputs, and failure to decline on inputs it should not answer.

Publish results with regression tracking

Put the project somewhere linkable: a public Notion doc, a GitHub repo with a README, or a short writeup on your personal site. The format matters less than the content structure. Each portfolio page should include:

  • Problem framing: why does this task need a model at all?
  • Behavior definition: what does correct output mean, precisely?
  • System architecture: which model, which API, what system prompt (if testable)
  • Eval strategy: how you built the golden set, the metrics, and the judge
  • Failure modes and what you did about them
  • Meaningful metrics: before/after, or cross-model comparison
  • Future improvements: what you would test next and why

Include a regression table showing model A versus model B across your dimensions. That table is what an interviewer can point to and probe. It gives them a real artifact to question, which is exactly what you want.
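A plain-text version of that table takes only a few lines to generate. The sketch below assumes you have a list of 1-5 judge scores per dimension for each model. The scores shown are made-up placeholders, and the column layout is one choice among many.

```python
def regression_table(scores_a, scores_b, label_a="model A", label_b="model B"):
    """Render mean score per dimension for two models, plus the difference."""
    header = f"{'dimension':<20}{label_a:>10}{label_b:>10}{'delta':>8}"
    lines = [header, "-" * len(header)]
    for dim in scores_a:
        a = sum(scores_a[dim]) / len(scores_a[dim])
        b = sum(scores_b[dim]) / len(scores_b[dim])
        # delta is B minus A, so a positive value means model B scored higher.
        lines.append(f"{dim:<20}{a:>10.2f}{b:>10.2f}{b - a:>+8.2f}")
    return "\n".join(lines)


# Placeholder scores for illustration, not real results.
scores_a = {"factual_accuracy": [4, 5, 3, 4], "uncertainty": [2, 3, 2, 3]}
scores_b = {"factual_accuracy": [5, 5, 4, 4], "uncertainty": [4, 4, 3, 5]}
table = regression_table(scores_a, scores_b)
print(table)
```

The same function works for before/after runs of a single model: pass the pre-change scores as A and the post-change scores as B.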

How to talk about this project under pressure

Interviewers at Anthropic and OpenAI expect candidates to name failure modes unprompted. The safety and alignment round at Anthropic is separate and mandatory. At OpenAI it is woven throughout every case. Both expect you to have a specific example ready.

The strong interview narrative sounds like this: “I picked Perplexity’s answer engine and built a 50-row golden dataset weighted toward edge cases and adversarial inputs. I defined three metrics: factual accuracy, hallucination rate on source attribution specifically, and appropriate uncertainty signaling. I ran it across GPT-4o and Claude Sonnet, used Claude as the judge for GPT-4o outputs and vice versa to reduce style bias, and manually spot-checked every output the judge scored 4 or 5. The most interesting finding was that both models hallucinated source URLs at a similar rate on ambiguous queries, but only one model signaled uncertainty appropriately. That gap is a product decision: do you let the model answer confidently and handle downstream corrections, or do you surface the uncertainty to the user and accept a higher friction rate? I documented both options with their tradeoffs.”

That answer is specific, names the failure mode, frames it as a product decision, and demonstrates you understand the tradeoff between output quality and user experience. Every claim in it is auditable from the artifact you built.

Weak tells to avoid: UI screenshots without system context, generic metrics (DAU/CTR without connection to model behavior), long PRDs, and any “AI was used” statement without specifics on how and what you measured.

Which companies care most about this signal

At AI-first labs (Anthropic, OpenAI, xAI, Cursor, Perplexity), the eval portfolio is the most important artifact in your submission. Interviewers will probe specifics: which model degraded, what the eval metric was, what the business impact was. The probe sounds like: “Tell me about a time your model degraded in production: name the architecture, the eval metrics, and the business impact.”

At enterprise AI shops (Microsoft, Google, Salesforce), the portfolio still matters but you will be asked more about stakeholder alignment and org-level rollout alongside the technical eval detail. For both contexts, the project differentiates on specificity, not polish. A three-page Notion doc with a real regression table beats a formatted slide deck with no numbers.

The underlying frame for your pass rate thresholds should be the viable/lovable question: does this output quality clear the bar for users to pay for it and return to it? Not “does it work” (feasibility is free), but “would someone choose this over doing the task themselves?” Set your thresholds against that question and you will have something worth defending in any interview room.
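One way to make that threshold concrete is to write it down as a check that combines pass rate with cost per query. This is a sketch only: the token counts, per-million-token prices, and thresholds are placeholder assumptions, not real pricing or a recommended bar.

```python
def cost_per_query(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Cost of one query in USD, given prices per million tokens."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def clears_bar(pass_rate, cost_usd, min_pass_rate, max_cost_usd):
    """True only when quality and unit cost both meet the thresholds you set."""
    return pass_rate >= min_pass_rate and cost_usd <= max_cost_usd


# Placeholder numbers for illustration only.
cost = cost_per_query(input_tokens=1200, output_tokens=400,
                      usd_per_m_input=3.0, usd_per_m_output=15.0)
print(f"cost per query: ${cost:.4f}")
print(clears_bar(pass_rate=0.88, cost_usd=cost, min_pass_rate=0.85, max_cost_usd=0.02))
```

The code is the easy part. The defensible part is explaining why you chose 85% and two cents rather than some other pair.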

For the full build guide on the harness mechanics, see Build an eval harness in an afternoon. For the underlying frame on why feasibility no longer differentiates, see Feasibility is free. For the lovability lens that should drive your pass rate thresholds, see Lovable, not just usable.