ai · hard
When does a feature warrant an agent rather than a simpler LLM call?
This question is now standard at AI-native companies in 2026, and it exposes a specific failure mode: candidates who treat “agentic” as a quality signal rather than an architectural tradeoff. The interviewer is not testing whether you can define an agent. They are testing whether you would ship one responsibly.
The 2026 reframe
Before 2025, the agent question had a hidden feasibility gate: can the model actually chain steps reliably? That gate is largely gone. Frontier models handle multi-step workflows well enough that almost any workflow is buildable. The question has shifted entirely to: who bears the cost of a wrong step?
If the user can see and absorb the error (visible, reversible, cheap to fix), a single-shot LLM call is often sufficient. If the system acts before the user can intervene, you need a different standard before you start building. Viable and lovable are now the real constraints: viable means the automation value per loop exceeds cost-per-query at realistic volume; lovable means users trust the output enough that they don’t spend more time reviewing it than they would have spent doing it themselves.
The three-gate framework
Run every candidate feature through three gates in order. If it fails any gate, reach for a single-call LLM with a tight prompt first.
Gate 1: Reversibility. Classify every action the agent would take autonomously:
- Freely reversible: undo is instant and complete (autocomplete you ignore, a draft you delete).
- Compensable: reversible, but at a cost (recalling a sent email costs social capital; refunding a charge costs money).
- Irreversible: deleted data, published content, executed financial transaction.
Irreversible actions at any blast radius require human confirmation by default. You do not remove that confirmation step until trust-threshold data earns it.
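One way to make Gate 1 concrete is to encode the three classes and the default confirmation rule. This is a minimal sketch, not a prescribed implementation: the names `Reversibility`, `strictest_class`, and `needs_human_confirmation` are hypothetical, and treating compensable actions as unconfirmed by default is an assumption, since the rule above only sets a default for irreversible actions.

```python
from enum import Enum


class Reversibility(Enum):
    FREELY_REVERSIBLE = 1  # undo is instant and complete
    COMPENSABLE = 2        # reversible, but at a cost
    IRREVERSIBLE = 3       # cannot be undone


def strictest_class(actions: list[Reversibility]) -> Reversibility:
    """A feature is only as safe as its least reversible autonomous action."""
    return max(actions, key=lambda r: r.value)


def needs_human_confirmation(action: Reversibility, trust_earned: bool = False) -> bool:
    """Irreversible actions require confirmation until trust data earns its removal.

    Assumption: compensable and freely reversible actions run unconfirmed here;
    a team could reasonably choose a stricter default for compensable ones.
    """
    if action is Reversibility.IRREVERSIBLE:
        return not trust_earned
    return False


# Hypothetical feature: drafts a reply (freely reversible), then sends it (compensable).
feature = [Reversibility.FREELY_REVERSIBLE, Reversibility.COMPENSABLE]
worst = strictest_class(feature)
```

Classifying by the strictest action keeps one irreversible step from hiding behind several harmless ones.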
Gate 2: Trust threshold. The agent earns autonomy when its task completion rate (TCR) beats the user’s own success rate on the same task. If a user gets the task right 90% of the time by hand and your agent completes it correctly 70% of the time, the user is doing QA on a slower version of themselves. That is neither viable nor lovable. Only when TCR exceeds the manual baseline does reducing confirmation steps make product sense.
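The Gate 2 comparison is simple enough to sketch directly. The function names and the counts below are hypothetical; only the 90% manual baseline and the 70% agent rate come from the example above.

```python
def task_completion_rate(successes: int, attempts: int) -> float:
    """TCR = tasks completed correctly / tasks attempted."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


def clears_trust_threshold(agent_tcr: float, manual_baseline: float) -> bool:
    """The agent earns reduced confirmation only when it beats the manual baseline."""
    return agent_tcr > manual_baseline


# Hypothetical counts matching the 70% vs. 90% example in the text.
agent_tcr = task_completion_rate(successes=70, attempts=100)
earned = clears_trust_threshold(agent_tcr, manual_baseline=0.90)
```

With these inputs `earned` is `False`, so the confirmation steps stay in place.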
Gate 3: Cost-per-loop. A single-shot LLM call might cost $0.001. An agent loop with tool calls, retries, and memory reads can cost 10 to 100x that. The automation value per task must clear that bar at the volume you are projecting. If it does not, the simpler call wins on viability even if the agent would work.
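A quick back-of-envelope version of the Gate 3 math, as a sketch. The $0.001 single-call cost and the 10 to 100x multiplier come from the paragraph above; the $0.05 value per task and the 10,000 monthly tasks are made-up figures for illustration, and `net_monthly_value` is a hypothetical helper.

```python
def net_monthly_value(
    value_per_task: float,
    single_call_cost: float,
    loop_multiplier: float,
    monthly_tasks: int,
) -> float:
    """Automation value minus agent-loop cost, summed over projected volume."""
    loop_cost = single_call_cost * loop_multiplier
    return (value_per_task - loop_cost) * monthly_tasks


# Illustrative inputs: value_per_task and monthly_tasks are assumptions.
low_end = net_monthly_value(0.05, 0.001, loop_multiplier=10, monthly_tasks=10_000)
high_end = net_monthly_value(0.05, 0.001, loop_multiplier=100, monthly_tasks=10_000)

clears_gate_low = low_end > 0    # loop costs $0.01 per task
clears_gate_high = high_end > 0  # loop costs $0.10 per task
```

Under these assumed numbers the same feature clears the gate at a 10x loop and fails it at 100x, which is why the multiplier needs to be measured rather than guessed.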
The ladder of autonomy
The decision in an interview is not whether to eventually reach full autonomy. It is which rung you start on: suggest, partial-step with human approval, or act within guardrails. Start one rung below where you think you need to be, collect TCR and error-cost data, then earn the next rung.
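The ladder can be sketched as an ordered enum with a "start one rung lower" rule. The rung names and both helpers are hypothetical, and promoting on TCR alone is a simplifying assumption: the text also calls for error-cost data before earning the next rung.

```python
from enum import IntEnum


class Rung(IntEnum):
    SUGGEST = 0                # agent proposes, user acts
    PARTIAL_WITH_APPROVAL = 1  # agent acts step by step, human approves each step
    ACT_WITHIN_GUARDRAILS = 2  # agent acts autonomously inside set limits


def starting_rung(estimated_need: Rung) -> Rung:
    """Start one rung below the estimate, never below SUGGEST."""
    return Rung(max(estimated_need - 1, Rung.SUGGEST))


def promote(current: Rung, agent_tcr: float, manual_baseline: float) -> Rung:
    """Move up one rung only when TCR beats the manual baseline (assumed criterion)."""
    if agent_tcr > manual_baseline and current < Rung.ACT_WITHIN_GUARDRAILS:
        return Rung(current + 1)
    return current


start = starting_rung(Rung.ACT_WITHIN_GUARDRAILS)
after_review = promote(start, agent_tcr=0.93, manual_baseline=0.90)
```

Here the feature launches at `PARTIAL_WITH_APPROVAL` and moves to `ACT_WITHIN_GUARDRAILS` only after the data supports it.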
Real examples
Cursor’s tab autocomplete is a single-shot call: fast, cheap, freely reversible (you ignore it). Cursor’s “fix all errors” is agentic: it makes sequential file edits that can be hard to untangle. That is the right split because the blast radius is different.
Harvey keeps legal drafting as a copilot (a lawyer reviews every word) and uses agents only for research retrieval. A wrong clause in a contract is compensable-to-irreversible. A wrong search result is freely reversible: the lawyer catches it and discards it. The reversibility classification determines the architecture.
The anti-pattern: Slack’s AI summarization did not need to be an agent. A single LLM call over the last N messages does the job, is cheaper and faster, and has no multi-step failure surface. Over-agentifying is the mistake interviewers at Glean and OpenAI probe for specifically.
Eval requirements differ
Agents require trajectory evaluation: was each step in the chain correct? A single-call LLM feature needs only output evaluation: was the final answer correct? Trajectory evals are harder to build and maintain. That engineering cost is part of the product decision and belongs in your answer.
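The difference between the two eval styles can be shown with a toy example. This is a sketch under stated assumptions: `Step`, `output_eval`, and `trajectory_eval` are hypothetical names, exact string matching stands in for whatever grader a team actually uses, and the per-step `ok` flag is assumed to come from a separate grading pass.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str  # name of the tool the agent called (illustrative)
    ok: bool   # whether a grader judged this step correct


def output_eval(final_answer: str, expected: str) -> bool:
    """Output evaluation: only the final answer is checked."""
    return final_answer.strip() == expected.strip()


def trajectory_eval(steps: list[Step]) -> bool:
    """Trajectory evaluation: every step in the chain must be correct."""
    return all(step.ok for step in steps)


# A run that reaches the right answer through a wrong intermediate step.
run = [Step("search", ok=True), Step("edit_file", ok=False), Step("summarize", ok=True)]
passes_output = output_eval("42", "42")
passes_trajectory = trajectory_eval(run)
```

The run passes the output check and fails the trajectory check, which is the gap that makes agent evals more expensive to build and maintain.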
strong
"I run three gates before recommending an agent. First: reversibility. If any action the system would take autonomously is irreversible or compensable at scale, I start with human confirmation on that step and only remove it after TCR and error-cost data earns it. Second: trust threshold. The agent earns autonomy when its task completion rate beats what the user does manually, because until that point the user is doing QA on a slower version of themselves. Third: cost-per-loop. Agents burn 10 to 100x the tokens of a single call; the automation value per task has to clear that bar at the volume we are projecting. If a feature passes all three gates, an agent is the right architecture. If it fails any of them, I reach for a single-call LLM with a tight prompt, ship it, and layer in agentic steps only when I have evidence the simpler version is the binding constraint. The failure mode I am most worried about is over-agentifying: turning a Slack summarizer into an agent when a single context-window call does the same job with zero multi-step failure surface. And I'd flag to the interviewer that agents require trajectory evals, not just output evals, which adds engineering cost that belongs in the build decision."
weak
"I'd use an agent when the task is complex and requires multiple steps or tool use, and a simpler model when it's a single-step generation task." This fails because complexity is the only axis. Reversibility and trust are absent. Cost has no mention. It gives the interviewer nothing to probe and demonstrates no product judgment about shipping responsibly. It is the answer someone gives after reading one blog post about agents.
The PM judgment
Interviewers at AI-native companies in 2026 are specifically checking whether candidates understand that agentic is not a quality signal. It is an architectural choice with real cost and risk tradeoffs. The strong answer names the reversibility classification explicitly, operationalizes trust threshold with a concrete proxy (TCR vs. manual baseline), acknowledges the cost-per-loop math, and identifies over-agentifying as the failure mode to avoid. Every element of that answer is something a PM would actually need to defend in a spec review.
For more on the underlying cost math, see LLM unit economics. For the guardrails design question that often follows this one, see design agent guardrails.