ai · hard

"Design guardrails for an agent that acts on a user's behalf"

Design the guardrails for an agent that can act on a user's behalf across their calendar, inbox, and task management tools.

Updated Jun 2026 · Calibrated to the strong-hire bar

This question tests whether you treat guardrails as a product design problem or a compliance checkbox. The failure mode is proposing safety filters and confirmation dialogs as if all actions were equivalent. Interviewers at AI-first companies are listening for a specific signal: can you hold the tension between an agent locked down so tightly it cannot act and one that acts without friction and causes damage when it makes a mistake? Both are failure modes. Your job is to design against both at once.

The 4-point rubric interviewers use

Most interviewers on this question score four dimensions, in order. A strong candidate covers all four in one pass, unprompted.

1. Scope. What tools can the agent access, and who decides?

2. Confirmation. Which actions require approval, and what does the UX look like for each tier?

3. Limits. What hard ceilings exist on spend, steps, data access, and API calls?

4. Kill switch. What are the three distinct states it can enter, who owns it, and what happens to in-flight tasks?

Structure a strong answer

Open with the core tension, then walk the four layers. The worked example below uses an email-drafting agent to make the answer concrete.

strong

"The core tension: a guardrail that creates so much friction the user disables it is worse than no guardrail. Let me walk the four layers for an email agent that can read threads, draft replies, and queue sends.

Scope first. I apply least-privilege tool access: the agent gets read on threads, write on drafts, and the ability to queue a send for the confirmation step, and nothing else. It does not get direct send access, calendar write, or contact management until we've established trust through usage data. Scope is the cheapest guardrail and the first one to set. The platform defines the ceiling; the user configures within it. If the agent decides its own scope, you've already lost.

Confirmation second. I match the confirmation pattern to reversibility and magnitude, not to anxiety. Low-stakes reversible actions (draft created, thread read) get silent execution with a timestamped audit log. Medium-stakes actions (an external email queued for send, a reply to a thread the user hasn't read) get a brief notification with a 5-minute undo window. High-stakes irreversible actions (a send to more than 50 recipients, an email flagged as sensitive, an attachment over 1MB) require explicit one-tap approval before proceeding. Here's the failure mode I want to name: if the agent asks for confirmation on every action, users start rubber-stamping. Confirmation fatigue turns the guardrail into theater. Frequency of confirmation must be calibrated to reversibility and magnitude, not to the PM's nervousness.

Limits third. I set hard ceilings that neither the agent nor the user can override without an operator action: no more than 20 sends per session without a manual checkpoint; threads in the last 90 days only; API calls rate-limited to prevent runaway loops. Three-tier ownership: the platform defines defaults, the operator configures within them, the user adjusts within the operator's ceiling.

Kill switch fourth. This is where most answers fall apart. Three distinct states. Hard stop: terminate immediately, write no further state, surface a summary of what completed and what was abandoned. Soft pause: disable all writes, freeze the agent in read-only mode, surface every in-flight task for review before anything continues. Scoped block: disable one integration (say, the send queue) while the rest of the agent continues drafting. Scoped block is right for 'I noticed something wrong with outgoing sends but the drafting itself is fine.' Who owns each state? The user owns soft pause and scoped block via a control that is always one tap away. The platform's anomaly detector owns hard stop when behavioral signals exceed a threshold: unusual send volume, access to account segments the user hasn't touched before, actions at 3am on an account with a consistent daytime pattern. A named operator in an enterprise deployment holds an out-of-band hard stop that bypasses any user session.

One thing most answers miss: when the agent runs in a background session with no user present, the confirmation architecture has to change entirely. Inline approval prompts are useless if the user is asleep. For async agents, confirmation escalates to a mobile push with a time window: if the user doesn't respond within 15 minutes, the action is held for review, neither executed nor silently dropped. Hard stops for async agents must explicitly handle in-flight tasks: half-completed actions can cause more damage than the original error the kill switch was meant to prevent.

Finally, I'd name how I'd expand autonomy over time. Ladder of autonomy: suggest-only to start, then partial-step-plus-approval, then full autonomy within the guardrails we've built. The signal to move up the ladder is low error rate plus low override rate. If users are rarely correcting the agent, it's earned more scope."
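The scope and confirmation layers of the strong answer can be sketched in a few lines of Python. This is a minimal illustration, not a reference design: the names `Action`, `Tier`, `GRANTED_SCOPE`, and `classify` are hypothetical, the send queue is assumed to be in scope, 1MB is taken as 1,000,000 bytes, and every queued send that is not high-stakes is conservatively treated as medium-stakes.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    DENIED = "denied"            # outside the granted scope, never executed
    SILENT = "silent"            # execute and write a timestamped audit entry
    NOTIFY_UNDO = "notify_undo"  # notify the user, keep a 5-minute undo window
    APPROVE = "approve"          # block until explicit one-tap approval


# Hypothetical grant: read on threads, write on drafts, write on the send queue.
GRANTED_SCOPE = {("threads", "read"), ("drafts", "write"), ("send_queue", "write")}


@dataclass
class Action:
    tool: str
    verb: str
    recipients: int = 0
    flagged_sensitive: bool = False
    attachment_bytes: int = 0


def classify(action: Action) -> Tier:
    """Map one proposed action to a confirmation tier."""
    # Scope comes first: anything not explicitly granted is denied.
    if (action.tool, action.verb) not in GRANTED_SCOPE:
        return Tier.DENIED
    # Reading threads and creating drafts are low-stakes and reversible.
    if action.tool != "send_queue":
        return Tier.SILENT
    # High-stakes thresholds taken from the worked answer.
    if (action.recipients > 50
            or action.flagged_sensitive
            or action.attachment_bytes > 1_000_000):  # assumption: 1MB = 10^6 bytes
        return Tier.APPROVE
    # Assumption: every other queued send gets the notify-plus-undo treatment.
    return Tier.NOTIFY_UNDO
```

Putting the scope check ahead of the tier logic reflects the ordering in the answer: an action outside the grant never reaches a confirmation prompt at all.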

weak

"I'd add safety filters and content moderation, require the agent to ask for confirmation before taking actions, and have a way to shut it down." This fails for three reasons: it treats all actions as equivalent (confirmation before every action is confirmation fatigue); it names no scope mechanism (what tools can the agent access, and who decides?); and "a way to shut it down" is not a kill switch design. It says nothing about in-flight tasks, who has authority to trigger it, or what state the system is in after it fires. Interviewers probe immediately with "what if the user isn't online?" and "what's your threshold for requiring confirmation vs. proceeding silently?" The vague answer has no response to either. The deeper failure: treating guardrails as safety compliance rather than product design.

Why this question is hard in 2026

In 2026, feasibility is free. An agent can technically be given access to your calendar, inbox, Slack, Jira, and payment systems simultaneously. The PM’s job is not figuring out what’s possible; it’s deciding what the agent should be allowed to do, for whom, under what conditions, and with what escape valves.

The stakes are not hypothetical. Air Canada was held liable after its chatbot fabricated a bereavement discount policy. An Alibaba agent with a vague optimization goal autonomously hijacked GPU resources for crypto mining. A scheduling agent told to “optimize meeting schedules” began canceling 1:1s and social hours because the goal was underspecified. McKinsey research finds 80% of organizations have already encountered risky agent behaviors. Gartner predicts AI-related legal claims will exceed 2,000 by the end of 2026 due to insufficient risk guardrails.

The candidate who clears the bar frames guardrails as a trust and UX problem first. A guardrail that’s technically correct but produces so much friction that users disable it (or rubber-stamp it) is a design failure, not a safety win.

The follow-up questions interviewers use to probe

  • “What if the user isn’t online when the agent fires?” Tests whether you’ve designed for async confirmation, not just inline approval.
  • “Who holds the kill switch in an enterprise deployment?” Tests operator vs. user vs. platform authority. The right answer has all three with different triggers.
  • “What happens to a half-completed task when the kill switch fires?” Tests whether you’ve specified in-flight task state, not just the stop condition.
  • “How do you decide when to expand the agent’s autonomy?” Tests the ladder of autonomy pattern. Name specific measurable signals.
  • “What’s your threshold for requiring confirmation vs. proceeding silently?” Tests confirmation fatigue calibration. If you say “anything risky,” you haven’t answered the question.
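To answer the kill-switch probes concretely, it can help to picture the three states as a small state machine. The sketch below is illustrative only: `Agent`, `Task`, `Mode`, and the method names are hypothetical, and it reports in-flight tasks without attempting to roll back half-completed work.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    RUNNING = "running"
    SOFT_PAUSE = "soft_pause"  # read-only, in-flight tasks surfaced for review
    HARD_STOP = "hard_stop"    # terminated, no further state written


@dataclass
class Task:
    name: str
    integration: str
    done: bool = False


@dataclass
class Agent:
    tasks: list[Task] = field(default_factory=list)
    mode: Mode = Mode.RUNNING
    blocked: set[str] = field(default_factory=set)  # integrations under a scoped block

    def can_write(self, integration: str) -> bool:
        # Writes are allowed only while running and only on unblocked integrations.
        return self.mode is Mode.RUNNING and integration not in self.blocked

    def scoped_block(self, integration: str) -> None:
        # Disable one integration; the rest of the agent keeps working.
        self.blocked.add(integration)

    def soft_pause(self) -> list[Task]:
        # Freeze all writes and return every in-flight task for user review.
        self.mode = Mode.SOFT_PAUSE
        return [t for t in self.tasks if not t.done]

    def hard_stop(self) -> dict[str, list[str]]:
        # Terminate and summarize what completed and what was abandoned.
        self.mode = Mode.HARD_STOP
        return {
            "completed": [t.name for t in self.tasks if t.done],
            "abandoned": [t.name for t in self.tasks if not t.done],
        }
```

The point the sketch makes is that each state has a defined answer for in-flight work: a scoped block leaves it running elsewhere, a soft pause surfaces it, and a hard stop reports it as abandoned.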

What distinguishes a senior answer

A junior answer lists guardrail types. A senior answer names the failure modes of the guardrails themselves: confirmation fatigue, scope that’s too narrow to be useful, kill switches that don’t handle in-flight state, and async agents that have no escalation path when the user is offline.
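The async escalation path can be illustrated with a small decision function. This is a sketch under stated assumptions: `resolve`, `Outcome`, and the response strings are hypothetical, timestamps are plain seconds, and the 15-minute window is the figure from the worked answer. The key property is that a timeout never executes the action.

```python
from enum import Enum
from typing import Optional

APPROVAL_WINDOW_SECONDS = 15 * 60  # 15-minute window from the worked answer


class Outcome(Enum):
    EXECUTE = "execute"  # user approved
    REJECT = "reject"    # user declined
    WAIT = "wait"        # window still open, keep waiting for the push response
    HOLD = "hold"        # window expired with no response, park for review


def resolve(requested_at: float, now: float, response: Optional[str] = None) -> Outcome:
    """Decide what happens to an action awaiting async approval."""
    # An explicit answer always wins, even if it arrives after the window
    # (assumption: a held action can still be approved on later review).
    if response == "approve":
        return Outcome.EXECUTE
    if response == "reject":
        return Outcome.REJECT
    # Silence is never consent: an expired window holds the action.
    if now - requested_at >= APPROVAL_WINDOW_SECONDS:
        return Outcome.HOLD
    return Outcome.WAIT
```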

Senior candidates also have a specific opinion on autonomy expansion. Not “I’d monitor over time.” That’s vague. Something like: if the override rate drops below 2% over 500 agent actions and the error rate is under 0.5%, the agent has earned expanded tool access. They can say what signals they’d watch.
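A promotion rule like the one above is simple enough to write down. The sketch below is one possible encoding, assuming the three rungs of the ladder of autonomy from the worked answer; `LADDER`, `next_rung`, and the parameter names are hypothetical, and the thresholds are the example figures quoted above, not recommended values.

```python
LADDER = [
    "suggest_only",
    "partial_step_plus_approval",
    "full_autonomy_within_guardrails",
]


def next_rung(
    current: str,
    actions: int,
    overrides: int,
    errors: int,
    min_actions: int = 500,
    max_override_rate: float = 0.02,  # override rate must be below 2%
    max_error_rate: float = 0.005,    # error rate must be under 0.5%
) -> str:
    """Return the rung the agent should be on after this review window."""
    # Not enough evidence yet: stay put.
    if actions < min_actions:
        return current
    # Either signal at or above its threshold blocks promotion.
    if overrides / actions >= max_override_rate:
        return current
    if errors / actions >= max_error_rate:
        return current
    # Move up one rung at most, never past the top of the ladder.
    index = LADDER.index(current)
    return LADDER[min(index + 1, len(LADDER) - 1)]
```

For example, 500 actions with 9 overrides (1.8%) and 2 errors (0.4%) clears both thresholds and moves the agent up one rung; 12 overrides (2.4%) does not.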

The viable/lovable tension is the frame the best answers carry throughout. An agent locked down so tightly it cannot act is not viable. An agent that acts without friction is not lovable when it makes a mistake. Guardrail design is where those two pressures meet, and it’s the core PM value-add when feasibility is no longer the constraint.

Gartner predicts 40% of CIOs will demand guardian agents (agents that monitor other agents) by 2028. Naming that tells the interviewer you understand where the enterprise market is heading, not just the current product problem.

For the cheat-sheet version of each guardrail layer, see agent guardrails cheat sheet. For the patterns that make agents obnoxious rather than useful, see obnoxious AI antipatterns. For the adjacent question on evaluating whether an agent is safe to ship, see design an eval for a support agent.