system design · hard

Design a rate limiter: PM interview answer

Design a rate limiter.

Updated Jun 2026 · Calibrated to the strong-hire bar

Most candidates fail this question not because they pick the wrong algorithm, but because they never leave algorithm land. A rate limiter is a policy decision expressed in code. The PM answer names who is being limited, why, what happens to them when they hit the ceiling, and how the ceiling connects to the business model. Everything else is execution detail.

Structure a strong answer

strong

"Before I touch any algorithm, I want to clarify the purpose and the identity model. Rate limiters serve three goals that are often in tension: protecting infrastructure from overload, enforcing fair use across customers, and monetizing access tiers. Which of these is the primary driver here? The answer changes everything downstream."

On identity: the unit being limited is a product decision. Limiting by IP is easy to implement but trivially gamed with a VPN and is wrong for multi-tenant SaaS where thousands of users share one corporate IP. Limiting by API key is accurate but can hurt legitimate team usage if the key is shared. Limiting by org or account lets you honor the contract: Stripe's 100 requests per second limit is per account key, not per IP, because that is the level of the commercial relationship. Clarify this before drawing any boxes.

On algorithms, the two that matter in production are token bucket and sliding window counter. Token bucket allows bursts up to a configured capacity, then refills at a steady rate. It is the right choice when your customers do legitimately bursty work, which in 2026 means agents and automated pipelines. Sliding window counter is a weighted hybrid of two fixed windows: it estimates usage as the current window's count plus the previous window's count, scaled by how much of the previous window still overlaps the sliding interval. That closes most of the fixed-window boundary exploit, where a customer can send nearly double their limit by straddling a window reset, at minimal memory cost. It is the right choice when fairness is the primary goal. Pick one and defend it for the stated context. Do not recite all four algorithms in textbook order; that signals you memorized a blog post, not that you have judgment.
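To make the two mechanisms concrete, here is a minimal single-process sketch of both. The class names, the injectable `now` clock, and the parameters are illustrative assumptions, not a production design; a real deployment would keep this state in a shared store.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, then refills at `refill_per_sec`."""

    def __init__(self, capacity, refill_per_sec, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.now = now  # injectable clock (assumption, for testability)
        self.tokens = float(capacity)
        self.last = now()

    def allow(self, cost=1):
        t = self.now()
        # Credit tokens earned since the last call, capped at capacity.
        earned = (t - self.last) * self.refill_per_sec
        self.tokens = min(self.capacity, self.tokens + earned)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class SlidingWindowCounter:
    """Weights the previous fixed window by how much of it still overlaps."""

    def __init__(self, limit, window_sec, now=time.monotonic):
        self.limit = limit
        self.window_sec = window_sec
        self.now = now
        self.window_index = None
        self.current = 0
        self.previous = 0

    def allow(self):
        t = self.now()
        index = int(t // self.window_sec)
        if self.window_index is None:
            self.window_index = index
        elif index != self.window_index:
            # Roll over. If more than one window passed, the old count is stale.
            self.previous = self.current if index == self.window_index + 1 else 0
            self.current = 0
            self.window_index = index
        elapsed_fraction = (t % self.window_sec) / self.window_sec
        estimated = self.previous * (1 - elapsed_fraction) + self.current
        if estimated < self.limit:
            self.current += 1
            return True
        return False
```

With a limit of 10 per 60 seconds, a client that spends all 10 requests at second 59 gets one more request through the sliding window counter at second 61, where a plain fixed window would hand it a fresh 10. That difference is the fairness argument in code.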

The core engineering problem a PM must be able to narrate: a counter shared across API servers is only exact if every request makes a round-trip to a shared store such as Redis, and that round-trip adds latency. The mitigation is a local counter per server that syncs to Redis periodically. This buys lower latency at the cost of temporary overage: between syncs, each server admits requests without seeing what the others have admitted. That tolerance is a product decision: how much overage can the business absorb before it becomes a fairness or infrastructure problem? Name a number. "We accept up to 5% overage for up to 200ms" is a product stance. "We'll figure out the sync interval later" is not.
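The overage tradeoff is easier to narrate with a toy model. The sketch below is an assumption-laden illustration: `SharedStore` is a hypothetical in-memory stand-in for Redis, `LocalSyncLimiter` is an invented name, and window resets are omitted to keep the sync behavior visible.

```python
import time


class SharedStore:
    """Hypothetical stand-in for Redis: one global count per key."""

    def __init__(self):
        self.counts = {}

    def add_and_get(self, key, delta):
        self.counts[key] = self.counts.get(key, 0) + delta
        return self.counts[key]


class LocalSyncLimiter:
    """Counts locally and reports to the shared store every sync interval."""

    def __init__(self, store, limit, sync_interval_sec=0.2, now=time.monotonic):
        self.store = store
        self.limit = limit
        self.sync_interval_sec = sync_interval_sec
        self.now = now
        self.pending = {}      # admitted locally, not yet reported
        self.global_seen = {}  # last global total read from the store
        self.last_sync = {}

    def allow(self, key):
        t = self.now()
        due = key not in self.last_sync or (
            t - self.last_sync[key] >= self.sync_interval_sec
        )
        if due:
            self._sync(key, t)
        if self.global_seen[key] + self.pending.get(key, 0) >= self.limit:
            return False
        self.pending[key] = self.pending.get(key, 0) + 1
        return True

    def _sync(self, key, t):
        # Push local admissions and read back the global total in one call.
        delta = self.pending.get(key, 0)
        self.global_seen[key] = self.store.add_and_get(key, delta)
        self.pending[key] = 0
        self.last_sync[key] = t
```

Two servers sharing a limit of 10 can each admit 10 requests before their first sync, so the account briefly gets 20. The worst case grows with the number of servers and the length of the sync interval, which is why the interval is a product number and not just a tuning knob.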

The customer experience layer is where PM answers separate from SWE answers. When a user hits the limit, three responses are possible. A hard 429 with a Retry-After header is table stakes and the minimum bar. A proactive warning at 80% of the limit is better product design: it treats the developer as a partner, not a rule violator, and gives them time to adjust before the wall. A soft queue (accept the request, execute when capacity frees) works for async workloads and improves perceived reliability, but it masks overload signals and creates a growing debt that can cascade. The right answer depends on the product context; name the tradeoff explicitly.
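As a rough sketch of the first two responses, the function below builds either a 429 with a Retry-After header or a success response that carries a warning once usage crosses 80%. `Retry-After` is a standard HTTP header; the `X-RateLimit-*` names, the body fields, and the function itself are illustrative assumptions rather than any specific vendor's contract.

```python
import json
import math


def limit_response(limit_name, limit, used_before, reset_in_sec, warn_percent=80):
    """Return (status, headers, body) for one request.

    used_before is the number of requests already counted in this window.
    """
    retry_after = max(1, math.ceil(reset_in_sec))
    if used_before >= limit:
        body = {
            "error": "rate_limited",
            "limit_name": limit_name,  # which limit was hit
            "limit": limit,
            "retry_after_seconds": retry_after,  # when it resets
            "message": (
                f"You have reached the {limit_name} limit of {limit}. "
                f"Retry in {retry_after}s."
            ),
        }
        return 429, {"Retry-After": str(retry_after)}, json.dumps(body)

    used_now = used_before + 1
    headers = {
        # Hypothetical header names, assumed for illustration.
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(limit - used_now),
    }
    # Integer comparison avoids float rounding at the warning threshold.
    if used_now * 100 >= warn_percent * limit:
        headers["X-RateLimit-Warning"] = f"{limit_name} at {used_now}/{limit}"
    return 200, headers, ""
```

The design choice worth defending in the interview is that the rejected caller learns which limit fired and when to come back, without having to open a support ticket.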

"The 429 response is a developer experience touchpoint with churn implications. The error message copy, the Retry-After precision, and whether the response body names which limit was hit and when it resets are all PM-level decisions that affect how the API is perceived."

On monetization: limits are a pricing mechanism. OpenAI's tier system explicitly links higher token-per-minute and request-per-minute limits to spend history. Anthropic's Tier 1 starts at 40k tokens per minute for Claude Sonnet. Twilio uses per-endpoint limits that vary by product line. In each case, the limit values are not engineering parameters; they are product decisions about what use cases each tier can support and at what cost to the business.
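One way to picture limits as a pricing artifact is a tier table that pairs a request budget with a token budget. The sketch below is hypothetical throughout: the tier names and numbers are invented for illustration and do not reflect any vendor's actual tiers, and the per-minute reset is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    requests_per_min: int
    tokens_per_min: int


# Hypothetical tiers and values, chosen only to show the shape of the policy.
TIERS = {
    "tier_1": Tier(requests_per_min=60, tokens_per_min=40_000),
    "tier_2": Tier(requests_per_min=600, tokens_per_min=400_000),
}


@dataclass
class MinuteUsage:
    requests: int = 0
    tokens: int = 0


def admit(usage, tier_name, estimated_tokens):
    """Return (allowed, limit_hit). Charges usage only when allowed."""
    tier = TIERS[tier_name]
    if usage.requests + 1 > tier.requests_per_min:
        return False, "requests_per_min"
    if usage.tokens + estimated_tokens > tier.tokens_per_min:
        return False, "tokens_per_min"
    usage.requests += 1
    usage.tokens += estimated_tokens
    return True, None
```

Under these assumed numbers, a single 100,000-token request is rejected on the lower tier even though it is only one request, while 500 small calls pass on the higher tier. Changing what a tier can do means editing the table, which is why the values belong to the PM.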

"2026 AI-specific angle: if the product is an LLM API, the rate limiting unit is tokens, not requests. A single request can consume 100k tokens. Requests-per-minute limits are almost meaningless here; you need tokens-per-minute budgets alongside them, which is what OpenAI and Anthropic both do. The second AI-specific problem is agentic traffic. An autonomous coding agent can legitimately generate 500 API calls in 90 seconds during a build loop. A naive rate limiter flags this as abuse. PM judgment is required to separate burst-legitimate (agent doing real work) from burst-malicious (scraping or DoS). The practical answer is separate limit tiers for human interactive traffic vs. agent/automated traffic, with a way for customers to declare their integration type."

"I would close with the calibration question: how do I know the limits are set correctly? Watch 429 rate by customer segment, not globally. A spike in 429s concentrated in one tier signals the limit is wrong, not that the customer is abusing. Also track the 80th-percentile headroom: if your best customers regularly hit 70% or more of their limit, that limit is a retention risk, not a safety buffer."

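The calibration check described above can be sketched in a few lines. The event shape, function names, and the 70% threshold parameter are assumptions made for illustration; real data would come from request logs or a metrics store.

```python
from collections import defaultdict


def rate_429_by_segment(events):
    """events: iterable of (segment, status_code). Returns {segment: 429 share}."""
    totals = defaultdict(int)
    limited = defaultdict(int)
    for segment, status in events:
        totals[segment] += 1
        if status == 429:
            limited[segment] += 1
    return {segment: limited[segment] / totals[segment] for segment in totals}


def customers_near_limit(peak_usage, limit, threshold_percent=70):
    """peak_usage: {customer: peak requests in a window}.

    Returns customers whose peak is at or above threshold_percent of the limit.
    """
    return sorted(
        customer
        for customer, peak in peak_usage.items()
        if peak * 100 >= threshold_percent * limit
    )
```

A global 429 rate would average the segments together and hide a tier whose limit is set too low, which is the point of splitting by segment before reading the number.
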
weak

Reciting the four canonical algorithms (fixed window, sliding window log, sliding window counter, token bucket) in textbook order, describing the mechanics of each, picking one without clear reasoning for the stated context, then drawing a diagram with Redis in the middle. No clarifying questions about who is being limited or why. No discussion of what happens when a user hits the limit. No connection to pricing tiers, customer experience, or the business model. No mention of what the right limit values are or how you would know they are calibrated correctly. The interviewer hears "this person memorized ByteByteGo." The candidate fails not because they got the algorithm wrong but because they never showed product judgment: the ability to ask who is being limited, for what reason, and what a good outcome looks like for the user and the business.

The PM judgment

The shift that happened in 2026 is relevant here. Rate limiting used to be purely defensive infrastructure: keep the servers up, stop the abusers. Now it is a product surface. LLM APIs price on tokens; agents generate non-human traffic patterns that break smooth-curve assumptions; the 429 error is a developer experience moment with measurable churn implications. Feasibility (the engineering mechanism) is a solved problem. Viability (are the limits set correctly for the business?) and lovability (does the degraded UX treat developers as partners?) are still genuinely hard and distinctly PM work.

The question tests whether you can own the policy layer on top of the mechanism: what the limits should be, for which customer segments, with what response behavior, and how those limits evolve as usage patterns change. See how to price an AI product and the API PM interview guide for the adjacent judgment calls this question connects to.