Latency vs throughput (definition)

Latency and throughput are in tension, and the interview question is not “do you know the definitions?” It is “which one matters for this product, and what are you giving up?” A candidate who treats them as two independent dials has not understood the constraint.

Latency: time for one request to complete, in milliseconds. Throughput: requests handled per second (or tokens per second for AI inference). The relationship: batching requests improves throughput by filling the pipeline, but each individual request waits for the batch to assemble. A queue does the same. You buy throughput at the cost of per-request latency.

The decision rule

Ask what the SLO says before choosing an approach. That question is the interview move that separates candidates who understand the trade-off from those who memorized it.

Latency-bound: user-facing, synchronous. The user waits on every request. Amazon measured ~1% lost revenue per 100ms of added latency; Google found 100-400ms added to search reduced queries per user by 0.2-0.6%. When the SLO is “p99 under 100ms,” throughput is secondary.
Throughput-bound: async pipelines, batch jobs, log ingestion. No user waits on any individual event. When the SLO is “1B events per day,” per-event latency is irrelevant.

Correct interview opener: “I want to clarify the SLO first. A p99 under 50ms target tells me this is latency-bound on reads. A batch job measured in events-per-day is throughput-bound.”

The URL shortener worked example

Read-to-write ratio is roughly 100:1. The redirect is synchronous and user-facing; slow redirects damage conversion for the destination. The read path is latency-bound.

Redis cache in front of the DB. At 95%+ hit rate, redirect latency drops under 10ms. Cache-aside keeps the implementation simple.
Sharded counters for ID generation handle high write throughput, but lock acquisition adds 5-10ms per write. Acceptable: writes are rare, and the read path is unaffected.
Click analytics on an async queue, not on the redirect path. This is the right place to accept higher per-event latency.

A weak answer names caching and horizontal scaling without explaining the trade-off each introduces or why this system is specifically read-latency-bound.

Percentile framing

p50 is not what drives churn. p99 is the worst-case for 1 in 100 requests; at scale that is significant. When asked for a latency target, respond with a percentile: “p99 under 50ms, p50 around 5-10ms on a warm cache.” Every pipeline’s throughput is bounded by its slowest stage, which is why reducing latency at the bottleneck raises overall throughput.

The 2026 angle: LLM inference

LLM inference adds a second axis. TTFT (time-to-first-token) is the latency equivalent: users perceive delays above 300-500ms in chat as slow. Tokens per second is the throughput equivalent. Increasing batch size boosts GPU utilization and tokens/sec but raises TTFT because each request waits for the batch to fill. Disaggregated inference, separating prefill (TTFT-sensitive) from decode (throughput-sensitive) onto different hardware, is the 2026 solution from NVIDIA GTC.

PMs do not tune GPU configs. They set the SLO that drives the configuration. “Chat needs TTFT under 300ms, so we run smaller batches; background report generation batches aggressively since no user is waiting” is a viable product decision. At the API layer, feasibility is not the question. Cost and latency shape what is lovable.

Strong vs. weak answers

strong

"First I'd clarify the SLO. A URL shortener redirect target of p99 under 50ms tells me the system is latency-bound on reads. Read-to-write ratio is ~100:1, so I'd anchor the design there. Redis cache at 95%+ hit rate puts redirect latency under 10ms. Sharded counters add 5-10ms write latency in exchange for higher write throughput. Acceptable: writes are rare. Click analytics go on an async queue so they never touch the redirect path. For an LLM-powered feature I'd reframe: TTFT under 300ms for chat, batch aggressively for background enrichment since those are never user-blocking."

weak

"Latency is how fast something responds; throughput is how much it can handle. I'd add caching and horizontal scaling." Definitions are correct, everything else is missing: no SLO, no percentile, no reason this system is read-latency-bound, no cost for any technique. "Caching" with no discussion of hit rate or staleness is vocabulary, not a design decision.

For the latency definition standalone, see latency. For throughput as a term, see throughput. For how SLOs connect to product decisions, see metrics.

The decision rule

The URL shortener worked example

Percentile framing

The 2026 angle: LLM inference

Strong vs. weak answers

Related