system design · hard
"Design Uber's system" (system design PM interview)
Design Uber's ride-matching system.
This question is not an architecture quiz. When an interviewer at Uber, Lyft, or a comparable marketplace company asks a PM to design Uber’s system, they are testing whether you can identify which technical constraints generate the product decisions that determine whether Uber wins or loses a market. Candidates who list Kafka, Redis, and PostgreSQL without ever connecting a component to a user outcome or a marketplace metric fail. So do candidates who ignore the two-sided nature of what they’re designing: every optimization that helps riders has a consequence for drivers, and a system that ignores that tension doesn’t survive at scale.
Scope before designing
State your assumptions first. Are you designing for one city or globally? Real-time rides only, or does this include Eats and Freight (which have different dispatch models)? For this answer, assume: one mature metro, real-time rides, approximately 50,000 active drivers during peak hours.
That number grounds everything. At 50,000 active drivers each sending a GPS ping every 4 seconds, you’re processing roughly 12,500 location writes per second per city. The active driver set for a single metro fits in about 5MB in memory. This is not a big-data problem in terms of storage: the challenge is latency and freshness, not volume. If you describe this as a “massive scale storage problem,” the interviewer will know you’re pattern-matching to a generic system design blog rather than reasoning from the actual constraints.
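The scoping arithmetic can be checked in a few lines. This is a back-of-envelope sketch: the driver count and ping interval come from the stated scope, while the 100 bytes per driver record is an assumption inferred from the 5MB figure, not a published number.

```python
# Back-of-envelope check of the scoping numbers.
ACTIVE_DRIVERS = 50_000    # from the stated scope
PING_INTERVAL_S = 4        # from the stated scope
BYTES_PER_DRIVER = 100     # ASSUMPTION: id, lat/lng, status, timestamp, overhead

writes_per_second = ACTIVE_DRIVERS / PING_INTERVAL_S
working_set_mb = ACTIVE_DRIVERS * BYTES_PER_DRIVER / 1_000_000

print(f"{writes_per_second:,.0f} location writes per second")
print(f"{working_set_mb:.0f} MB active driver set")
```

Doubling the per-record size still leaves the working set at 10 MB, which is why the constraint is freshness rather than storage.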
The five system components (and the PM-owned decision at each)
Name these in order. For each one, state the product decision you own. That is what separates a PM answer from an SWE answer.
1. Location ingestion. Drivers send GPS updates every 4 seconds. The PM decision is the heartbeat threshold: if a driver goes silent, how many seconds before the system marks them unavailable? Set it too short and you create false supply shortages, raising ETAs for everyone. Set it too long and riders get matched to offline drivers, triggering cancellations and hurting trust metrics. This is not an engineering constant: it is a product decision with a measurable outcome in refund rate and match failure rate.
2. Geospatial index. Uber uses H3, its own hexagonal geospatial indexing library (open-sourced in 2018), not generic geohashing. Interviewers from Uber’s engineering team will notice the difference. H3’s hexagonal cells tile more uniformly than rectangular grids: a hexagon’s six neighbors all sit at roughly the same distance from its center, whereas a square cell has edge neighbors and farther diagonal neighbors, which distorts radius searches near cell boundaries. The PM decision here is search radius: a tighter radius improves ETA quality but reduces the candidate driver pool in low-density areas, hurting coverage. You tune radius dynamically based on local supply density, and you instrument it by tracking match rate and ETA accuracy by H3 cell.
3. Dispatch engine. The matching algorithm selects from candidate drivers within the search radius. Target: match a rider to an available driver within 1-2 seconds end to end, with candidate selection under 100ms. The PM decision is the weighting function: pure proximity optimizes rider ETA but can strand drivers in low-demand zones and hurt earnings equity across the fleet. Weighting by driver rating, expected earnings per hour, and time since last trip allows the platform to balance ETA quality against driver retention. These are product tradeoffs, not engineering defaults.
4. Surge engine. Surge is modeled as fare = base_fare × surge_multiplier, where the multiplier is a function of demand rate divided by supply rate over a rolling 5-minute window, smoothed with an exponential moving average to prevent oscillation. Without the EMA, a sudden demand spike and the resulting supply response create a feedback loop: surge jumps, drivers flood the zone, supply overshoots, surge collapses, drivers leave, the cycle repeats. The PM decision is the multiplier ceiling. A high ceiling maximizes supply response but accelerates rider churn in high-surge moments. A low ceiling protects rider trust but leaves demand unmet. In some jurisdictions, surge caps are a regulatory constraint, not an engineering choice, and a PM who doesn’t know this is missing a real operational dimension.
5. Trip state machine. A trip moves through states: request, matching, driver accepted, en route to pickup, trip started, completed. Idempotency keys prevent double-charging across retries in a distributed system. A PM should know this exists and why it matters: without idempotent payment operations, payment service failures generate duplicate charges, which drive up support volume and refund rate and erode rider trust. Uber’s Schemaless storage layer (a MySQL-based document store, documented in their engineering blog) persists trip state across this machine. The PM decision is cancellation policy timing: who bears the cost of a last-minute cancel, at what window, and how that policy affects ghost driver incidents and rider conversion rate.
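To make the knobs in components 1 and 3 concrete, here is a minimal sketch of a heartbeat filter feeding a weighted dispatch choice. Everything in it is hypothetical: the `Driver` fields, the 12-second threshold, and the weights are illustrative values, not Uber's. Of the weighting signals named above, only time since last trip is modeled, alongside proximity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Driver:
    driver_id: str
    distance_km: float              # distance from driver to pickup point
    last_ping_age_s: float          # seconds since the last GPS update
    minutes_since_last_trip: float  # idle time, used as a fairness signal


# HYPOTHETICAL product knobs, not Uber's real values.
HEARTBEAT_THRESHOLD_S = 12.0  # three missed 4-second pings
W_PROXIMITY = 1.0             # cost per km of pickup distance
W_IDLE = 0.05                 # credit per idle minute


def is_available(driver: Driver, threshold_s: float = HEARTBEAT_THRESHOLD_S) -> bool:
    """A driver whose last ping is older than the threshold is treated as offline."""
    return driver.last_ping_age_s <= threshold_s


def dispatch_cost(driver: Driver, w_idle: float = W_IDLE) -> float:
    """Lower is better: distance is a cost, idle time earns a discount."""
    return W_PROXIMITY * driver.distance_km - w_idle * driver.minutes_since_last_trip


def pick_driver(candidates: list[Driver], w_idle: float = W_IDLE) -> Optional[Driver]:
    """Drop stale drivers first, then choose the lowest-cost live driver."""
    live = [d for d in candidates if is_available(d)]
    if not live:
        return None
    return min(live, key=lambda d: dispatch_cost(d, w_idle))


candidates = [
    Driver("a", distance_km=0.5, last_ping_age_s=30.0, minutes_since_last_trip=5.0),
    Driver("b", distance_km=1.0, last_ping_age_s=2.0, minutes_since_last_trip=2.0),
    Driver("c", distance_km=1.2, last_ping_age_s=3.0, minutes_since_last_trip=20.0),
]
weighted = pick_driver(candidates)             # idle time counts
proximity_only = pick_driver(candidates, 0.0)  # nearest live driver wins
print(weighted.driver_id, proximity_only.driver_id)
```

Driver "a" is closest but has gone silent, so the heartbeat threshold removes it. Between the two live drivers, the idle-time weight decides whether the nearer driver or the longer-waiting driver gets the trip, which is the ETA versus earnings-equity tradeoff in miniature.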
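The surge formula in component 4 can be sketched as follows. This is an illustration under stated assumptions: the floor of 1.0, the ceiling of 2.5, and the smoothing factor of 0.3 are hypothetical values, and each tuple stands in for the demand and supply rates of one rolling window.

```python
def raw_multiplier(demand_rate: float, supply_rate: float, ceiling: float) -> float:
    """Demand/supply ratio for one window, clamped to [1.0, ceiling]."""
    if supply_rate <= 0:
        return ceiling  # no supply at all: pin to the ceiling
    return min(max(demand_rate / supply_rate, 1.0), ceiling)


def smooth(previous: float, raw: float, alpha: float) -> float:
    """Exponential moving average: alpha near 0 reacts slowly, near 1 reacts fast."""
    return alpha * raw + (1.0 - alpha) * previous


def fare(base_fare: float, multiplier: float) -> float:
    return base_fare * multiplier


CEILING = 2.5  # HYPOTHETICAL multiplier ceiling (the PM-owned knob)
ALPHA = 0.3    # HYPOTHETICAL smoothing factor

# (demand_rate, supply_rate) per window: demand spikes, supply overshoots, repeat.
windows = [(100, 100), (300, 100), (100, 100), (300, 100)]
raw = [raw_multiplier(d, s, CEILING) for d, s in windows]

smoothed = []
current = 1.0
for r in raw:
    current = smooth(current, r, ALPHA)
    smoothed.append(current)

print("raw:     ", raw)
print("smoothed:", [round(m, 3) for m in smoothed])
print("fare on a 12.00 base:", round(fare(12.0, smoothed[-1]), 2))
```

The raw multiplier swings between 1.0 and the ceiling every window, while the smoothed series moves in smaller steps. A lower `ALPHA` damps oscillation further at the cost of responding more slowly to a genuine demand spike.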
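For component 5, a minimal sketch of the trip states and an idempotent charge may help. It is not how Schemaless or Uber's payment service works: `PaymentLedger`, the in-memory dictionary, and the key format are hypothetical stand-ins, and only the happy path is modeled (cancellation branches are left out).

```python
from enum import Enum


class TripState(Enum):
    REQUEST = "request"
    MATCHING = "matching"
    DRIVER_ACCEPTED = "driver accepted"
    EN_ROUTE_TO_PICKUP = "en route to pickup"
    TRIP_STARTED = "trip started"
    COMPLETED = "completed"


# Happy-path transitions only; cancellation branches are omitted from this sketch.
NEXT_STATE = {
    TripState.REQUEST: TripState.MATCHING,
    TripState.MATCHING: TripState.DRIVER_ACCEPTED,
    TripState.DRIVER_ACCEPTED: TripState.EN_ROUTE_TO_PICKUP,
    TripState.EN_ROUTE_TO_PICKUP: TripState.TRIP_STARTED,
    TripState.TRIP_STARTED: TripState.COMPLETED,
}


def advance(state: TripState) -> TripState:
    """Move to the next state, rejecting transitions out of a terminal state."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is terminal")
    return NEXT_STATE[state]


class PaymentLedger:
    """HYPOTHETICAL in-memory ledger keyed by idempotency key."""

    def __init__(self) -> None:
        self._charges: dict[str, float] = {}

    def charge(self, idempotency_key: str, amount: float) -> float:
        # A retry with the same key returns the original result, not a new charge.
        if idempotency_key in self._charges:
            return self._charges[idempotency_key]
        self._charges[idempotency_key] = amount
        return amount

    def total_charged(self) -> float:
        return sum(self._charges.values())


state = TripState.REQUEST
while state is not TripState.COMPLETED:
    state = advance(state)

ledger = PaymentLedger()
ledger.charge("trip-123-fare", 18.50)
ledger.charge("trip-123-fare", 18.50)  # simulated retry after a timeout
print(state.name, ledger.total_charged())
```

The retry carries the same key, so the rider is charged once. Without the key check, the second call would double the total, which is the duplicate-charge failure described above.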
The 2026 AI layer
In 2026, the core matching algorithm is solved infrastructure. Any competent team can build H3 indexing, a Redis location cache, and a Kafka pipeline. The PM question has moved to three areas where AI changes the system’s product properties.
Predictive positioning. ML models predict demand spikes by zone before they happen, using historical patterns, events, weather, and time-of-day signals. The platform can reposition drivers proactively, reducing reliance on reactive surge. The PM decision is the confidence threshold for a positioning nudge: too aggressive and you move drivers away from actual demand; too conservative and you’re back to surge-reactive behavior. There’s also a consent question: what do you owe drivers for complying with proactive repositioning? That is a product policy, not an algorithm parameter.
AV fleet dispatch. Autonomous vehicles in select metros respond to demand with no surge elasticity since there is no driver making an economic decision. AV supply is a fixed capacity allocation problem. Mixed fleets require a dispatch layer that handles two supply types with different reliability SLAs, different failure modes, and different cost curves. A system design answer that treats all supply as human-driver supply is already dated.
AI-assisted ETA confidence. Models now surface confidence intervals to riders (“2-4 min”) rather than point estimates (“3 min”). The PM decision is how to display uncertainty without increasing rider anxiety or decreasing conversion. This is a lovable question disguised as an infrastructure question.
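One way to turn model output into a displayed range is sketched below. It assumes the ETA model can produce a set of sampled arrival times in seconds, and it uses the 10th and 90th percentiles as the interval; both the sampling assumption and the percentile choice are illustrative, not a description of Uber's approach.

```python
import math
import statistics


def eta_range_minutes(samples_s: list[float], low_pct: int = 10, high_pct: int = 90) -> tuple[int, int]:
    """Round the low percentile down and the high percentile up, in whole minutes."""
    cuts = statistics.quantiles(samples_s, n=100, method="inclusive")
    low_s = cuts[low_pct - 1]    # cuts[i - 1] is the i-th percentile
    high_s = cuts[high_pct - 1]
    return math.floor(low_s / 60), math.ceil(high_s / 60)


def format_eta(samples_s: list[float]) -> str:
    """Show a range when the interval is wide, a single number when it collapses."""
    low, high = eta_range_minutes(samples_s)
    low = max(low, 1)  # avoid showing "0 min"
    high = max(high, low)
    return f"{low} min" if low == high else f"{low}-{high} min"


# HYPOTHETICAL sampled ETAs in seconds for one pickup.
uncertain = [130, 150, 170, 180, 190, 200, 210, 230]
confident = [180, 180, 180, 180, 180]

print(format_eta(uncertain))
print(format_eta(confident))
```

The percentile pair is the product knob here: a wider interval is more honest but shows riders a less decisive number, which is the conversion tradeoff the display decision has to settle.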
Structure a strong answer
strong
"Before I design anything, I want to flag how I'll use this time: I'll walk through the five system components, but at each one I'll name the product decision I own and what metric it moves. That's what a PM should bring here that an SWE wouldn't. Scope: one mature metro, real-time rides, not Eats or Freight. About 50,000 active drivers at peak, which is 12,500 location writes per second. The active driver set is about 5MB in memory. The challenge is latency and freshness, not storage. The five components: location ingestion, geospatial index using H3 not generic geohashing, dispatch engine, surge engine, and trip state machine. On location ingestion, my decision is the heartbeat threshold. Too short creates false supply shortages. Too long and riders match to offline drivers, which hits trust and refund rate. On the geospatial index, I'm naming H3 specifically because its hexagonal cells tile more uniformly than rectangular grids and avoid boundary distortion. My PM decision is search radius, tuned by market maturity: wider in markets below the density threshold where ETAs improve non-linearly, tighter in mature markets. On dispatch, the target is candidate selection under 100ms and a full match within 1-2 seconds. The weighting function is a product policy: proximity-only optimizes ETA but systematically underserves drivers in outer zones, which erodes supply. I'd weight by earnings equity alongside proximity. On surge: fare equals base fare times a multiplier from a demand-over-supply ratio, smoothed with an exponential moving average to prevent oscillation. My decisions are the multiplier ceiling, which has regulatory dimensions in some markets, and the smoothing window length. On the 2026 layer: predictive positioning shifts the question from reactive surge to proactive repositioning, with a consent and compensation design problem attached. AV-mixed fleets need separate dispatch queues because AV supply has no surge response. And AI-assisted ETAs surface confidence intervals to riders, which is a product display decision. Failure modes I own: ghost driver policy is a trust and refund rate problem, surge oscillation is a rider churn problem in high-demand moments, and cold-start in new markets is a supply subsidy problem that no system design alone fixes. The marketplace flywheel runs through driver density to ETA to rider conversion to trip volume to driver earnings back to supply. The load-bearing lever is different in a new market versus a mature one: in new, widen radius and subsidize supply; in mature, tighten matching and protect earnings equity."
weak
"I'd build a backend with Kafka for real-time events, Redis for driver locations using GEORADIUS, and a PostgreSQL database for trip state. The matching algorithm finds the nearest available driver and sends them the request. Surge pricing increases rates when demand is high."
This fails on four counts. It connects no component to a product decision or a metric. It reaches for Redis GEORADIUS instead of naming H3, which signals preparation from a generic blog rather than from Uber engineering material. It treats surge as a formula without mentioning rider churn, driver trust, regulatory caps, or oscillation risk. And it has no marketplace awareness: no driver supply retention, no two-sided dynamics, no 2026 AI or AV context, no failure modes stated as product problems. Interviewers report this as the most common PM failure in system design rounds: technically plausible, zero PM signal.
What the interviewer is actually probing
- PM signal vs. SWE signal: the interviewer wants product decisions and metrics at each system layer, not a component inventory.
- Marketplace dynamics literacy: driver incentives, rider elasticity, and geographic network effects interact. A system design answer that ignores driver economics misses the core of what makes Uber hard.
- Failure mode ownership: ghost drivers, surge oscillation, and cold-start in new markets are product failures with product responses, not just engineering bugs.
- 2026 awareness: an answer with no mention of predictive positioning, AV fleets, or AI-assisted ETA reads as pre-2024 preparation.
The Uber system design question is a marketplace strategy question using infrastructure vocabulary. A strong answer treats every system component as a product decision record. The viable question is whether the dispatch system can sustain driver supply profitably across diverse markets, including AV-mixed fleets. The lovable question is whether matching feels fast and trustworthy to both sides of the market, not just technically correct.
For the broader product sense lens that applies to marketplace decisions, see feasibility is free. For the adjacent estimation question that grounds these numbers in a real interview prompt, see estimate Uber rides per day in NYC.
Related
- "Estimate Uber rides per day in NYC" estimation
- How would you measure success for Uber Eats? execution
- "Design Twitter's feed system" (system design PM interview) system-design