1. Is this just a top-k problem? fundamental#
At first glance a leaderboard looks like a top-k problem: given a large set of scores, return the top 100 players. That's one Redis call. Done.
But a real leaderboard is a superset of top-k. The top-k question assumes you only care about the players at the very top. A game leaderboard needs to serve every player — including the one ranked #50,000,000 who wants to know their rank and see the players just above and below them.
That's a fundamentally different problem. It's the difference between a shared static window and a personalized dynamic window unique to each player.
| Top-k | Full leaderboard |
|---|---|
| Same result for everyone | Unique rank per player |
| Easy to cache aggressively | Can't cache personalized views |
| Fixed window at the top | Dynamic window anywhere in ranking |
| One O(log N + K) call | Rank is a global property of the whole ranking |
The distinction matters for scope sizing: friend leaderboards (50–500 players) and regional leaderboards (a few million) are essentially bounded top-k problems. The global leaderboard at 100M players is where the distributed systems complexity lives.
2. Clarification questions fundamental#
Before designing anything, ask questions that place each leaderboard scope into the right complexity class. The goal isn't to delay — it's to use each answer to visibly change something in your design.
Scope and scale. What leaderboard scopes are needed — global, regional, friend, guild? How many total players? This determines whether you're solving a top-k problem or a distributed rank problem.
Score semantics. Is the leaderboard tracking personal best, current score, or cumulative total? Can scores decrease? Personal best scores that only increase are dramatically simpler — most score updates become no-ops on the leaderboard.
Social graph model. Symmetric friends (mutual, both must agree) or asymmetric follows (one-sided)? Is there a hard cap on friend count? A symmetric model with a cap of 500 keeps the friend leaderboard in top-k territory. An uncapped asymmetric model creates a celebrity problem that requires a different architecture.
Staleness tolerance. Does a player need to see their exact rank immediately after a match, or is a directional signal sufficient — "you moved up ~200,000 spots"? The answer determines whether you need exact rank computation or approximate rank via histograms.
Time windows. Daily, weekly, all-time? Each window is essentially a separate leaderboard. Do windows need to reset instantaneously or can they roll over gradually?
3. Entities and APIs fundamental#
Core entities:
- Player (player_id, display_name, region_id)
- Score (player_id, personal_best, last_updated)
- FriendRelation (player_id, friend_id) — symmetric
- Leaderboard (scope, time_window, ordered list of player + score) — always a derived view, never written to directly
The score entity is the source of truth. The leaderboard is always computed from scores — never authoritative itself.
Key API endpoints:
```
POST /scores
  body: { player_id, new_score }
  → updates personal best if higher, triggers downstream updates
  → returns: { rank_before, rank_after, delta }

GET /leaderboard/{scope}/{window}?page=&size=
  scope: global | regional | friend
  window: daily | weekly | alltime
  → returns paginated top-k with ranks

GET /leaderboard/{scope}/{window}/me
  → returns requesting player's rank + nearby players (±5)

GET /leaderboard/friend/{window}
  → returns friend leaderboard for requesting player
```
4. Foundation: Redis sorted sets fundamental#
A Redis sorted set is a collection where every member has an associated score. Redis keeps members sorted by score automatically at all times. It's the right data structure for leaderboards because it gives you O(log N) writes and O(log N + K) reads for top-k, with rank queries built in.
The key intuition: every time you ZADD, Redis re-inserts the member at the correct position in an internal skiplist. You never sort manually. The tradeoff is it lives entirely in memory.
| Command | What it does | Complexity |
|---|---|---|
| ZADD key score member | Add or update a member's score. Returns 1 if new, 0 if updated. | O(log N) |
| ZINCRBY key delta member | Atomically increment a member's score. Used for cumulative scoring and guild aggregates. | O(log N) |
| ZREVRANK key member | 0-based rank, highest score = 0. Returns nil if not found. | O(log N) |
| ZRANGE key 0 99 REV WITHSCORES | Top 100 entries, highest first, with scores. | O(log N + K) |
| ZRANGEBYSCORE key min max WITHSCORES | All entries within a score range. Result count M is unpredictable. | O(log N + M) |
| ZSCORE key member | A single member's current score. | O(1) |
| ZCARD key | Total member count. | O(1) |
| ZCOUNT key min max | Count members in a score range without fetching them. Uses skiplist span counters, so it's O(log N), not O(M). | O(log N) |
| ZREM key member | Remove a member. Used when a player changes regions. | O(log N) |
Single instance limits. Redis is in-memory and executes commands on a single thread. One entry ≈ 70–100 bytes, so 100M players ≈ 10GB — fits on a large instance but creates a single point of failure. A single node handles roughly 100–150k writes/sec unpipelined and 500k–1M pipelined. Beyond that, you need replicas for read scale and sharding for write scale.
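The commands above map directly onto a thin client wrapper. A minimal sketch using the redis-py client, assuming the key name leaderboard:global:alltime; the method names mirror the Redis commands one-to-one:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
KEY = "leaderboard:global:alltime"  # assumed key name for this sketch

def submit_score(player_id: str, score: float) -> None:
    # ZADD: O(log N) insert or update into the skiplist-backed sorted set.
    r.zadd(KEY, {player_id: score})

def top_k(k: int = 100) -> list[tuple[str, float]]:
    # ZRANGE ... REV WITHSCORES: O(log N + K), highest scores first.
    return r.zrange(KEY, 0, k - 1, desc=True, withscores=True)

def rank_of(player_id: str) -> int | None:
    # ZREVRANK: 0-based rank where the highest score is rank 0.
    rank = r.zrevrank(KEY, player_id)
    return None if rank is None else rank + 1  # 1-based rank for display
```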
5. Leaderboard scopes fundamental#
Each leaderboard scope is a separate sorted set. The write path stays clean — one Kafka event per score update, different consumers handle different scopes independently.
Global leaderboard. One sorted set covering all players. At 100M players this is where the distributed systems complexity lives — sharding, approximate rank, hot key problems.
Regional leaderboard. One sorted set per region. Scoped top-k. If a player changes region, atomically ZREM from old and ZADD to new. At 10 regions × 10M players each, a single sorted set per region is usually sufficient without sharding.
Guild leaderboard. Two variants with different complexity. Individual ranking within a guild (same as friend leaderboard, bounded by guild size) is simple. Guild aggregate ranking — guilds ranked by sum of member scores — is harder: every individual score update triggers a ZINCRBY on the guild aggregate sorted set. This is a read-modify-write that Redis handles atomically.
Time windows. Each window is a separate sorted set. Daily and weekly keys carry a TTL aligned to the window boundary. At reset, the old key expires and a new one starts accumulating. Before expiry, a background job can snapshot the top-k to cold storage for historical records.
```
leaderboard:global:alltime
leaderboard:global:weekly          ← TTL ~8 days
leaderboard:global:daily           ← TTL ~25 hours
leaderboard:region:{id}:alltime
leaderboard:guild:{id}:members
leaderboard:guild:{id}:aggregate
user:{id}:friend_leaderboard       ← one per user
```
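To make the key scheme concrete, here is a sketch of everything one score update could touch, assuming personal-best semantics, a placeholder seconds_until() helper for window TTLs, and a single leaderboard:guilds:by_total set (an assumption in this sketch) for ranking guilds by aggregate score:

```python
import redis

r = redis.Redis(decode_responses=True)

def seconds_until(window: str) -> int:
    # Placeholder: a real implementation computes the time left until the
    # daily/weekly boundary plus a small grace period.
    return 25 * 3600 if window == "daily" else 8 * 24 * 3600

def update_scopes(player_id: str, region_id: str, guild_id: str,
                  new_score: float, delta: float) -> None:
    # delta = new_score - previous personal best, used for the guild aggregate.
    pipe = r.pipeline()
    # All-time keys carry no TTL.
    pipe.zadd("leaderboard:global:alltime", {player_id: new_score})
    pipe.zadd(f"leaderboard:region:{region_id}:alltime", {player_id: new_score})
    # Windowed keys expire shortly after the window closes.
    for window in ("daily", "weekly"):
        key = f"leaderboard:global:{window}"
        pipe.zadd(key, {player_id: new_score})
        pipe.expire(key, seconds_until(window))
    # Guild member ranking (bounded by guild size).
    pipe.zadd(f"leaderboard:guild:{guild_id}:members", {player_id: new_score})
    # Guild aggregate: guilds ranked by sum of member scores (assumed key name).
    pipe.zincrby("leaderboard:guilds:by_total", delta, guild_id)
    pipe.execute()
```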
6. Friend leaderboard and fan-out advanced#
The friend leaderboard is a personalized view — each player has their own ranked list of friends. The social graph determines which scores to include, but the scores themselves come from the global score store.
Fan-out on write. When a player's score updates, push the new score into each of their friends' cached sorted sets. Reads are O(1) — just serve the pre-built cache. Cost: write amplification proportional to friend count.
Fan-out on read. When a player opens their friend leaderboard, fetch all friends' current scores live. Always accurate, no cache to maintain. Cost: read is O(friends) live lookups. Friend add/remove is a non-event — next read reflects current state automatically.
The celebrity problem. The fan-out decision is made from the writer's perspective based on their friend count. A user with 50 friends fans out to 50 caches — cheap. A celebrity with 10,000 friends would fan out to 10,000 caches on every score update — prohibitive. The hybrid solution:
- Normal users (below a threshold, e.g. 500 friends) → fan-out on write. Each friend's cached sorted set gets a ZADD on every score update.
- Celebrity users (above the threshold) → skip fan-out. Their score stays only in the global sorted set, fetched live at read time.
Read path for friend leaderboard. Fetch the friend list from cache (not the graph DB directly — that's the source of truth but not queried on every read). Normal friends' scores come from the per-user cached sorted set. Celebrity friends' scores are batch-fetched via ZSCORE from the global sorted set. Merge and sort in the application layer. Return to client.
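A sketch of that read path, assuming the key names from section 5 and two hypothetical helpers, get_friend_ids() (the friend-list cache) and is_celebrity() (the fan-out-threshold check):

```python
import redis

r = redis.Redis(decode_responses=True)
GLOBAL_KEY = "leaderboard:global:alltime"

def get_friend_ids(player_id: str) -> list[str]:
    # Hypothetical: served from the friend-list cache, not the graph DB.
    return []

def is_celebrity(player_id: str) -> bool:
    # Hypothetical: friend count above the fan-out threshold.
    return False

def friend_leaderboard(player_id: str) -> list[tuple[str, float]]:
    friends = get_friend_ids(player_id)
    celebrities = [f for f in friends if is_celebrity(f)]

    # Normal friends: pre-built per-user cache maintained by fan-out on write.
    entries = r.zrange(f"user:{player_id}:friend_leaderboard", 0, -1,
                       desc=True, withscores=True)

    # Celebrity friends: batch-fetch live from the global sorted set.
    if celebrities:
        pipe = r.pipeline()
        for f in celebrities:
            pipe.zscore(GLOBAL_KEY, f)
        for f, score in zip(celebrities, pipe.execute()):
            if score is not None:
                entries.append((f, score))

    # Merge and sort in the application layer; lists stay small (friend cap).
    entries.sort(key=lambda e: e[1], reverse=True)
    return entries
```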
Friend add/remove. For fan-out on write users: immediately update both users' cached sorted sets. For celebrity users: the friend list cache is invalidated — next read fetches fresh. The cached sorted set never contains celebrity entries, so there's nothing to remove there.
7. The rank problem advanced#
This is what makes a leaderboard fundamentally different from top-k. To know you're rank #4,821,033 out of 100M players, the system needs to know how many players score higher than you. ZREVRANK does this in O(log N) — which sounds fine, but at 100M players with millions of concurrent rank queries plus constant writes, a single sorted set becomes a hot key.
More importantly, rank is a global property. Once you shard the sorted set, no single node knows the global rank of any player. You need a way to compute global rank across shards efficiently.
The neighbor query compounds this. Finding the 5 players just above and below you requires knowing your exact rank offset first. ZREVRANK + ZRANGE works on a single node but breaks under sharding.
Score-range queries as the solution for neighbors. Instead of querying by rank offset, query by score proximity. Nearby players have nearby scores.
```
ZRANGEBYSCORE leaderboard 1440 1460 WITHSCORES
→ O(log N + M) where M = players in that score band
```
This is naturally shardable — a score range query almost always hits exactly one shard. The tradeoff is M is unpredictable in dense score regions. Fix this with an adaptive score window: estimate density from bucket counts (players per score point), compute δ dynamically so the window reliably returns ~10 results.
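A sketch of the adaptive window, assuming a hypothetical bucket_density() helper fed by the histogram described in the next section:

```python
import redis

r = redis.Redis(decode_responses=True)
KEY = "leaderboard:global:alltime"

def bucket_density(score: float) -> float:
    # Hypothetical: players per score point near `score`, read from the
    # in-memory histogram (next section). Hard-coded for illustration.
    return 1000.0

def neighbors(player_id: str, want: int = 10) -> list[tuple[str, float]]:
    my_score = r.zscore(KEY, player_id)
    if my_score is None:
        return []
    # Size the half-window so the score band returns roughly `want` players:
    # wide in sparse score regions, narrow in dense ones.
    delta = max(1.0, want / (2 * bucket_density(my_score)))
    band = r.zrangebyscore(KEY, my_score - delta, my_score + delta,
                           withscores=True)
    band.sort(key=lambda e: e[1], reverse=True)  # highest score first
    # Center the result on the player and trim to ~want neighbors.
    idx = next((i for i, (m, _) in enumerate(band) if m == player_id), 0)
    lo = max(0, idx - want // 2)
    return band[lo:lo + want + 1]
```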
8. Score histogram and prefix sum advanced#
The score histogram is a lightweight approximation layer that sits in front of the expensive exact-rank query. It answers "what is my approximate rank?" in O(1) from application memory, with zero Redis calls.
Structure. Divide the score space into buckets. Each bucket stores a count of players in that score range — just an integer, no IDs. For a game with bounded discrete scores (0–10,000 in multiples of 50), you have 200 fixed buckets.
```
bucket 0-50:    500,000 players
bucket 51-100:  480,000 players
bucket 101-150: 320,000 players
...200 buckets total
```
Bucket width. Use equal-frequency bucketing — each bucket should contain roughly the same number of players. Game score distributions are uneven (many players in the middle, few at the top). Equal-width buckets give poor rank accuracy in dense regions. Equal-frequency gives uniform error everywhere.
Prefix sum. Precompute a running total over bucket counts. prefix_sum[i] = total players with score above bucket i's lower bound. Approximate rank = single array index read. O(1), served from application memory.
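A sketch of the lookup, assuming the fixed 50-point buckets from the example above; the counts would be maintained by the background job and are hard-coded here:

```python
BUCKET_WIDTH = 50    # assumed: 0–10,000 score space in 200 buckets
NUM_BUCKETS = 200

# bucket_counts[i] = players whose score falls in bucket i (illustrative values).
bucket_counts = [0] * NUM_BUCKETS
bucket_counts[0:3] = [500_000, 480_000, 320_000]

def build_prefix_sums(counts: list[int]) -> list[int]:
    # prefix[i] = players with a score above bucket i's lower bound
    # (everyone in bucket i and in all higher buckets).
    prefix = [0] * len(counts)
    running = 0
    for i in range(len(counts) - 1, -1, -1):
        running += counts[i]
        prefix[i] = running
    return prefix

prefix_sum = build_prefix_sums(bucket_counts)

def approx_rank(score: float) -> int:
    # O(1): one array read from application memory, zero Redis calls.
    bucket = min(int(score // BUCKET_WIDTH), NUM_BUCKETS - 1)
    return prefix_sum[bucket]
```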
Update path. When a player's score moves to a different bucket, apply two atomic counter updates: DECR old_bucket, INCR new_bucket. The prefix sum is recomputed by a background job every few seconds and published atomically as a versioned config to Redis. Application servers cache it locally with a short TTL.
Boundary recomputation. When bucket boundaries need to change (score distribution has shifted), don't migrate existing counts. Use ZCOUNT to recompute counts for each new boundary range directly from the sorted set — O(B × log N) total where B is the number of buckets. At 200 buckets and 100M players, that's 200 × log(100M) ≈ 5,400 operations. Then atomic-swap the new config.
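A sketch of the recount, assuming new boundaries arrive as a sorted list of bucket lower bounds and that the counts are taken against a single sorted set (per shard in the sharded layout):

```python
import redis

r = redis.Redis(decode_responses=True)
KEY = "leaderboard:global:alltime"  # or the per-shard key

def recount_buckets(lower_bounds: list[float]) -> list[int]:
    # One ZCOUNT per new bucket, O(B * log N) total; no migration of old counts.
    pipe = r.pipeline()
    for i, lower in enumerate(lower_bounds):
        if i + 1 < len(lower_bounds):
            # '(' marks an exclusive upper bound so adjacent buckets don't overlap.
            upper = f"({lower_bounds[i + 1]}"
        else:
            upper = "+inf"
        pipe.zcount(KEY, lower, upper)
    return pipe.execute()

# The background job would then atomic-swap these counts plus the new
# boundaries into the versioned histogram config.
new_counts = recount_buckets([0, 50, 100, 150])
```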
Interpolation within a bucket. If two players are in the same bucket, prefix sum gives them identical ranks. Interpolation fixes this: assume uniform distribution within the bucket and estimate position mathematically. Breaks when players cluster at specific score values (common in games — round numbers, level thresholds). Fine granularity reduces reliance on interpolation.
9. Sharding the global leaderboard advanced#
Range-based sharding by score is the natural fit for leaderboards. Each shard owns a score range and is an independent Redis sorted set on its own node.
```
shard 1: leaderboard:score:0-2500      → redis-shard-1:6379
shard 2: leaderboard:score:2501-5000   → redis-shard-2:6379
shard 3: leaderboard:score:5001-7500   → redis-shard-3:6379
shard 4: leaderboard:score:7501-10000  → redis-shard-4:6379
```
Global rank computation. Global rank = sum of ZCARD across every shard covering a higher score range + ZREVRANK within your own shard. ZCARD is O(1) per shard, and ZREVRANK is O(log(N / num_shards)) on a much smaller dataset. The prefix sum already answers the approximate case in O(1) from application memory; this path gives exact global rank cheaply when exactness is required.
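A sketch of the exact-rank path, assuming the four-shard layout above and a routing table of (lower bound, key) pairs; a single client stands in for the per-shard connections:

```python
from bisect import bisect_right
import redis

r = redis.Redis(decode_responses=True)  # one client per shard node in practice

# Routing table: (lower score bound, shard key), sorted ascending by score.
ROUTING = [
    (0,    "leaderboard:score:0-2500"),
    (2501, "leaderboard:score:2501-5000"),
    (5001, "leaderboard:score:5001-7500"),
    (7501, "leaderboard:score:7501-10000"),
]
LOWER_BOUNDS = [lo for lo, _ in ROUTING]

def shard_index(score: float) -> int:
    # Binary search over the routing table: effectively O(1) for a few shards.
    return max(0, bisect_right(LOWER_BOUNDS, score) - 1)

def exact_global_rank(player_id: str, score: float) -> int | None:
    idx = shard_index(score)
    rank_in_shard = r.zrevrank(ROUTING[idx][1], player_id)  # O(log(N / shards))
    if rank_in_shard is None:
        return None
    # Every player in a higher-score shard ranks above this player.
    higher = sum(r.zcard(key) for _, key in ROUTING[idx + 1:])  # ZCARD is O(1)
    return higher + rank_in_shard + 1  # 1-based rank
```

The shard_index() helper is the routing-table binary search described two paragraphs below.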
Shard boundary alignment. Align shard boundaries with bucket boundaries. This keeps every ZCOUNT query local to one shard, lets background jobs run independently per shard in parallel, and means the global prefix sum is just the concatenation of per-shard bucket arrays in score order.
Cross-boundary score updates. When a player's score crosses a shard boundary (e.g. 2400 → 2600): ZREM from shard 1, ZADD to shard 2. These are two operations on different nodes — wrapped in a Kafka event so both are applied eventually, without requiring distributed transactions.
Routing. The application maintains a routing table mapping score ranges to shard addresses. Routing a score to a shard is a binary search over this table — effectively O(1) for a small number of shards. The routing table is cached in application memory with a long TTL since shard boundaries rarely change.
K-neighbor under sharding. Most queries hit one shard. Edge case: player is within 5 rank positions of a shard boundary. Query the adjacent shard for the shortfall and merge in the application layer. This edge case is rare — millions of players per shard means very few are near the boundary.
10. Write path end-to-end advanced#
The write path has two layers: a synchronous Score Service that handles business logic, and asynchronous Kafka consumers that update Redis.
```
1. Client POST /scores {player_id, new_score}
2. API Gateway → auth, rate limiting
3. Score Service:
   a. HGET player:scores player_id → current personal best
   b. if new_score ≤ personal_best: return 200, done
   c. snapshot rank_before from prefix_sum[old_bucket]
   d. UPDATE PostgreSQL SET personal_best = new_score   ← source of truth first
   e. publish ScoreUpdated{player_id, old, new} to Kafka
   f. snapshot rank_after from prefix_sum[new_bucket]
   g. return {rank_before, rank_after, delta} to client
4. Kafka consumers (async):
   GlobalLeaderboardConsumer:
     ZREM shard_old player_id            (if score crossed shard boundary)
     ZADD shard_new new_score player_id
     DECR buckets:old_bucket
     INCR buckets:new_bucket
     HSET player:scores player_id new_score
   RegionalLeaderboardConsumer:
     ZADD leaderboard:region:{id} new_score player_id
   FriendLeaderboardConsumer:
     for each friend (below fan-out threshold):
       ZADD user:{friend_id}:friend_leaderboard new_score player_id
```
PostgreSQL before Kafka. Always write to the source of truth first. If the Kafka publish fails, retry — the data is already safe. With the order reversed, a successful Kafka publish followed by a failed PostgreSQL write would leave Redis reflecting a score that isn't durably stored.
Rank delta is optimistic. The delta is computed from the prefix sum before Kafka consumers finish updating Redis. It's approximate by design — prefix sum is already a periodic snapshot. The player gets an immediate response; by the time they open the full leaderboard screen, Redis is updated.
Idempotency. ZADD overwrites existing scores — running it twice with the same score is safe. Kafka consumers use at-least-once delivery, so idempotent writes are essential.
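A sketch of the GlobalLeaderboardConsumer's sorted-set update, assuming the ScoreUpdated event carries both old and new scores and a shard_key_for() stand-in for the section 9 routing lookup; ZADD and ZREM make a replayed event a no-op:

```python
import redis

r = redis.Redis(decode_responses=True)

def shard_key_for(score: float) -> str:
    # Hypothetical stand-in for the routing-table lookup from section 9.
    return "leaderboard:score:0-2500"

def handle_score_updated(event: dict) -> None:
    # event = {"player_id": ..., "old": ..., "new": ...}; shape assumed here.
    player_id, old, new = event["player_id"], event["old"], event["new"]
    old_shard, new_shard = shard_key_for(old), shard_key_for(new)

    if old_shard != new_shard:
        r.zrem(old_shard, player_id)       # replay-safe: a second ZREM is a no-op
    r.zadd(new_shard, {player_id: new})    # replay-safe: overwrites the same score
    r.hset("player:scores", player_id, new)
```

The bucket INCR/DECR from the pseudocode above are plain counters rather than overwrites, so a duplicate delivery shifts a count by one until the counts are next recomputed from the sorted set.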
11. Operational concerns advanced#
Cache warming (cold start). When a Redis shard loses data, bootstrap from PostgreSQL: batch SELECT scores within the shard's score range, ZADD them in pipelines of 1,000. Replay Kafka events from the last 24 hours to catch up any writes that arrived during recovery. Kafka retention policy (set to 24h+) is an operational decision that enables this.
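A sketch of the bootstrap loop, assuming a hypothetical fetch_scores() generator that streams (player_id, personal_best) rows from PostgreSQL for the shard's score range:

```python
from typing import Iterator
import redis

r = redis.Redis(decode_responses=True)

def fetch_scores(score_min: int, score_max: int) -> Iterator[tuple[str, float]]:
    # Hypothetical: pages through SELECT player_id, personal_best FROM scores
    # WHERE personal_best BETWEEN score_min AND score_max.
    yield from []

def warm_shard(shard_key: str, score_min: int, score_max: int,
               batch_size: int = 1000) -> None:
    pipe = r.pipeline()
    pending = 0
    for player_id, score in fetch_scores(score_min, score_max):
        pipe.zadd(shard_key, {player_id: score})
        pending += 1
        if pending == batch_size:
            pipe.execute()   # flush a pipeline of 1,000 ZADDs
            pending = 0
    if pending:
        pipe.execute()
    # Then replay the last ~24h of Kafka events to catch writes that
    # arrived during recovery (retention must cover the gap).
```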
Reconciliation. A periodic background job samples N random players from PostgreSQL, compares their personal_best against Redis ZSCORE, and corrects any divergence. This handles the rare case where Kafka publish succeeded but a consumer crash left Redis stale.
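A sketch of that job, assuming a hypothetical load_sample_from_postgres() helper and personal-best semantics (Redis should never be ahead of the source of truth, only behind):

```python
import redis

r = redis.Redis(decode_responses=True)
KEY = "leaderboard:global:alltime"   # or the appropriate shard key

def load_sample_from_postgres(n: int) -> list[tuple[str, float]]:
    # Hypothetical: N randomly sampled (player_id, personal_best) rows.
    return []

def reconcile(sample_size: int = 1000) -> int:
    repaired = 0
    for player_id, personal_best in load_sample_from_postgres(sample_size):
        cached = r.zscore(KEY, player_id)
        if cached is None or cached < personal_best:
            # Redis fell behind the source of truth; repair it.
            r.zadd(KEY, {player_id: personal_best})
            repaired += 1
    return repaired
```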
Out-of-order submissions. Network delays can cause an older score to arrive after a newer one. Solve with a conditional PostgreSQL update: UPDATE scores SET personal_best = X WHERE player_id = Y AND personal_best < X. The database only updates if the incoming score beats the current best, regardless of arrival order.
Anti-cheat. Never trust the client. The game server — not the client — should compute and submit scores. Add plausibility checks in the Score Service: reject scores above the theoretical maximum, flag statistically anomalous jumps for async review rather than synchronous rejection.
Monitoring. The north star metric is Kafka consumer lag. If lag grows, leaderboard updates are falling behind real-time. Also track: Redis memory usage per shard (detect hot shards early), ZADD latency per shard, leaderboard read p99, prefix sum recomputation lag.
Replication. Each shard: 1 primary (writes only) + 2 replicas (reads only), placed in different availability zones. Redis Sentinel or Redis Cluster handles primary election automatically. During failover, Kafka buffers writes — no score updates are lost.
12. Redis command reference fundamental#
A quick reference of every Redis command used in this design, with input/output and interview context.
| Command | Input → Output | Used for |
|---|---|---|
| ZADD key score member | score (number), member (string) → integer (1 = new, 0 = updated) | Adding or updating a player's score in any sorted set |
| ZINCRBY key delta member | delta (number), member → string (new score) | Guild aggregate scores — atomic increment |
| ZREVRANK key member | member → integer (0 = highest) or nil | Exact rank within a shard. Breaks under cross-shard global rank. |
| ZRANGE key 0 99 REV WITHSCORES | index range → array of (member, score) pairs | Top-k reads. Cache result with short TTL. |
| ZRANGEBYSCORE key min max | score range → array of (member, score); M is unpredictable | K-neighbor queries by score proximity |
| ZSCORE key member | member → string (score) or nil | Personal best check; batch-fetching celebrity scores |
| ZCARD key | (none) → integer | Shard cardinality for global rank computation. O(1). |
| ZCOUNT key min max | score range → integer (count only, no data) | Recomputing bucket counts after boundary changes. O(log N), not O(M). |
| ZREM key member | member → integer (1 = removed, 0 = not found) | Region changes, cross-shard score updates |
| INCR / DECR key | (none) → integer (new value); atomic | Bucket count updates on every score change |
| MULTI / EXEC | queued commands → array of results | Atomic batch on a single node. No rollback on partial failure. |
| EVAL script keys args | Lua script → any type | Conditional atomic operations (read-then-write). Preferred over MULTI/EXEC when logic branches. |
Note: Redis returns scores as strings, not numbers — ZSCORE returns "1450", not 1450. Your application layer must parse them.