#13. Sharding Strategies - Query Flexibility vs Scale
The One Thing to Remember
Sharding is about choosing the right partition key. A good shard key distributes data evenly and keeps related data together. A bad shard key causes hotspots or requires cross-shard queries. This decision is hard to change later.
Building on Article 12
In Article 12: CAP Theorem, you learned how distributed systems make trade-offs during partitions. But here's the question: How do you actually split your data across multiple machines?
Sharding is the point where distributed systems get real—and where I've made the most expensive mistakes. A bad shard key decision can cost months of re-sharding work.
← Previous: Article 12 - CAP Theorem
Why This Matters (A Costly Mistake)
I once watched a team re-shard 500TB of data. It took 6 months. The original shard key was wrong—it created hotspots that brought down production. The lesson? Choose your shard key carefully. You can't easily undo this decision.
This isn't academic knowledge—it's the difference between:
- Scaling successfully vs hitting walls
  - Understanding shard keys = you design for growth
  - Not understanding = you hit limits, panic, rebuild
- Fast queries vs slow queries
  - Understanding access patterns = you minimize scatter-gather
  - Not understanding = every query hits all shards, 100x slower
- Avoiding re-sharding vs months of migration
  - Understanding strategies = you plan ahead
  - Not understanding = you re-shard later, waste months
When to Shard (Don't Shard Prematurely!)
DON'T SHARD PREMATURELY:
════════════════════════
Before sharding, try:
1. Query optimization and indexing
2. Vertical scaling (bigger machine)
3. Read replicas
4. Caching layer
5. Archive old data
PostgreSQL can handle:
• Billions of rows
• Terabytes of data
• Thousands of queries/second
SHARD WHEN:
═══════════
• Single machine can't hold all data
• Write throughput exceeds single node
• Geographic distribution required
• Compliance requires data locality
I've seen teams shard when they had 10GB of data. Don't be that team. Sharding adds complexity—only do it when you must.
Sharding Strategies
1. Hash-Based Sharding
HASH-BASED SHARDING
═══════════════════
User ID: 12345
       │
       ▼
hash(12345) = 8a3f...
       │
       ▼
8a3f... mod 4 = 2
       │
       ▼
┌─────────┬─────────┬─────────┬─────────┐
│ Shard 0 │ Shard 1 │ Shard 2 │ Shard 3 │
│         │         │  ████   │         │
└─────────┴─────────┴─────────┴─────────┘
                         ▲
                    User 12345
                    goes here
FORMULA: shard_id = hash(key) % num_shards
Trade-offs:
| Pros | Cons |
|---|---|
| ✅ Even distribution | ❌ No range queries |
| ✅ Simple to implement | ❌ Re-sharding is painful |
| ✅ Works for any key type | ❌ Related data may be scattered |
Best for: User data, session data, any evenly-distributed access pattern
2. Range-Based Sharding
RANGE-BASED SHARDING
════════════════════
Timestamp: 2024-03-15
          │
          ▼
    Which range?
          │
          ▼
┌─────────────┬─────────────┬─────────────┬─────────────┐
│   Shard 0   │   Shard 1   │   Shard 2   │   Shard 3   │
│   Jan-Mar   │   Apr-Jun   │   Jul-Sep   │   Oct-Dec   │
│    2024     │    2024     │    2024     │    2024     │
│    ████     │             │             │             │
└─────────────┴─────────────┴─────────────┴─────────────┘
       ▲
March data goes here
Trade-offs:
| Pros | Cons |
|---|---|
| ✅ Efficient range queries | ❌ Hotspots on recent data |
| ✅ Easy to add new shards | ❌ Uneven distribution |
| ✅ Natural for time-series | ❌ Old shards become cold |
Best for: Time-series data, logs, events, append-heavy workloads
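To make the routing concrete, here's a minimal range-router sketch. The quarterly boundaries and four-shard layout mirror the diagram above; the function and table names are illustrative, not from any particular database.

```python
from datetime import date

# Illustrative range map: [start, end) -> shard id (matches the diagram above)
RANGES = [
    (date(2024, 1, 1), date(2024, 4, 1), 0),   # Jan-Mar
    (date(2024, 4, 1), date(2024, 7, 1), 1),   # Apr-Jun
    (date(2024, 7, 1), date(2024, 10, 1), 2),  # Jul-Sep
    (date(2024, 10, 1), date(2025, 1, 1), 3),  # Oct-Dec
]

def get_shard_for_date(ts: date) -> int:
    """Route a timestamp to the shard that owns its range."""
    for start, end, shard_id in RANGES:
        if start <= ts < end:
            return shard_id
    raise ValueError(f"No shard covers {ts}")

print(get_shard_for_date(date(2024, 3, 15)))  # -> 0, the Jan-Mar shard
```

Note how a range query like "all of March" touches exactly one shard, while the newest quarter's shard absorbs every write, which is exactly the hotspot trade-off listed in the table above.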
3. Consistent Hashing
CONSISTENT HASHING
═══════════════════
Hash ring (0 to 2^32), positions increase clockwise and wrap around:

  S4 (pos 800) → S1 (pos 1000) → S2 (pos 3000) → S3 (pos 5000) → back to S4

  Key "user123" → hash = 1200
  Walking clockwise from 1200, the first server at position ≥ 1200 is S2 (pos 3000)
Adding a server:
• Only keys between new server and predecessor move
• Other keys stay put!
Trade-offs:
| Pros | Cons |
|---|---|
| ✅ Minimal data movement on change | ❌ More complex implementation |
| ✅ Easy to add/remove nodes | ❌ Can still have some imbalance |
| ✅ Standard for caches | ❌ Virtual nodes needed for balance |
Best for: Distributed caches, load balancing, dynamic cluster membership
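A minimal hash-ring sketch, assuming MD5 for ring positions and a configurable number of virtual nodes per server (both are implementation choices made for illustration, not requirements of the technique):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions on the ring to smooth out imbalance
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        """Walk clockwise from the key's position to the next node on the ring."""
        idx = bisect.bisect(self.positions, self._hash(key)) % len(self.positions)
        return self.ring[idx][1]

ring = ConsistentHashRing(["S1", "S2", "S3", "S4"])
print(ring.get_node("user123"))  # some server; adding "S5" later only moves nearby keys
```

Adding a node inserts its virtual positions into the sorted ring; only the keys that now fall before those new positions move, which is the property the diagram describes.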
Common Mistakes (I've Made These)
Mistake #1: "Sharding by country is fine"
Why it's wrong: Countries have vastly different user counts. USA might have 50% of users, creating a massive hotspot. I've seen this bring down production.
Right approach: Use hash-based sharding on user_id, or composite key (country, user_id) if you need geographic distribution.
Mistake #2: "We'll figure out the shard key later"
Why it's wrong: Changing shard keys requires re-sharding all data. I've seen 6-month migrations because of this. The shard key affects every query you'll ever write.
Right approach: Think about access patterns BEFORE sharding. What's your most common query? That should drive your shard key.
Mistake #3: "More shards = better performance"
Why it's wrong: More shards mean more scatter-gather queries. If your queries need to hit all shards, more shards = slower queries.
Right approach: Start with fewer shards, add more as needed. Use consistent hashing to minimize data movement.
Choosing a Shard Key: The Critical Questions
1. DISTRIBUTION: Will data be evenly spread?
────────────────────────────────────────
BAD: shard_key = country (USA has 50% of users)
GOOD: shard_key = user_id (evenly distributed)
2. ACCESS PATTERNS: Can queries stay on one shard?
───────────────────────────────────────────────
BAD: shard_key = user_id, but you query by order_id
→ Need to check ALL shards!
GOOD: shard_key = user_id, query by user
→ Single shard query
3. HOTSPOTS: Will some keys be much hotter?
─────────────────────────────────────────
BAD: shard_key = celebrity_user_id (one shard overloaded)
GOOD: shard_key = hash(celebrity_user_id) with rate limiting
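One way to sanity-check question 1 before committing: hash a representative sample of real keys and look at the per-shard counts. This is a quick, hypothetical check script, not a substitute for load testing:

```python
import hashlib
from collections import Counter

def shard_of(key, num_shards=4):
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_shards

# Sample of candidate shard keys (user ids here); run the same check with
# country codes and the skew becomes obvious immediately.
sample_user_ids = range(1_000_000, 1_100_000)
counts = Counter(shard_of(uid) for uid in sample_user_ids)
print(counts)  # roughly 25,000 per shard -> evenly distributed key
```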
The Scatter-Gather Problem
SINGLE-SHARD QUERY (Good):
══════════════════════════
Query: SELECT * FROM orders WHERE user_id = 123
┌─────────┬─────────┬─────────┬─────────┐
│ Shard 0 │ Shard 1 │ Shard 2 │ Shard 3 │
└─────────┴─────────┴────┬────┴─────────┘
                         │
                    Query hits
                    ONE shard
                         │
                         ▼
                      Response
Latency: ~10ms
SCATTER-GATHER QUERY (Expensive):
═════════════════════════════════
Query: SELECT * FROM orders WHERE product_id = 456
┌─────────┬─────────┬─────────┬─────────┐
│ Shard 0 │ Shard 1 │ Shard 2 │ Shard 3 │
└────┬────┴────┬────┴────┬────┴────┬────┘
     │         │         │         │
   Query     Query     Query     Query
     │         │         │         │
     ▼         ▼         ▼         ▼
 Response  Response  Response  Response
     │         │         │         │
     └─────────┴────┬────┴─────────┘
                    │
                  Merge
                 results
                    │
                    ▼
             Final Response
Latency: ~40ms (slowest shard + merge time)
Minimizing scatter-gather:
- Choose shard key based on most common query
- Denormalize data to avoid joins
- Cache frequent scatter-gather results
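As a sketch of the first bullet: a query router can check whether the shard key appears in the query's equality filters and only fall back to scatter-gather when it doesn't. The interface below is hypothetical:

```python
import hashlib

def hash_shard(key, num_shards):
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_shards

def shards_for_query(filters: dict, shard_key: str, num_shards: int) -> list:
    """Return the shards a query must touch, given its equality filters."""
    if shard_key in filters:
        return [hash_shard(filters[shard_key], num_shards)]  # single-shard path
    return list(range(num_shards))  # scatter-gather: every shard

print(shards_for_query({"user_id": 123}, "user_id", 4))     # one shard (~10ms path)
print(shards_for_query({"product_id": 456}, "user_id", 4))  # [0, 1, 2, 3] (~40ms path)
```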
Real-World Trade-off Stories
Instagram: The 64-Bit ID with Embedded Shard Info
Situation: Instagram needed to generate unique identifiers across multiple sharded PostgreSQL databases while handling massive scale (25+ photos and 90+ likes per second). Traditional database auto-increment IDs don't work when data is inserted into many databases simultaneously.
The challenge:
- Need sortable IDs (by time)
- Need 64-bit format (for smaller indexes, Redis storage)
- Need simplicity (few engineers to maintain)
Solution: Instagram developed a distributed ID generation system combining:
- Timestamp (41 bits) - milliseconds since epoch for temporal sortability
- Shard ID (13 bits) - result of modding user ID by number of logical shards (e.g., 31341 % 2000 = 1341)
- Database autoincrement value (10 bits) - for uniqueness within that shard
The genius part: The shard ID is embedded in the ID itself. When you have an ID, you can determine which shard it belongs to without a lookup table!
Result: Can determine shard from ID without lookup. The design is similar in spirit to Twitter's Snowflake, which also packs a timestamp and worker information into a 64-bit ID.
Lesson: Embedding shard information in IDs eliminates lookup overhead. This is a powerful pattern for high-scale systems.
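A rough sketch of how such an ID can be packed and unpacked. The 41/13/10 bit split and the 2000 logical shards follow the description above; the custom epoch and function names are illustrative, not Instagram's actual code:

```python
import time

CUSTOM_EPOCH_MS = 1_293_840_000_000  # illustrative custom epoch (2011-01-01 UTC)
NUM_LOGICAL_SHARDS = 2000

def make_id(user_id: int, sequence: int, now_ms: int = None) -> int:
    """Pack 41 bits of timestamp | 13 bits of shard id | 10 bits of sequence."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    ts = now_ms - CUSTOM_EPOCH_MS
    shard_id = user_id % NUM_LOGICAL_SHARDS       # e.g. 31341 % 2000 = 1341
    return (ts << 23) | (shard_id << 10) | (sequence % 1024)

def shard_from_id(packed_id: int) -> int:
    """Recover the shard from the ID alone, no lookup table needed."""
    return (packed_id >> 10) & ((1 << 13) - 1)

new_id = make_id(user_id=31341, sequence=7)
print(shard_from_id(new_id))  # -> 1341
```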
Discord: From MongoDB to Cassandra, Then Hot Partition Problems
Situation: Discord needed to store trillions of messages. They started with MongoDB in 2015, then migrated to Cassandra in 2017 for scalability.
Initial approach (2015):
- MongoDB single replica set
- Compound index on (channel_id, created_at)
- Performance degraded at 100 million messages when data exceeded RAM
Cassandra migration (2017):
- Partitioned messages by channel and time buckets
- Handled billions of messages
The problem (2022):
- Cassandra cluster expanded to 177 nodes storing trillions of messages
- High-maintenance with unpredictable latency
- Frequent on-call incidents
- Hot partition problems: Active channels created disproportionate load compared to low-activity servers
The challenge: Server sizes vary wildly, so some channels held millions of messages while others held only thousands. Partitioning by channel alone turned the busiest channels into oversized, hot partitions.
Solution: Composite partition key (channel_id, bucket), where a bucket is a fixed time window. Very large channels span multiple buckets, but a channel's recent messages stay in the same partition for efficient loading.
Lesson: Composite keys can solve hotspot problems. When you have varying access patterns (large vs small servers), composite sharding distributes load better than single-key sharding.
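A sketch of the bucketing idea: derive a time bucket from the message timestamp so a channel's recent messages share a partition, while a huge channel still spreads across many buckets over time. The 10-day window and helper names here are assumptions for illustration, not Discord's actual schema:

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 10 * 24 * 3600  # assumed fixed time window (10 days)

def bucket_for(ts: datetime) -> int:
    """Map a message timestamp to its fixed time window."""
    return int(ts.timestamp()) // BUCKET_SECONDS

def partition_key(channel_id: int, ts: datetime) -> tuple:
    """Composite key: big channels span buckets, recent reads stay in one partition."""
    return (channel_id, bucket_for(ts))

msg_time = datetime(2024, 3, 15, tzinfo=timezone.utc)
print(partition_key(channel_id=1234, ts=msg_time))  # -> (1234, <bucket number>)
```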
Re-sharding: The Pain (And How to Avoid It)
Why Re-sharding Hurts
SCENARIO: You have 4 shards, need 8
WITH MODULO HASHING:
════════════════════
hash(key) % 4 = 2 → Shard 2
hash(key) % 8 = 6 → Shard 6
Half of all keys change shards (and growing from 4 to 5 shards would move ~80% of them)!
Data migration required:
• 4 billion rows
• 6 months of work
• Complex dual-write period
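You can measure the blast radius with a quick simulation: count how many of a sample of keys land on a different shard index after the shard count changes. A standalone sketch, not tied to any particular database:

```python
import hashlib

def shard(key, n):
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % n

keys = range(100_000)
moved_4_to_8 = sum(shard(k, 4) != shard(k, 8) for k in keys)
moved_4_to_5 = sum(shard(k, 4) != shard(k, 5) for k in keys)
print(f"4 -> 8 shards: {moved_4_to_8 / len(keys):.0%} of keys move")  # ~50%
print(f"4 -> 5 shards: {moved_4_to_5 / len(keys):.0%} of keys move")  # ~80%
```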
Strategies to Avoid Re-sharding
1. Start with more shards than needed
- Start with 64 shards on 4 servers (16 each)
- Add servers, move whole shards
- No re-hashing needed! (see the sketch after this list)
2. Use consistent hashing
- Minimizes key movement
- Use virtual nodes for balance
3. Directory-based sharding
- Move keys individually
- Update lookup table
- Most flexible, most complex
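Strategy 1 in code: a fixed pool of logical shards mapped to physical servers through a small lookup table. Scaling out means copying whole logical shards to new servers and updating the map; keys themselves never re-hash. The counts and server names below are illustrative:

```python
import hashlib

NUM_LOGICAL_SHARDS = 64  # fixed forever; keys always hash into 0..63

# 64 logical shards spread across 4 physical servers (16 each).
# To scale out, move whole logical shards to new servers and update this map.
logical_to_physical = {s: f"db-server-{s // 16}" for s in range(NUM_LOGICAL_SHARDS)}

def server_for(key) -> str:
    logical = int(hashlib.md5(str(key).encode()).hexdigest(), 16) % NUM_LOGICAL_SHARDS
    return logical_to_physical[logical]

print(server_for(12345))  # a key's logical shard never changes; only the map does
```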
Code Examples
Hash Sharding Implementation
```python
import hashlib

class HashShardRouter:
    def __init__(self, num_shards):
        self.num_shards = num_shards

    def get_shard(self, key):
        """Determine which shard a key belongs to."""
        hash_value = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
        return hash_value % self.num_shards

# Usage
router = HashShardRouter(num_shards=4)
shard = router.get_shard(12345)  # Returns 0, 1, 2, or 3
```
Scatter-Gather Query
```python
import asyncio

class ShardedQuery:
    def __init__(self, shard_connections):
        self.shards = shard_connections

    async def query_shard(self, conn, query, params):
        """Run the query on one shard. Assumes an async driver such as
        asyncpg (conn.fetch); adjust for whatever client you use."""
        return await conn.fetch(query, *params)

    async def scatter_gather(self, query, params):
        """Execute query on all shards, merge results."""
        tasks = [self.query_shard(conn, query, params) for conn in self.shards]
        results = await asyncio.gather(*tasks)
        # Merge per-shard result lists into one
        merged = []
        for shard_results in results:
            merged.extend(shard_results)
        return merged
```
Decision Framework
□ Do I actually need sharding?
→ Can vertical scaling work? Try that first.
→ Can read replicas help? Often sufficient.
□ What's my most common query?
→ That should drive shard key choice.
□ Will data be evenly distributed?
→ High-cardinality key needed.
→ Watch for natural hotspots.
□ Can I avoid cross-shard queries?
→ Denormalize if needed.
→ Accept eventual consistency for aggregates.
□ How will I handle re-sharding?
→ Start with more shards than needed.
→ Consider consistent hashing.
→ Build migration tooling early.
Key Takeaways
- Delay sharding until you truly need it (PostgreSQL handles billions of rows)
- Shard key is critical - hard to change later (Instagram embedded it in IDs)
- Optimize for common queries - minimize scatter-gather (Discord used composite keys)
- Plan for growth - use consistent hashing or over-provision shards
- Denormalize - avoid cross-shard joins
- Build migration tooling early - you WILL need to re-shard
What's Next
Now that you understand how to split data across machines, the next question is: How do you keep copies of that data for availability and performance?
In the next article, Replication Patterns - Consistency vs Latency, you'll learn:
- Synchronous vs asynchronous replication trade-offs
- Leader-follower vs multi-leader patterns
- How to handle replica lag (this causes most bugs)
- Real-world replication strategies from production systems
This builds on what you learned here—sharding splits data, but replication keeps copies for availability.
→ Continue to Article 14: Replication Patterns
This article is part of the Backend Engineering Mastery series.