#8. Database Indexes Deep Dive - Read Speed vs Write Speed

The One Thing to Remember

An index is a sorted copy of your data for fast lookups. Like a book's index that points to pages, a database index points to rows. The trade-off: faster reads, slower writes (because the index must be updated too).


Building on Article 7

In Article 7: Load Balancing, you learned how to distribute traffic across servers. But here's the question: How do databases actually find your data efficiently when you query it?

Understanding indexes is the #1 skill for database performance. I've seen 30-second queries drop to 5ms with the right index.

Previous: Article 7 - Load Balancing


Why This Matters (A True Performance Story)

I once debugged an API endpoint that took 2,340ms to respond. The database queries were doing sequential scans on multi-million-row tables. Three indexes later: 89ms. A 96% improvement from three lines of SQL.

This isn't academic knowledge—it's the difference between:

  • 30-second queries vs 5ms queries

    • Understanding indexes = you know what to index
    • Not understanding = you blame the database, add more servers, nothing helps
  • Write-heavy systems that work vs systems that don't

    • Understanding index cost = you don't over-index
    • Not understanding = you index everything, writes become slow
  • Debugging slow queries in minutes vs days

    • Understanding EXPLAIN = you know what's wrong
    • Not understanding = you guess, add random indexes, make it worse

Quick Win: Find Your Slow Queries

Before we dive deeper, let's see what's slow:

-- PostgreSQL: Find slow queries
-- (requires the pg_stat_statements extension; on PostgreSQL 13+
--  the columns are named mean_exec_time / total_exec_time)
SELECT query, calls, mean_time, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;

-- Find tables doing sequential scans (probably need indexes)
-- (COALESCE handles tables with no indexes at all, where idx_scan is NULL)
SELECT schemaname, relname, seq_scan, idx_scan,
       seq_scan - COALESCE(idx_scan, 0) AS too_many_seq_scans
FROM pg_stat_user_tables
WHERE seq_scan > COALESCE(idx_scan, 0)
ORDER BY too_many_seq_scans DESC;

How Indexes Work: The B-Tree

Most database indexes use B-Trees (Balanced Trees):

                    B-TREE INDEX ON "age"
                    
                         [50]
                        /    \
                   [25,35]   [75,90]
                   /  |  \    /  |  \
                [10] [30] [40] [60] [80] [95]
                  │    │    │    │    │    │
                  ▼    ▼    ▼    ▼    ▼    ▼
              ┌──────────────────────────────┐
              │    Leaf nodes point to       │
              │    actual row locations      │
              └──────────────────────────────┘

Finding age = 30:
1. Start at root [50] → 30 < 50, go left
2. At [25,35] → 25 < 30 < 35, go middle
3. At [30] → Found! Follow pointer to row

Steps: O(log n) instead of O(n) table scan
For 1 million rows: ~20 comparisons vs 1,000,000

The math: for 1 million rows, a sequential scan checks all 1,000,000 rows. A B-tree lookup needs about 20 comparisons (log₂ 1,000,000 ≈ 20), touching only a handful of pages along the way. That's a 50,000x reduction in comparisons.
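To see the jump from O(n) to O(log n) yourself, here's a minimal sketch (it assumes a users table with an age column; the table and index names are illustrative):

-- Hypothetical users table for illustration
CREATE INDEX idx_users_age ON users(age);

EXPLAIN SELECT * FROM users WHERE age = 30;
-- Before the index: Seq Scan on users
-- After the index:  Index Scan using idx_users_age on users
--                     Index Cond: (age = 30)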


Common Mistakes (I've Made These)

Mistake #1: "More indexes = faster queries"

Why it's wrong: Each index must be updated on every write. I've seen systems with 15+ indexes where every INSERT updated all 15 indexes. Write latency was terrible.

Real example: Instagram's user table had 15+ indexes. Every write updated all 15 indexes. They audited usage, removed 6 unused indexes, combined 3 similar ones. Result: 40% improvement in write latency.

Right approach: Index only what you actually query. Monitor index usage, remove unused ones.

Mistake #2: "Indexing every column is safe"

Why it's wrong: Indexes cost storage, memory, and write performance. Low-selectivity columns (like booleans) don't help much but still cost writes.

Right approach: Index high-selectivity columns (many unique values). For low-selectivity, use partial indexes if needed.

Mistake #3: "Column order doesn't matter in composite indexes"

Why it's wrong: Composite indexes follow the "leftmost prefix" rule. You can't use INDEX(country, city, name) for a query on just city.

Right approach: Put the most selective column first by default, but let your query patterns override that. If you often query by country alone, country must lead, even if another column is more selective.


Trade-offs: The Index Decision

Read vs Write Performance

                         NO INDEX                WITH INDEX
                         ════════                ══════════

SELECT * FROM users      Full table scan         Index lookup
WHERE email = '...'      O(n) = slow             O(log n) = fast
                         1M rows = 1M checks     1M rows = ~20 checks


INSERT INTO users        Just insert row         Insert row AND
VALUES (...)             O(1)                    update each index
                                                 O(log n) per index

UPDATE users             Find row (slow)         Find row (fast)
SET name = '...'         Update row              Update row, plus
WHERE id = 5             O(n) + O(1)             update any index on
                                                 a changed column

The Trade-off Table

Aspect         More Indexes                 Fewer Indexes
══════         ════════════                 ═════════════
SELECT speed   ✅ Faster                    ❌ Slower
INSERT speed   ❌ Slower                    ✅ Faster
UPDATE speed   ❌ Slower (if indexed col)   ✅ Faster
DELETE speed   ❌ Slower                    ✅ Faster
Storage        ❌ More disk space           ✅ Less space
Memory         ❌ More RAM for caching      ✅ Less RAM
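If you want to feel the write-side cost directly, here's a rough psql experiment you can adapt (the table and index names are made up; exact timings depend on your hardware):

-- Baseline: bulk insert into an unindexed table
CREATE TABLE bench (a int, b int, c int);
\timing on   -- psql meta-command that reports elapsed time
INSERT INTO bench SELECT g, g, g FROM generate_series(1, 1000000) AS g;

-- Add indexes, clear the table, and insert again
CREATE INDEX idx_bench_a ON bench(a);
CREATE INDEX idx_bench_b ON bench(b);
CREATE INDEX idx_bench_c ON bench(c);
TRUNCATE bench;
INSERT INTO bench SELECT g, g, g FROM generate_series(1, 1000000) AS g;
-- The second INSERT is noticeably slower: every row now
-- updates three B-Trees in addition to the heap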

Composite Indexes: Order Matters!

The Leftmost Prefix Rule

-- Composite index on (country, city, name)
CREATE INDEX idx_location ON users(country, city, name);

These queries CAN use the index:

  • WHERE country = 'USA' ✓ Uses index
  • WHERE country = 'USA' AND city = 'NYC' ✓ Uses index
  • WHERE country = 'USA' AND city = 'NYC' AND name = 'Alice' ✓ Uses full index

These queries CANNOT use the index:

  • WHERE city = 'NYC' ✗ Can't skip country!
  • WHERE name = 'Alice' ✗ Can't skip country, city!
  • WHERE city = 'NYC' AND name = 'Alice' ✗ Can't skip country!

Think of it like a phone book sorted by:

  • Last Name → First Name → Middle Name
  • You can't find "all Johns" without knowing the last name first.
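You can verify the leftmost prefix rule on your own data by comparing plans. A sketch, assuming the idx_location index from above (the planner's exact choice also depends on table size and statistics):

EXPLAIN SELECT * FROM users WHERE country = 'USA' AND city = 'NYC';
-- Expect: Index Scan using idx_location on users

EXPLAIN SELECT * FROM users WHERE city = 'NYC';
-- Expect: Seq Scan on users (the leading column, country, is missing)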

Choosing Column Order

Rule: Most selective column FIRST (usually)

Example data:
- country: 50 unique values (low selectivity)
- city: 10,000 unique values (medium)  
- user_id: 1M unique values (high selectivity)

For: WHERE country = ? AND user_id = ?

Option A: INDEX(country, user_id)
  Entries sorted by country, then user_id within it.
  Seeks straight to ('USA', 123) → 1 row

Option B: INDEX(user_id, country)
  Entries sorted by user_id, then country.
  Seeks straight to (123, 'USA') → 1 row

With equality conditions on both columns, either order
gives a single O(log n) descent, so they perform about
the same. Selectivity order matters most when one
condition is a range (put equality columns first, the
range column last) or when queries use only a prefix
of the index.

EXCEPTION: If you often query by country alone,
only Option A can serve those queries.

Covering Indexes: Avoid Table Lookups

The problem: Regular indexes require a table lookup to get other columns.

-- Regular index
CREATE INDEX idx_email ON users(email);

-- Query needs name too
SELECT name FROM users WHERE email = 'alice@example.com';

Execution:
1. Find email in index → get row pointer
2. Go to table, fetch row      ← Extra I/O!
3. Return name

The solution: Covering index

-- Covering index (includes all needed columns)
CREATE INDEX idx_email_name ON users(email, name);

-- Same query
SELECT name FROM users WHERE email = 'alice@example.com';

Execution:
1. Find email in index → name is RIGHT THERE!
2. Return name (no table access needed!)

This is an "index-only scan" - much faster!
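A quick way to confirm you're getting the index-only scan, assuming the idx_email_name index above. One PostgreSQL caveat: the planner also consults the visibility map, so a heavily updated table may need a VACUUM before this plan appears:

EXPLAIN SELECT name FROM users WHERE email = 'alice@example.com';
-- Expect: Index Only Scan using idx_email_name on users
--           Index Cond: (email = 'alice@example.com')

-- If you still see a plain Index Scan, refresh the visibility map:
VACUUM users;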

PostgreSQL's INCLUDE syntax (PostgreSQL 11+) is even better:

-- INCLUDE for non-searchable columns
CREATE INDEX idx_email_incl ON users(email) INCLUDE (name, created_at);

-- email is searchable, name/created_at are just along for the ride
-- Smaller index than INDEX(email, name, created_at)

Real-World Trade-off Stories

Real Production Case: 96% Query Improvement

Situation: An API endpoint had 2,340ms response time. The database query was doing sequential scans on users, posts, and likes tables.

Investigation: Using EXPLAIN ANALYZE, they identified three queries performing sequential scans instead of using indexes. The tables had grown beyond the point where sequential scans were acceptable.

Solution: Added three properly designed indexes:

  • Strategic compound indexes that covered multiple query conditions
  • Placed most selective columns first
  • Used covering indexes where possible

Result: Response time dropped from 2,340ms to 89ms—a 96% improvement. This demonstrates that proper indexing can transform database performance.

Lesson: Always run EXPLAIN ANALYZE on slow queries. Sequential scans on large tables are a red flag. Strategic indexes can provide massive performance improvements.

Instagram: Too Many Indexes (Write Performance)

Situation: Instagram's user table had 15+ indexes. Every write (INSERT, UPDATE) had to update all 15 indexes.

Problem:

  • Every write updated 15 indexes
  • Insert latency was too high
  • Followers feature was slow (write-heavy)

Solution:

  • Audit index usage with pg_stat_user_indexes
  • Removed 6 unused indexes (never used in queries)
  • Combined 3 similar indexes into 1 composite index

Result: 40% improvement in write latency. The system could handle more writes because each write had fewer indexes to update.

Lesson: Monitor index usage. Unused indexes are pure overhead—they slow down writes without helping reads. Use pg_stat_user_indexes to find indexes with idx_scan = 0.
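Here's a sketch of that audit query, listing never-scanned indexes together with the space you'd reclaim. Be careful: the counters can be reset, so make sure they cover a representative traffic period, and keep unique indexes, which enforce constraints even when never scanned:

-- PostgreSQL: find indexes that have never been used in a scan
SELECT schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;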

Shopify: The Covering Index Win

Situation: Order lookup query was slow despite having an index on customer_id.

Query:

SELECT order_id, status, total 
FROM orders 
WHERE customer_id = ? 
ORDER BY created_at DESC 
LIMIT 10;

Problem: Index on customer_id, but had to hit table for status, total, and created_at. This caused extra I/O for every row.

Solution:

CREATE INDEX idx_customer_orders ON orders(customer_id, created_at DESC) 
INCLUDE (order_id, status, total);

Result: 10x faster (index-only scan). The query never touches the table—all data comes from the index.

Lesson: Covering indexes can provide massive speedups. If your query selects specific columns, include them in the index (or use INCLUDE) to avoid table lookups.


Index Types Comparison

┌────────────────────────────────────────────────────────────────┐
│                          INDEX TYPES                           │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  B-TREE (Default)                                              │
│  ════════════════                                              │
│  • Best for: Range queries, equality, sorting                  │
│  • WHERE age > 25 AND age < 50 ✓                               │
│  • ORDER BY created_at ✓                                       │
│  • Used by: PostgreSQL, MySQL, most databases                  │
│                                                                │
│  HASH INDEX                                                    │
│  ══════════                                                    │
│  • Best for: Exact equality only                               │
│  • WHERE email = 'user@example.com' ✓                          │
│  • WHERE email LIKE '%@gmail%' ✗                               │
│  • Faster than B-Tree for equality, useless for ranges         │
│                                                                │
│  GIN (Generalized Inverted Index)                              │
│  ════════════════════════════════                              │
│  • Best for: Full-text search, arrays, JSONB                   │
│  • WHERE tags @> ARRAY['python'] ✓                             │
│  • WHERE document @@ 'search query' ✓                          │
│                                                                │
│  GiST (Generalized Search Tree)                                │
│  ══════════════════════════════                                │
│  • Best for: Geometric/spatial data                            │
│  • WHERE location <-> point(x,y) < 1000 ✓                      │
│  • PostGIS uses this                                           │
│                                                                │
│  BRIN (Block Range Index)                                      │
│  ════════════════════════                                      │
│  • Best for: Large tables with natural ordering                │
│  • Time-series data where created_at is sequential             │
│  • Much smaller than B-Tree, less precise                      │
│                                                                │
└────────────────────────────────────────────────────────────────┘
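For the non-default types, usage looks roughly like this. An illustrative sketch; the posts and events tables and their columns are assumptions:

-- GIN: index an array column for containment queries
CREATE INDEX idx_post_tags ON posts USING GIN (tags);
SELECT * FROM posts WHERE tags @> ARRAY['python'];

-- BRIN: tiny index for append-only time-series, assuming
-- created_at correlates with physical row order
CREATE INDEX idx_events_created_brin ON events USING BRIN (created_at);
SELECT * FROM events WHERE created_at > now() - interval '1 day';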

Partial Indexes: Index Only What You Need

-- Full index: indexes ALL rows
CREATE INDEX idx_status ON orders(status);
-- Indexes: pending, processing, shipped, delivered, cancelled

-- Partial index: only active orders (what you actually query)
CREATE INDEX idx_active_orders ON orders(status) 
WHERE status IN ('pending', 'processing');

Benefits:
• Smaller index (fewer rows)
• Faster to update
• Better cache efficiency

Use when:
• You only query a subset of data
• Hot/cold data separation
• Soft deletes (WHERE deleted_at IS NULL)
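A soft-delete setup, for example, might look like this (a sketch assuming a deleted_at timestamp column; note the query must repeat the WHERE condition for the planner to pick the partial index):

-- Only live rows are indexed
CREATE INDEX idx_live_user_email ON users(email)
WHERE deleted_at IS NULL;

SELECT * FROM users
WHERE email = 'alice@example.com' AND deleted_at IS NULL;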

Code Examples

Analyzing Query Plans

-- PostgreSQL: EXPLAIN ANALYZE
EXPLAIN ANALYZE 
SELECT * FROM users WHERE email = 'alice@example.com';

-- Output:
Index Scan using idx_email on users  (cost=0.42..8.44 rows=1)
  Index Cond: (email = 'alice@example.com')
  Actual time: 0.025..0.026 ms
  Rows: 1

-- What to look for:
-- ✓ "Index Scan" or "Index Only Scan" = good
-- ✗ "Seq Scan" on large table = probably needs index
-- ✗ "Bitmap Heap Scan" = index helps but still reading table

-- Without index:
EXPLAIN ANALYZE 
SELECT * FROM users WHERE middle_name = 'Marie';

Seq Scan on users  (cost=0.00..18584.00 rows=100)
  Filter: (middle_name = 'Marie')
  Actual time: 150.025..152.026 ms  ← SLOW!
  Rows Removed by Filter: 999900

Key insight: Always run EXPLAIN ANALYZE first. Without up-to-date statistics (run ANALYZE on the table first!), the query planner can't make good decisions. Most databases can also log queries that exceed a time threshold, which helps you spot problematic queries in production.
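In PostgreSQL that threshold is log_min_duration_statement. A minimal sketch (500ms is an arbitrary starting point; tune it to your latency budget):

-- Log every statement that runs longer than 500 ms
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();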

Creating Indexes Safely

-- WRONG: Locks the table against writes until the build finishes!
CREATE INDEX idx_email ON users(email);

-- RIGHT: Create concurrently (PostgreSQL)
CREATE INDEX CONCURRENTLY idx_email ON users(email);
-- Takes longer but doesn't block writes
-- Note: can't run inside a transaction block, and if it fails
-- it leaves an INVALID index behind that you must drop and retry

-- Monitor progress (PostgreSQL 12+)
SELECT * FROM pg_stat_progress_create_index;

Index Anti-Patterns

Anti-Pattern #1: Indexing Every Column

-- DON'T DO THIS
CREATE INDEX idx_col1 ON users(col1);
CREATE INDEX idx_col2 ON users(col2);
CREATE INDEX idx_col3 ON users(col3);
-- ... 20 more indexes

-- Each write updates ALL indexes!
-- Use composite indexes strategically instead

Anti-Pattern #2: Indexing Low-Selectivity Columns

-- BAD: Boolean has only 2 values
CREATE INDEX idx_active ON users(is_active);
-- Index doesn't help much, adds write overhead

-- BETTER: Partial index if you need it
CREATE INDEX idx_inactive ON users(id) WHERE is_active = false;

Anti-Pattern #3: Wrong Column Order

-- Query pattern: WHERE tenant_id = ? AND created_at > ?
-- 1000 tenants, 1M rows per tenant

-- BAD: Range column first
CREATE INDEX idx_bad ON events(created_at, tenant_id);
-- Scans recent events across ALL tenants, then filters by tenant

-- GOOD: Equality column first, range column last
CREATE INDEX idx_good ON events(tenant_id, created_at);
-- Jumps straight to the tenant, then range-scans by date

Decision Framework

□ Is this column in WHERE clauses frequently?
  → Yes and high selectivity: Index it
  → Yes but low selectivity: Maybe partial index
  → No: Don't index

□ Is this a read-heavy or write-heavy table?
  → Read-heavy: More indexes OK
  → Write-heavy: Minimize indexes

□ Do I need to sort by this column?
  → Yes: Consider including in index
  → With other columns: Composite index

□ Am I querying multiple columns together?
  → Yes: Composite index (most selective first)
  → Consider covering index if same SELECT columns

□ Is the table large (>100K rows)?
  → Yes: Indexes more important
  → No: May not need indexes

Memory Trick

"SILK" for index decisions:

  • Selectivity: High = good index candidate
  • Inclusive: Cover SELECT columns to avoid table lookup
  • Leftmost: Composite index order matters
  • Keep minimal: Every index costs writes

Self-Assessment

Before moving on:

  • [ ] Can you explain why B-Tree is O(log n)?
  • [ ] Know when a composite index helps vs doesn't?
  • [ ] Understand covering indexes and index-only scans?
  • [ ] Can read EXPLAIN output and identify problems?
  • [ ] Know how to find unused indexes?

Key Takeaways

  1. Index = faster reads, slower writes - always a trade-off
  2. Column order matters in composite indexes (leftmost prefix rule)
  3. Covering indexes avoid table lookups entirely (10x speedups possible)
  4. Partial indexes for querying subsets of data (smaller, faster)
  5. Monitor usage - remove unused indexes (they slow writes)
  6. Create concurrently in production to avoid locks
  7. Real impact: Proper indexes can provide 96% query improvement (2,340ms → 89ms)

What's Next

Now that you understand how databases find data efficiently, the next question is: How do databases guarantee that your data stays correct when multiple users are modifying it simultaneously?

In the next article, ACID Transactions Explained - Consistency vs Performance, you'll learn:

  • What ACID actually means (not just definitions)
  • How transactions prevent data corruption
  • The trade-offs between consistency and performance
  • When to use different isolation levels

This builds on what you learned here—indexes help you find data fast, but transactions ensure the data you find is correct.

Continue to Article 9: ACID Transactions


This article is part of the Backend Engineering Mastery series. Index knowledge is essential for database performance.