#4. CPU Scheduling & Context Switches - Throughput vs Latency
The One Thing to Remember
Context switches are expensive; minimize them. Every time the CPU switches between tasks, it must save and restore register state, and the incoming task starts with cold caches and a possibly flushed TLB. High context switch rates indicate contention—too many threads fighting for too few CPUs.
Building on Article 3
In Article 3: File I/O & Durability, you learned how fsync() blocks waiting for disk I/O. When that happens, the OS switches to another process. But here's the question: How does the OS decide which process runs, and what does that switching actually cost?
Understanding CPU scheduling helps you understand why your app might be slow even when CPU isn't busy—and why "adding more threads" often makes things worse.
← Previous: Article 3 - File I/O & Durability
Why This Matters (A Performance Mystery)
I once debugged a service that had terrible performance despite low CPU usage. The mystery: 1000 threads on 4 CPUs, spending 40% of time context switching. Adding more threads made it worse. The fix? Reduce to 8 threads (2× cores) and use async I/O. Performance improved 3x.
This isn't academic knowledge—it's the difference between:
- Sizing thread pools correctly vs wasting resources
  - Understanding context switch cost = you size pools appropriately
  - Not understanding = you create 1000 threads, performance degrades
- Choosing threads vs async I/O
  - Understanding scheduling = you know when async wins
  - Not understanding = you use threads for everything, hit limits
- Diagnosing performance problems
  - Understanding %sy and context switch rates = you know what to measure
  - Not understanding = you blame the database, the network, everything except the actual problem
Quick Win: Check Your Context Switch Rate
Before we dive deeper, let's see what your system is doing:
# Context switches per second (system-wide)
vmstat 1
# Look at 'cs' column - should be < 10,000/sec per core
# Context switches per process
cat /proc/<pid>/status | grep ctxt
# voluntary_ctxt_switches: 1500 ← Waiting for I/O (OK)
# nonvoluntary_ctxt_switches: 200 ← Preempted (might be high)
# CPU usage breakdown
top
# Look at: %sy (system time) - high = too much switching
What to look for:
- Context switches > 50,000/sec per core: Too many threads
- %sy > 30%: Spending too much time in kernel (switching overhead)
- nonvoluntary > voluntary: Too many runnable threads competing
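If you want the same check from code (for a dashboard or a startup sanity check), here's a minimal Python sketch that samples the cumulative ctxt counter in /proc/stat on Linux and reports switches per second; the function names are just illustrative:

```python
import time

def read_ctxt() -> int:
    """Cumulative context-switch count since boot, from /proc/stat (Linux)."""
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt "):
                return int(line.split()[1])
    raise RuntimeError("no 'ctxt' line found in /proc/stat")

def ctxt_switches_per_sec(interval: float = 1.0) -> float:
    """Sample the counter twice and return system-wide switches/sec."""
    before = read_ctxt()
    time.sleep(interval)
    after = read_ctxt()
    return (after - before) / interval

if __name__ == "__main__":
    print(f"system-wide context switches/sec: {ctxt_switches_per_sec():,.0f}")
```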
How CPU Scheduling Works
THE CPU'S PERSPECTIVE:
══════════════════════
The CPU can only run ONE thread at a time (per core).
The scheduler decides who gets to run.
Time ──────────────────────────────────────────────────────────►
Core 0: │ Thread A │ Thread B │ Thread A │ Thread C │ Thread A │
├──────────┼──────────┼──────────┼──────────┼──────────┤
↑ ↑ ↑
Context Context Context
Switch Switch Switch
Each switch:
1. Save Thread A's registers, program counter
2. Pick the next thread to run (scheduler decision)
3. Load Thread B's state (switch address space if it's a different process)
4. Resume Thread B (its caches and TLB entries are now cold)
Cost: 1.3-5.5 microseconds direct overhead
+ Cache misses for next ~millisecond
Linux CFS Scheduler (Completely Fair Scheduler)
┌─────────────────────────────────────────────────────────────────┐
│ LINUX CFS SCHEDULER │
│ (Completely Fair Scheduler) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Goal: Give each task a fair share of CPU time │
│ │
│ How it works: │
│ • Track "virtual runtime" (vruntime) for each task │
│ • Task that ran least gets to run next │
│ • Higher priority = time counts slower (runs more) │
│ │
│ Run Queue (Red-Black Tree): │
│ │
│ ┌─────────────────┐ │
│ │ vruntime: 50 │ │
│ └────────┬────────┘ │
│ ┌────────┴────────┐ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │vrun: 30 │ │vrun: 80 │ │
│ └──────────┘ └──────────┘ │
│ ▲ │
│ │ │
│ Leftmost = lowest vruntime = runs next │
│ │
└─────────────────────────────────────────────────────────────────┘
CFS replaced the old O(1) scheduler in Linux 2.6.23 (2007). It uses a red-black tree to keep runnable tasks ordered by vruntime, always picking the task with the smallest vruntime—ensuring fairness across all running processes. (In Linux 6.6 and later, CFS has been succeeded by the EEVDF scheduler, which keeps the same fairness-based design.)
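To make vruntime concrete, here is a toy Python sketch of the "whoever has run least goes next" rule. It is not the kernel implementation (CFS uses a red-black tree and weights vruntime by nice value); it only illustrates why a higher-weight task ends up running more often:

```python
import heapq

def simulate_cfs(weights: dict[str, int], timeslice_ms: int, total_ms: int) -> list[str]:
    """weights: task name -> weight (higher weight = bigger CPU share)."""
    # Min-heap keyed by vruntime; the real scheduler keeps a red-black tree.
    runqueue = [(0.0, name) for name in weights]
    heapq.heapify(runqueue)
    picks = []
    for _ in range(total_ms // timeslice_ms):
        vruntime, name = heapq.heappop(runqueue)      # leftmost = lowest vruntime
        picks.append(name)
        # Higher weight -> vruntime advances more slowly -> task gets picked again sooner.
        heapq.heappush(runqueue, (vruntime + timeslice_ms / weights[name], name))
    return picks

print(simulate_cfs({"A": 1, "B": 1, "C": 2}, timeslice_ms=10, total_ms=120))
# ['A', 'B', 'C', 'C', 'A', 'B', 'C', 'C', ...]  C gets roughly twice the CPU of A or B
```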
Context Switch Deep Dive
What Happens During a Context Switch
CONTEXT SWITCH BREAKDOWN:
═════════════════════════
1. SAVE STATE (~100-500ns)
├── Save CPU registers (general purpose, floating point)
├── Save program counter (where we were executing)
├── Save stack pointer
└── Save thread-local storage pointer
2. SCHEDULER DECISION (~100-300ns)
├── Update vruntime of current task
├── Find next task to run (tree lookup)
└── Handle priority, affinity constraints
3. RESTORE STATE (~100-500ns)
├── Load new task's registers
├── Load program counter
├── Load stack pointer
└── Switch address space (if different process)
4. CACHE EFFECTS (~1000-10000ns) ← THE REAL COST!
├── TLB flush (if different process)
├── L1/L2 cache now has wrong data
├── Branch predictor state is wrong
└── Prefetch queue is useless
Total: 1.3-5.5 microseconds direct overhead
+ Cache misses for next ~millisecond
Measured overhead: Practical measurements put the direct cost of a context switch at roughly 1.3-5.5 microseconds, depending on the CPU, whether the switch crosses a process boundary, and how much state must be reloaded. Microbenchmarks on typical modern hardware sustain on the order of 600K context switches per second, but at that rate the machine is doing little useful work.
The hidden cost: The direct overhead (1-5μs) is small, but cache pollution is the real killer. After a context switch, the new thread's data isn't in cache, causing thousands of cache misses. Each miss costs ~100 cycles. This is why context switches hurt more than the direct overhead suggests.
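You can get a feel for the direct cost yourself with the classic pipe ping-pong trick: two processes pinned to the same core take turns, so every round trip forces at least two context switches. A rough Linux-only Python sketch follows (the number it prints is an upper bound, since it also includes pipe syscall and Python overhead):

```python
import os
import time

ROUNDS = 50_000

def switch_cost_us() -> float:
    r1, w1 = os.pipe()   # parent -> child
    r2, w2 = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:                          # child: echo every byte back
        for _ in range(ROUNDS):
            os.read(r1, 1)
            os.write(w2, b"x")
        os._exit(0)
    # Pin both processes to one core so they are forced to take turns.
    os.sched_setaffinity(0, {0})
    os.sched_setaffinity(pid, {0})
    start = time.perf_counter()
    for _ in range(ROUNDS):
        os.write(w1, b"x")
        os.read(r2, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    return elapsed / (2 * ROUNDS) * 1e6   # two switches per round trip

if __name__ == "__main__":
    print(f"~{switch_cost_us():.2f} microseconds per switch (upper bound)")
```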
The Hidden Cost: Cache Pollution
BEFORE SWITCH:
══════════════
Thread A has been running. Cache is "warm":
L1 Cache: [A's data] [A's data] [A's data] [A's data]
L2 Cache: [A's data] [A's data] [A's data] [A's data]
TLB: [A's page mappings]
AFTER SWITCH TO THREAD B:
═════════════════════════
Cache still has A's data, but B needs different data:
L1 Cache: [A's data] [B miss!] [A's data] [B miss!]
└─────────────────────────────────────┘
Cache misses!
Every cache miss: ~100 cycles penalty
Thousands of misses after switch = REAL cost
This is why context switches hurt more than the
1-5 microseconds of direct overhead suggests.
Modern Intel and AMD CPUs tag TLB entries with an address-space identifier (PCID on x86), which avoids a full TLB flush on most context switches and removes one historical source of overhead. But cache pollution remains the main cost.
Common Mistakes (I've Made These)
Mistake #1: "More Threads = Better Performance"
Why it's wrong: Beyond CPU core count, you just add context-switching overhead. Each thread consumes memory (stack space, typically 1-8MB), and lock contention increases with thread count.
Real example: the service from the intro had 1000 threads on 4 CPUs and spent 40% of its time context switching. Reducing to 8 threads (2× cores) improved performance 3x.
Right approach:
- CPU-bound: threads = CPU cores
- I/O-bound: threads = cores × (1 + wait_time / compute_time), but cap at reasonable limit (32-64)
Mistake #2: "Context Switches Are Free"
Why it's wrong: Each switch costs 1-5μs direct overhead, plus cache pollution effects that last ~1ms. At high rates (50K+ switches/sec per core), this becomes significant.
Right approach: Monitor context switch rates. If > 50,000/sec per core, you have too many threads competing.
Mistake #3: "Threads Are Always Better Than Async"
Why it's wrong: For I/O-bound work with high concurrency (1000+), async I/O wins because it avoids context switches. Threads are better for CPU-bound work or moderate concurrency.
Right approach: Use async for high-concurrency I/O, threads for CPU work or moderate concurrency.
Trade-offs: Threads vs Async
Thread-per-Request Model
THREAD-PER-REQUEST:
═══════════════════
Request 1 → Thread 1 ────[wait for DB]──── Response
Request 2 → Thread 2 ────[wait for DB]──── Response
Request 3 → Thread 3 ────[wait for DB]──── Response
...
Request 1000 → Thread 1000 ────[wait]──── Response
With 4 CPU cores:
• 1000 threads competing for 4 cores
• ~250 context switches per "round"
• Threads spend most time waiting, not working
• Memory: 1000 threads × 1MB stack = 1GB just for stacks!
Trade-off:
✅ Simple programming model (blocking I/O)
✅ Each request isolated
❌ High memory usage
❌ High context switch overhead
❌ Doesn't scale to 10K+ concurrent requests
Event-Driven / Async Model
ASYNC MODEL:
════════════
Event Loop (1 thread per core)
│
├── Handle Request 1 (start DB query, YIELD)
├── Handle Request 2 (start DB query, YIELD)
├── Handle Request 3 (start DB query, YIELD)
├── DB response for Request 1 ready → complete it
├── Handle Request 4 (start DB query, YIELD)
├── DB response for Request 2 ready → complete it
└── ...
With 4 CPU cores:
• 4 threads, one per core
• Minimal context switches (only at OS scheduler quantum)
• All threads doing useful work
• Memory: 4 threads × 1MB = 4MB for stacks
Trade-off:
✅ Low memory usage
✅ Minimal context switches
✅ Scales to 100K+ concurrent requests
❌ Complex programming model (callbacks/async-await)
❌ One slow handler blocks others (need timeouts)
❌ Harder to debug (stack traces are weird)
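Here is what the event-loop column looks like in code: a minimal asyncio sketch where handle_request and the 100 ms asyncio.sleep stand in for a real non-blocking DB call. One thread overlaps 1,000 waits, so the whole batch finishes in roughly the time of one query:

```python
import asyncio
import time

async def handle_request(req_id: int) -> str:
    await asyncio.sleep(0.1)            # stand-in for a 100 ms non-blocking DB query
    return f"response {req_id}"

async def main() -> None:
    start = time.perf_counter()
    responses = await asyncio.gather(*(handle_request(i) for i in range(1000)))
    elapsed = time.perf_counter() - start
    # ~0.1 s total, not 1000 x 0.1 s: the waits overlap on a single thread.
    print(f"{len(responses)} responses in {elapsed:.2f}s on one thread")

asyncio.run(main())
```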
The Decision Matrix
┌─────────────────────────────────────────────────────────────────┐
│ THREADS vs ASYNC DECISION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Use THREADS when: │
│ • CPU-bound work (threads = cores) │
│ • Simple blocking I/O with moderate concurrency (<100) │
│ • Team unfamiliar with async patterns │
│ • Using languages without good async support │
│ │
│ Use ASYNC when: │
│ • I/O-bound with high concurrency (1000+) │
│ • WebSockets, long-polling, streaming │
│ • Need to minimize memory usage │
│ • Using Node.js, Go, Rust, Python asyncio │
│ │
│ HYBRID (most common): │
│ • Async for I/O operations │
│ • Thread pool for CPU-heavy work │
│ • Example: Node.js cluster + worker threads │
│ │
└─────────────────────────────────────────────────────────────────┘
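The hybrid row deserves a sketch of its own: async for the I/O, with CPU-heavy work pushed to a pool so it never stalls the event loop. This is one way to wire it up with asyncio's run_in_executor; hash_payload is just a stand-in for real CPU-bound work:

```python
import asyncio
import concurrent.futures
import hashlib

cpu_pool = concurrent.futures.ProcessPoolExecutor()   # roughly one worker per core by default

def hash_payload(data: bytes) -> str:
    # Stand-in for CPU-heavy work (image resize, compression, crypto, ...).
    return hashlib.sha256(data * 10_000).hexdigest()

async def handle_request(payload: bytes) -> str:
    await asyncio.sleep(0.05)                          # async I/O: fetch from DB/cache
    loop = asyncio.get_running_loop()
    # Offload the CPU part so other requests keep flowing through the event loop.
    return await loop.run_in_executor(cpu_pool, hash_payload, payload)

async def main() -> None:
    results = await asyncio.gather(*(handle_request(b"req-%d" % i) for i in range(20)))
    print(f"handled {len(results)} requests")

if __name__ == "__main__":                             # guard needed for the process pool
    asyncio.run(main())
```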
Thread Pool Sizing
The Formula
OPTIMAL THREAD POOL SIZE:
═════════════════════════
For CPU-bound tasks:
threads = number_of_cores
For I/O-bound tasks:
threads = number_of_cores × (1 + wait_time / compute_time)
Example:
4 cores, tasks wait 100ms for I/O, compute 10ms
threads = 4 × (1 + 100/10) = 4 × 11 = 44 threads
But! There are limits:
• Memory: Each thread needs stack space (1-8MB)
• Diminishing returns: More threads = more switching
• Amdahl's law: Sequential parts limit scaling
• Practical cap: 32-64 threads for I/O-bound
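The formula and the practical cap fit in a few lines of Python; the 32-thread cap below is just the rule of thumb from this section, not a law:

```python
def io_pool_size(cores: int, wait_ms: float, compute_ms: float, cap: int = 32) -> int:
    """I/O-bound pool size: cores * (1 + wait/compute), capped at a practical limit."""
    ideal = cores * (1 + wait_ms / compute_ms)
    return min(cap, max(cores, round(ideal)))

# Worked example from above: 4 cores, 100 ms waiting, 10 ms computing.
print(io_pool_size(4, wait_ms=100, compute_ms=10))   # 4 * 11 = 44, capped to 32
```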
Practical Guidelines
import os
import concurrent.futures

# CPU-bound work: match cores
cpu_pool = concurrent.futures.ThreadPoolExecutor(
    max_workers=os.cpu_count()
)

# I/O-bound work: more threads OK, but cap it
io_pool = concurrent.futures.ThreadPoolExecutor(
    max_workers=min(32, os.cpu_count() * 4)
)

# Or use a process pool for CPU-bound work (avoids the GIL in Python)
process_pool = concurrent.futures.ProcessPoolExecutor(
    max_workers=os.cpu_count()
)
Real-World Stories
High Context Switch Overhead in Production
Situation: Backend servers handling thousands of concurrent requests can experience significant context switch overhead, especially when processes are frequently blocked waiting for network I/O (Redis, MySQL) or when time slices expire.
The numbers: Microbenchmarks show a machine can sustain on the order of 600K context switches per second, and each switch costs 1.3-5.5 microseconds of direct overhead plus cache pollution. When you have 1000 threads competing for 4 cores, you get ~250 context switches per "round" of scheduling, and threads spend most of their time waiting, not working.
Impact: In high-concurrency scenarios, the cumulative overhead from context switches can noticeably impact overall performance. This is why async I/O (Node.js, Go goroutines, Python asyncio) outperforms thread-per-request models for I/O-bound workloads.
References:
- Measuring Context Switching Overheads for Linux Threads
- Understanding Context Switching and Its Impact
Lesson: Monitor context switch rates. If you're seeing > 50,000 switches/sec per core, you have too many threads competing. Consider async I/O or reducing thread pool size.
Node.js: Why Single-Threaded Works
Question: How can Node.js handle 10,000 connections with one thread?
Answer:
- Event loop handles I/O asynchronously (epoll on Linux)
- No context switches between requests (all in one thread)
- All waiting is done by the OS (epoll waits for I/O events)
- Only compute time uses the CPU
The trade-off:
- CPU-heavy work blocks everything (need worker threads)
- But for I/O-bound workloads, this model scales beautifully
Why it works: With async I/O, you have minimal context switches. The event loop thread handles all I/O operations, and the OS (via epoll) notifies it when I/O is ready. This is why Node.js can handle 10K+ concurrent connections with a single thread: switching between requests is an in-process function call, not a kernel context switch.
Lesson: For I/O-bound high-concurrency workloads, async I/O wins because it avoids context switches. Use threads for CPU-bound work or when you need blocking I/O with moderate concurrency.
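The same readiness mechanism is available from any language. As a rough illustration, here is a bare-bones single-threaded echo server in Python using the selectors module (which sits on top of epoll on Linux); a real server would also handle partial writes and timeouts:

```python
import selectors
import socket

sel = selectors.DefaultSelector()          # epoll on Linux, kqueue on BSD/macOS

def accept(server: socket.socket) -> None:
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, echo)

def echo(conn: socket.socket) -> None:
    data = conn.recv(4096)
    if data:
        conn.send(data)                    # echo back (ignoring partial writes)
    else:
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.bind(("0.0.0.0", 9000))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

# A single thread serves every connection: the kernel tells us which
# sockets are ready, and we only run handlers for those.
while True:
    for key, _ in sel.select():
        key.data(key.fileobj)
```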
Monitoring and Debugging
Key Metrics
# Context switches per second (system-wide)
vmstat 1
# r b swpd free ... cs
# 2 0 0 15000 ... 5000 ← 5000 switches/sec
# Context switches per process
cat /proc/<pid>/status | grep ctxt
# voluntary_ctxt_switches: 1500 ← Waiting for I/O (OK)
# nonvoluntary_ctxt_switches: 200 ← Preempted by scheduler
# CPU usage breakdown
top
# Look at: %us (user), %sy (system), %wa (I/O wait)
# High %sy often means too much context switching
# Per-thread CPU usage
top -H -p <pid>
# Shows each thread's CPU consumption
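If you want these per-process counters in a dashboard or an alert, here is a minimal Python sketch (Linux only, parsing /proc/<pid>/status exactly as the grep above does):

```python
import time

def ctxt_counts(pid: int) -> tuple[int, int]:
    """Return (voluntary, nonvoluntary) context-switch counts for a pid."""
    vol = nonvol = 0
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("voluntary_ctxt_switches:"):
                vol = int(line.split()[1])
            elif line.startswith("nonvoluntary_ctxt_switches:"):
                nonvol = int(line.split()[1])
    return vol, nonvol

def watch(pid: int, interval: float = 5.0) -> None:
    v0, n0 = ctxt_counts(pid)
    while True:
        time.sleep(interval)
        v1, n1 = ctxt_counts(pid)
        # Nonvoluntary climbing faster than voluntary = too many runnable threads.
        print(f"voluntary: {(v1 - v0) / interval:,.0f}/s   "
              f"nonvoluntary: {(n1 - n0) / interval:,.0f}/s")
        v0, n0 = v1, n1
```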
What the Numbers Mean
HEALTHY SYSTEM:
═══════════════
• Context switches: < 10,000/sec per core
• voluntary >> nonvoluntary (waiting for I/O, not preempted)
• %sy < 10% (not spending much time in kernel)
• Load average ≈ number of cores
UNHEALTHY SIGNS:
════════════════
• Context switches > 50,000/sec per core
• nonvoluntary > voluntary (too many runnable threads)
• %sy > 30% (kernel overhead from switching)
• Load average >> number of cores
• Run queue depth >> number of cores
Decision Framework
□ What type of work is it?
→ CPU-bound: threads = cores
→ I/O-bound: consider async or more threads (but cap at 32-64)
□ How many concurrent requests?
→ <100: Thread-per-request is fine
→ 100-10,000: Thread pool with tuning
→ >10,000: Async/event-driven
□ What language/runtime?
→ Python: Async for I/O, multiprocessing for CPU (GIL)
→ Java: Thread pools, consider virtual threads (Java 21+)
→ Go: Goroutines (multiplexed onto a small pool of OS threads)
→ Node.js: Async by default, workers for CPU
□ What's your latency target?
→ Strict (<10ms): Minimize context switches (use async)
→ Relaxed (>100ms): More flexibility
□ Monitoring shows high %sy?
→ Too many threads, reduce pool size
→ Consider async I/O
Key Takeaways
- Context switches are expensive - 1.3-5.5μs direct cost + cache pollution
- Thread pool size matters - too many = worse performance (diminishing returns)
- CPU-bound: threads = cores - more threads = more switching, no benefit
- I/O-bound: async or sized pool - don't waste threads waiting
- Monitor %sy and cs - early warning signs of too many threads
- Hybrid is common - async I/O + thread pool for CPU work
What's Next
Now that you understand how the OS schedules processes and threads, the next question is: How does data actually move between machines?
In the next article, TCP Deep Dive - Reliability vs Latency, you'll learn:
- How TCP ensures reliable data delivery
- Why TCP has connection states (and how to debug them)
- The trade-offs between reliability and latency
- When to use TCP vs UDP
This connects directly to what you learned here—when a thread blocks waiting for network I/O, the OS switches to another process. Understanding TCP helps you understand those I/O waits.
→ Continue to Article 5: TCP Deep Dive
This article is part of the Backend Engineering Mastery series. Understanding CPU scheduling helps you understand why your app might be slow even when CPU isn't busy.