#4. CPU Scheduling & Context Switches - Throughput vs Latency
The One Thing to Remember
Context switches are expensive; minimize them. Every time the CPU switches between tasks, it must save and restore register state, and the incoming task starts with cold caches and a possibly flushed TLB. High context switch rates indicate contention—too many threads fighting for too few CPUs.
Building on Article 3
In Article 3: File I/O & Durability, you learned how fsync() blocks waiting for disk I/O. When that happens, the OS switches to another process. But here's the question: How does the OS decide which process runs, and what does that switching actually cost?
Understanding CPU scheduling helps you understand why your app might be slow even when CPU isn't busy—and why "adding more threads" often makes things worse.
← Previous: Article 3 - File I/O & Durability
Why This Matters (A Performance Mystery)
I once debugged a service that had terrible performance despite low CPU usage. The mystery: 1000 threads on 4 CPUs, spending 40% of time context switching. Adding more threads made it worse. The fix? Reduce to 8 threads (2× cores) and use async I/O. Performance improved 3x.
This isn't academic knowledge—it's the difference between:
- Sizing thread pools correctly vs wasting resources
  - Understanding context switch cost = you size pools appropriately
  - Not understanding = you create 1000 threads, performance degrades
- Choosing threads vs async I/O
  - Understanding scheduling = you know when async wins
  - Not understanding = you use threads for everything, hit limits
- Diagnosing performance problems
  - Understanding %sy and context switch rates = you know what to measure
  - Not understanding = you blame the database, the network, everything except the actual problem
Quick Win: Check Your Context Switch Rate
Before we dive deeper, let's see what your system is doing:
# Context switches per second (system-wide)
vmstat 1
# Look at 'cs' column - should be < 10,000/sec per core
# Context switches per process
cat /proc/<pid>/status | grep ctxt
# voluntary_ctxt_switches: 1500 ← Waiting for I/O (OK)
# nonvoluntary_ctxt_switches: 200 ← Preempted (might be high)
# CPU usage breakdown
top
# Look at: %sy (system time) - high = too much switching
What to look for:
- Context switches > 50,000/sec per core: Too many threads
- %sy > 30%: Spending too much time in kernel (switching overhead)
- nonvoluntary > voluntary: Too many runnable threads competing
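If you want the same check from code (for a dashboard or a startup sanity check), here's a minimal Python sketch that samples the cumulative ctxt counter in /proc/stat on Linux and reports switches per second; the function names are just illustrative:

```python
import time

def read_ctxt() -> int:
    """Cumulative context-switch count since boot, from /proc/stat (Linux)."""
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt "):
                return int(line.split()[1])
    raise RuntimeError("no 'ctxt' line found in /proc/stat")

def ctxt_switches_per_sec(interval: float = 1.0) -> float:
    """Sample the counter twice and return system-wide switches/sec."""
    before = read_ctxt()
    time.sleep(interval)
    after = read_ctxt()
    return (after - before) / interval

if __name__ == "__main__":
    print(f"system-wide context switches/sec: {ctxt_switches_per_sec():,.0f}")
```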
How CPU Scheduling Works
THE CPU'S PERSPECTIVE:
══════════════════════
The CPU can only run ONE thread at a time (per core).
The scheduler decides who gets to run.
Time ──────────────────────────────────────────────────────────►
Core 0: │ Thread A │ Thread B │ Thread A │ Thread C │ Thread A │
├──────────┼──────────┼──────────┼──────────┼──────────┤
↑ ↑ ↑
Context Context Context
Switch Switch Switch
Each switch:
1. Save Thread A's registers, program counter
2. Pick the next thread to run (scheduler decision)
3. Load Thread B's state (switch address space if it's a different process)
4. Resume Thread B (its caches and TLB entries are now cold)
Cost: 1.3-5.5 microseconds direct overhead
+ Cache misses for next ~millisecond
Linux CFS Scheduler (Completely Fair Scheduler)
┌─────────────────────────────────────────────────────────────────┐
│ LINUX CFS SCHEDULER │
│ (Completely Fair Scheduler) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Goal: Give each task a fair share of CPU time │
│ │
│ How it works: │
│ • Track "virtual runtime" (vruntime) for each task │
│ • Task that ran least gets to run next │
│ • Higher priority = time counts slower (runs more) │
│ │
│ Run Queue (Red-Black Tree): │
│ │
│ ┌─────────────────┐ │
│ │ vruntime: 50 │ │
│ └────────┬────────┘ │
│ ┌────────┴────────┐ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │vrun: 30 │ │vrun: 80 │ │
│ └──────────┘ └──────────┘ │
│ ▲ │
│ │ │
│ Leftmost = lowest vruntime = runs next │
│ │
└─────────────────────────────────────────────────────────────────┘
CFS replaced the old O(1) scheduler in Linux 2.6.23 (2007). It uses a red-black tree to keep runnable tasks ordered by vruntime, always picking the task with the smallest vruntime—ensuring fairness across all running processes. (In Linux 6.6 and later, CFS has been succeeded by the EEVDF scheduler, which keeps the same fairness-based design.)
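To make vruntime concrete, here is a toy Python sketch of the "whoever has run least goes next" rule. It is not the kernel implementation (CFS uses a red-black tree and weights vruntime by nice value); it only illustrates why a higher-weight task ends up running more often:

```python
import heapq

def simulate_cfs(weights: dict[str, int], timeslice_ms: int, total_ms: int) -> list[str]:
    """weights: task name -> weight (higher weight = bigger CPU share)."""
    # Min-heap keyed by vruntime; the real scheduler keeps a red-black tree.
    runqueue = [(0.0, name) for name in weights]
    heapq.heapify(runqueue)
    picks = []
    for _ in range(total_ms // timeslice_ms):
        vruntime, name = heapq.heappop(runqueue)      # leftmost = lowest vruntime
        picks.append(name)
        # Higher weight -> vruntime advances more slowly -> task gets picked again sooner.
        heapq.heappush(runqueue, (vruntime + timeslice_ms / weights[name], name))
    return picks

print(simulate_cfs({"A": 1, "B": 1, "C": 2}, timeslice_ms=10, total_ms=120))
# ['A', 'B', 'C', 'C', 'A', 'B', 'C', 'C', ...]  C gets roughly twice the CPU of A or B
```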
Context Switch Deep Dive
What Happens During a Context Switch
CONTEXT SWITCH BREAKDOWN:
═════════════════════════
1. SAVE STATE (~100-500ns)
├── Save CPU registers (general purpose, floating point)
├── Save program counter (where we were executing)
├── Save stack pointer
└── Save thread-local storage pointer
2. SCHEDULER DECISION (~100-300ns)
├── Update vruntime of current task
├── Find next task to run (tree lookup)
└── Handle priority, affinity constraints
3. RESTORE STATE (~100-500ns)
├── Load new task's registers
├── Load program counter
├── Load stack pointer
└── Switch address space (if different process)
4. CACHE EFFECTS (~1000-10000ns) ← THE REAL COST!
├── TLB flush (if different process)
├── L1/L2 cache now has wrong data
├── Branch predictor state is wrong
└── Prefetch queue is useless
Total: 1.3-5.5 microseconds direct overhead
+ Cache misses for next ~millisecond
Measured overhead: Practical measurements put the direct cost of a context switch at roughly 1.3-5.5 microseconds, depending on the CPU, whether the switch crosses a process boundary, and how much state must be reloaded. Microbenchmarks on typical modern hardware sustain on the order of 600K context switches per second, but at that rate the machine is doing little useful work.
The hidden cost: The direct overhead (1-5μs) is small, but cache pollution is the real killer. After a context switch, the new thread's data isn't in cache, causing thousands of cache misses. Each miss costs ~100 cycles. This is why context switches hurt more than the direct overhead suggests.
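You can get a feel for the direct cost yourself with the classic pipe ping-pong trick: two processes pinned to the same core take turns, so every round trip forces at least two context switches. A rough Linux-only Python sketch follows (the number it prints is an upper bound, since it also includes pipe syscall and Python overhead):

```python
import os
import time

ROUNDS = 50_000

def switch_cost_us() -> float:
    r1, w1 = os.pipe()   # parent -> child
    r2, w2 = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:                          # child: echo every byte back
        for _ in range(ROUNDS):
            os.read(r1, 1)
            os.write(w2, b"x")
        os._exit(0)
    # Pin both processes to one core so they are forced to take turns.
    os.sched_setaffinity(0, {0})
    os.sched_setaffinity(pid, {0})
    start = time.perf_counter()
    for _ in range(ROUNDS):
        os.write(w1, b"x")
        os.read(r2, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    return elapsed / (2 * ROUNDS) * 1e6   # two switches per round trip

if __name__ == "__main__":
    print(f"~{switch_cost_us():.2f} microseconds per switch (upper bound)")
```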
The Hidden Cost: Cache Pollution
BEFORE SWITCH:
══════════════
Thread A has been running. Cache is "warm":
L1 Cache: [A's data] [A's data] [A's data] [A's data]
L2 Cache: [A's data] [A's data] [A's data] [A's data]
TLB: [A's page mappings]
AFTER SWITCH TO THREAD B:
═════════════════════════
Cache still has A's data, but B needs different data:
L1 Cache: [A's data] [B miss!] [A's data] [B miss!]
└─────────────────────────────────────┘
Cache misses!
Every cache miss: ~100 cycles penalty
Thousands of misses after switch = REAL cost
This is why context switches hurt more than the
1-5 microseconds of direct overhead suggests.
Modern Intel and AMD CPUs tag TLB entries with an address-space identifier (PCID on x86), which avoids a full TLB flush on most context switches and removes one historical source of overhead. But cache pollution remains the main cost.
Common Mistakes (I've Made These)
Mistake #1: "More Threads = Better Performance"
Why it's wrong: Beyond CPU core count, you just add context-switching overhead. Each thread consumes memory (stack space, typically 1-8MB), and lock contention increases with thread count.
Real example: the service from the intro had 1000 threads on 4 CPUs and spent 40% of its time context switching. Reducing to 8 threads (2× cores) improved performance 3x.
Right approach:
- CPU-bound: threads = CPU cores
- I/O-bound: threads = cores × (1 + wait_time / compute_time), but cap at reasonable limit (32-64)
Mistake #2: "Context Switches Are Free"
Why it's wrong: Each switch costs 1-5μs direct overhead, plus cache pollution effects that last ~1ms. At high rates (50K+ switches/sec per core), this becomes significant.
Right approach: Monitor context switch rates. If > 50,000/sec per core, you have too many threads competing.
Mistake #3: "Threads Are Always Better Than Async"
Why it's wrong: For I/O-bound work with high concurrency (1000+), async I/O wins because it avoids context switches. Threads are better for CPU-bound work or moderate concurrency.
Right approach: Use async for high-concurrency I/O, threads for CPU work or moderate concurrency.
Trade-offs: Threads vs Async
Thread-per-Request Model
THREAD-PER-REQUEST:
═══════════════════
Request 1 → Thread 1 ────[wait for DB]──── Response
Request 2 → Thread 2 ────[wait for DB]──── Response
Request 3 → Thread 3 ────[wait for DB]──── Response
...
Request 1000 → Thread 1000 ────[wait]──── Response
With 4 CPU cores:
• 1000 threads competing for 4 cores
• ~250 context switches per "round"
• Threads spend most time waiting, not working
• Memory: 1000 threads × 1MB stack = 1GB just for stacks!
Trade-off:
✅ Simple programming model (blocking I/O)
✅ Each request isolated
❌ High memory usage
❌ High context switch overhead
❌ Doesn't scale to 10K+ concurrent requests
Event-Driven / Async Model
ASYNC MODEL:
════════════
Event Loop (1 thread per core)
│
├── Handle Request 1 (start DB query, YIELD)
├── Handle Request 2 (start DB query, YIELD)
├── Handle Request 3 (start DB query, YIELD)
├── DB response for Request 1 ready → complete it
├── Handle Request 4 (start DB query, YIELD)
├── DB response for Request 2 ready → complete it
└── ...
With 4 CPU cores:
• 4 threads, one per core
• Minimal context switches (only at OS scheduler quantum)
• All threads doing useful work
• Memory: 4 threads × 1MB = 4MB for stacks
Trade-off:
✅ Low memory usage
✅ Minimal context switches
✅ Scales to 100K+ concurrent requests
❌ Complex programming model (callbacks/async-await)
❌ One slow handler blocks others (need timeouts)
❌ Harder to debug (stack traces are weird)
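Here is what the event-loop column looks like in code: a minimal asyncio sketch where handle_request and the 100 ms asyncio.sleep stand in for a real non-blocking DB call. One thread overlaps 1,000 waits, so the whole batch finishes in roughly the time of one query:

```python
import asyncio
import time

async def handle_request(req_id: int) -> str:
    await asyncio.sleep(0.1)            # stand-in for a 100 ms non-blocking DB query
    return f"response {req_id}"

async def main() -> None:
    start = time.perf_counter()
    responses = await asyncio.gather(*(handle_request(i) for i in range(1000)))
    elapsed = time.perf_counter() - start
    # ~0.1 s total, not 1000 x 0.1 s: the waits overlap on a single thread.
    print(f"{len(responses)} responses in {elapsed:.2f}s on one thread")

asyncio.run(main())
```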
The Decision Matrix
┌─────────────────────────────────────────────────────────────────┐
│ THREADS vs ASYNC DECISION │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Use THREADS when: │
│ • CPU-bound work (threads = cores) │
│ • Simple blocking I/O with moderate concurrency (<100) │
│ • Team unfamiliar with async patterns │
│ • Using languages without good async support │
│ │
│ Use ASYNC when: │
│ • I/O-bound with high concurrency (1000+) │
│ • WebSockets, long-polling, streaming │
│ • Need to minimize memory usage │
│ • Using Node.js, Go, Rust, Python asyncio │
│ │
│ HYBRID (most common): │
│ • Async for I/O operations │
│ • Thread pool for CPU-heavy work │
│ • Example: Node.js cluster + worker threads │
│ │
└─────────────────────────────────────────────────────────────────┘
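The hybrid row deserves a sketch of its own: async for the I/O, with CPU-heavy work pushed to a pool so it never stalls the event loop. This is one way to wire it up with asyncio's run_in_executor; hash_payload is just a stand-in for real CPU-bound work:

```python
import asyncio
import concurrent.futures
import hashlib

cpu_pool = concurrent.futures.ProcessPoolExecutor()   # roughly one worker per core by default

def hash_payload(data: bytes) -> str:
    # Stand-in for CPU-heavy work (image resize, compression, crypto, ...).
    return hashlib.sha256(data * 10_000).hexdigest()

async def handle_request(payload: bytes) -> str:
    await asyncio.sleep(0.05)                          # async I/O: fetch from DB/cache
    loop = asyncio.get_running_loop()
    # Offload the CPU part so other requests keep flowing through the event loop.
    return await loop.run_in_executor(cpu_pool, hash_payload, payload)

async def main() -> None:
    results = await asyncio.gather(*(handle_request(b"req-%d" % i) for i in range(20)))
    print(f"handled {len(results)} requests")

if __name__ == "__main__":                             # guard needed for the process pool
    asyncio.run(main())
```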
Thread Pool Sizing
The Formula
OPTIMAL THREAD POOL SIZE:
═════════════════════════
For CPU-bound tasks:
threads = number_of_cores
For I/O-bound tasks:
threads = number_of_cores × (1 + wait_time / compute_time)
Example:
4 cores, tasks wait 100ms for I/O, compute 10ms
threads = 4 × (1 + 100/10) = 4 × 11 = 44 threads
But! There are limits:
• Memory: Each thread needs stack space (1-8MB)
• Diminishing returns: More threads = more switching
• Amdahl's law: Sequential parts limit scaling
• Practical cap: 32-64 threads for I/O-bound
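The formula and the practical cap fit in a few lines of Python; the 32-thread cap below is just the rule of thumb from this section, not a law:

```python
def io_pool_size(cores: int, wait_ms: float, compute_ms: float, cap: int = 32) -> int:
    """I/O-bound pool size: cores * (1 + wait/compute), capped at a practical limit."""
    ideal = cores * (1 + wait_ms / compute_ms)
    return min(cap, max(cores, round(ideal)))

# Worked example from above: 4 cores, 100 ms waiting, 10 ms computing.
print(io_pool_size(4, wait_ms=100, compute_ms=10))   # 4 * 11 = 44, capped to 32
```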
Practical Guidelines
import os
import concurrent.futures

# CPU-bound work: match cores
cpu_pool = concurrent.futures.ThreadPoolExecutor(
    max_workers=os.cpu_count()
)

# I/O-bound work: more threads OK, but cap it
io_pool = concurrent.futures.ThreadPoolExecutor(
    max_workers=min(32, os.cpu_count() * 4)
)

# Or use a process pool for CPU-bound work (avoids the GIL in Python)
process_pool = concurrent.futures.ProcessPoolExecutor(
    max_workers=os.cpu_count()
)
Real-World Stories
High Context Switch Overhead in Production
Situation: Backend servers handling thousands of concurrent requests can experience significant context switch overhead, especially when processes are frequently blocked waiting for network I/O (Redis, MySQL) or when time slices expire.
The numbers: Microbenchmarks show a machine can sustain on the order of 600K context switches per second, and each switch costs 1.3-5.5 microseconds of direct overhead plus cache pollution. When you have 1000 threads competing for 4 cores, you get ~250 context switches per "round" of scheduling, and threads spend most of their time waiting, not working.
Impact: In high-concurrency scenarios, the cumulative overhead from context switches can noticeably impact overall performance. This is why async I/O (Node.js, Go goroutines, Python asyncio) outperforms thread-per-request models for I/O-bound workloads.
References:
- Measuring Context Switching Overheads for Linux Threads
- Understanding Context Switching and Its Impact
Lesson: Monitor context switch rates. If you're seeing > 50,000 switches/sec per core, you have too many threads competing. Consider async I/O or reducing thread pool size.
Node.js: Why Single-Threaded Works
Question: How can Node.js handle 10,000 connections with one thread?
Answer:
- Event loop handles I/O asynchronously (epoll on Linux)
- No context switches between requests (all in one thread)
- All waiting is done by the OS (epoll waits for I/O events)
- Only compute time uses the CPU
The trade-off:
- CPU-heavy work blocks everything (need worker threads)
- But for I/O-bound workloads, this model scales beautifully
Why it works: With async I/O, you have minimal context switches. The event loop thread handles all I/O operations, and the OS (via epoll) notifies it when I/O is ready. This is why Node.js can handle 10K+ concurrent connections with a single thread: switching between requests is an in-process function call, not a kernel context switch.
Lesson: For I/O-bound high-concurrency workloads, async I/O wins because it avoids context switches. Use threads for CPU-bound work or when you need blocking I/O with moderate concurrency.
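The same readiness mechanism is available from any language. As a rough illustration, here is a bare-bones single-threaded echo server in Python using the selectors module (which sits on top of epoll on Linux); a real server would also handle partial writes and timeouts:

```python
import selectors
import socket

sel = selectors.DefaultSelector()          # epoll on Linux, kqueue on BSD/macOS

def accept(server: socket.socket) -> None:
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, echo)

def echo(conn: socket.socket) -> None:
    data = conn.recv(4096)
    if data:
        conn.send(data)                    # echo back (ignoring partial writes)
    else:
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.bind(("0.0.0.0", 9000))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

# A single thread serves every connection: the kernel tells us which
# sockets are ready, and we only run handlers for those.
while True:
    for key, _ in sel.select():
        key.data(key.fileobj)
```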
Monitoring and Debugging
Key Metrics
# Context switches per second (system-wide)
vmstat 1
# r b swpd free ... cs
# 2 0 0 15000 ... 5000 ← 5000 switches/sec
# Context switches per process
cat /proc/<pid>/status | grep ctxt
# voluntary_ctxt_switches: 1500 ← Waiting for I/O (OK)
# nonvoluntary_ctxt_switches: 200 ← Preempted by scheduler
# CPU usage breakdown
top
# Look at: %us (user), %sy (system), %wa (I/O wait)
# High %sy often means too much context switching
# Per-thread CPU usage
top -H -p <pid>
# Shows each thread's CPU consumption
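If you want these per-process counters in a dashboard or an alert, here is a minimal Python sketch (Linux only, parsing /proc/<pid>/status exactly as the grep above does):

```python
import time

def ctxt_counts(pid: int) -> tuple[int, int]:
    """Return (voluntary, nonvoluntary) context-switch counts for a pid."""
    vol = nonvol = 0
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("voluntary_ctxt_switches:"):
                vol = int(line.split()[1])
            elif line.startswith("nonvoluntary_ctxt_switches:"):
                nonvol = int(line.split()[1])
    return vol, nonvol

def watch(pid: int, interval: float = 5.0) -> None:
    v0, n0 = ctxt_counts(pid)
    while True:
        time.sleep(interval)
        v1, n1 = ctxt_counts(pid)
        # Nonvoluntary climbing faster than voluntary = too many runnable threads.
        print(f"voluntary: {(v1 - v0) / interval:,.0f}/s   "
              f"nonvoluntary: {(n1 - n0) / interval:,.0f}/s")
        v0, n0 = v1, n1
```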
What the Numbers Mean
HEALTHY SYSTEM:
═══════════════
• Context switches: < 10,000/sec per core
• voluntary >> nonvoluntary (waiting for I/O, not preempted)
• %sy < 10% (not spending much time in kernel)
• Load average ≈ number of cores
UNHEALTHY SIGNS:
════════════════
• Context switches > 50,000/sec per core
• nonvoluntary > voluntary (too many runnable threads)
• %sy > 30% (kernel overhead from switching)
• Load average >> number of cores
• Run queue depth >> number of cores
Decision Framework
□ What type of work is it?
→ CPU-bound: threads = cores
→ I/O-bound: consider async or more threads (but cap at 32-64)
□ How many concurrent requests?
→ <100: Thread-per-request is fine
→ 100-10,000: Thread pool with tuning
→ >10,000: Async/event-driven
□ What language/runtime?
→ Python: Async for I/O, multiprocessing for CPU (GIL)
→ Java: Thread pools, consider virtual threads (Java 21+)
→ Go: Goroutines (multiplexed onto a small pool of OS threads)
→ Node.js: Async by default, workers for CPU
□ What's your latency target?
→ Strict (<10ms): Minimize context switches (use async)
→ Relaxed (>100ms): More flexibility
□ Monitoring shows high %sy?
→ Too many threads, reduce pool size
→ Consider async I/O
Key Takeaways
- Context switches are expensive - 1.3-5.5μs direct cost + cache pollution
- Thread pool size matters - too many = worse performance (diminishing returns)
- CPU-bound: threads = cores - more threads = more switching, no benefit
- I/O-bound: async or sized pool - don't waste threads waiting
- Monitor %sy and cs - early warning signs of too many threads
- Hybrid is common - async I/O + thread pool for CPU work
What's Next
Now that you understand how the OS schedules processes and threads, the next question is: How does data actually move between machines?
In the next article, TCP Deep Dive - Reliability vs Latency, you'll learn:
- How TCP ensures reliable data delivery
- Why TCP has connection states (and how to debug them)
- The trade-offs between reliability and latency
- When to use TCP vs UDP
This connects directly to what you learned here—when a thread blocks waiting for network I/O, the OS switches to another process. Understanding TCP helps you understand those I/O waits.
→ Continue to Article 5: TCP Deep Dive
This article is part of the Backend Engineering Mastery series. Understanding CPU scheduling helps you understand why your app might be slow even when CPU isn't busy.