
#15. Consensus & Raft - Availability vs Strong Consistency

The One Thing to Remember

Consensus is about getting multiple nodes to agree on a single value, even when some nodes fail. Raft solves this with a strong leader that sequences all decisions. If you understand Raft, you understand etcd, Consul, and most coordination systems.


Building on Article 14

In Article 14: Replication Patterns, you learned how to keep copies of data. But here's the question: How do distributed systems actually agree on a single value when nodes can fail?

Consensus powers critical infrastructure—Kubernetes, Kafka, service discovery. Understanding it helps you debug coordination failures and design highly available systems.

Previous: Article 14 - Replication Patterns


Why This Matters (An etcd Outage Story)

I once debugged a Kubernetes cluster that became unresponsive. The culprit: the etcd leader node had been abruptly terminated during an upgrade. The issue manifested as stuck TCP connections, causing "context deadline exceeded" errors. The fix? Restarting the affected distributors. But it took 11 minutes to detect, because the SLO-based alerting was too slow.

This isn't academic knowledge—it's the difference between:

  • Debugging coordination failures in hours vs days

    • Understanding Raft = you know what to check
    • Not understanding = you blame the network, the nodes, everything
  • Designing highly available systems

    • Understanding consensus = you know what guarantees you're getting
    • Not understanding = you assume availability, get consistency instead
  • Knowing when to use consensus vs when not to

    • Understanding trade-offs = you choose the right tool
    • Not understanding = you use etcd for everything, hit limits

The Consensus Problem

THE CHALLENGE:
══════════════

Three servers need to agree on who is the leader.

  Server A: "I think A is leader"
  Server B: "I think B is leader"  
  Server C: "I think A is leader"

  Network partition!
  
  Server A: "Still A, right?"
  Server B: "Can't reach anyone... B is leader?"
  Server C: "A is down... C is leader?"

Now we have THREE leaders = SPLIT BRAIN = DISASTER

REQUIREMENTS FOR CONSENSUS:
═══════════════════════════

1. AGREEMENT: All non-faulty nodes decide the same value
2. VALIDITY: The decided value was proposed by some node
3. TERMINATION: All non-faulty nodes eventually decide
4. INTEGRITY: Each node decides at most once

Solving this correctly is HARD. That's why we use Raft.

Raft Overview

The Three States

┌─────────────────────────────────────────────────────────────────┐
│                      RAFT NODE STATES                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│    ┌──────────────┐                          ┌──────────────┐   │
│    │   FOLLOWER   │                          │    LEADER    │   │
│    │              │    Discovers server      │              │   │
│    │  • Passive   │    with higher term      │  • Active    │   │
│    │  • Responds  │ ◄─────────────────────── │  • Sends     │   │
│    │    to RPCs   │                          │    heartbeat │   │
│    │              │                          │  • Handles   │   │
│    └──────┬───────┘                          │    all       │   │
│           │                                  │    client    │   │
│           │ Election timeout                 │    requests  │   │
│           │ (no heartbeat)                   └──────────────┘   │
│           ▼                                         ▲           │
│    ┌──────────────┐                                 │           │
│    │  CANDIDATE   │ ────────────────────────────────┘           │
│    │              │        Gets majority votes                  │
│    │  • Requests  │                                             │
│    │    votes     │                                             │
│    │  • May become│                                             │
│    │    leader    │                                             │
│    └──────────────┘                                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
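To make the follower-to-candidate transition concrete, here is a minimal Go sketch of a node that waits for heartbeats and starts an election when its randomized timeout fires. The type names, channel wiring, and 150-300ms timeout range are illustrative choices, not taken from any particular Raft implementation.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// State is one of the three Raft roles.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Node is a heavily simplified Raft node: just enough to show the
// follower -> candidate transition. Real implementations also track
// the log, votedFor, commit index, and so on.
type Node struct {
	state       State
	currentTerm int
	heartbeatCh chan struct{} // signalled when AppendEntries arrives from a leader
}

// electionTimeout returns a randomized timeout (here 150-300ms) so that
// nodes rarely time out at the same instant and split the vote.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// runFollower waits for heartbeats; if none arrives before the timeout,
// the node becomes a candidate, increments its term, and would then
// request votes from its peers.
func (n *Node) runFollower() {
	for n.state == Follower {
		select {
		case <-n.heartbeatCh:
			// Leader is alive: loop again, which resets the timer.
		case <-time.After(electionTimeout()):
			n.state = Candidate
			n.currentTerm++
			fmt.Printf("no heartbeat: starting election for term %d\n", n.currentTerm)
		}
	}
}

func main() {
	n := &Node{state: Follower, heartbeatCh: make(chan struct{})}
	n.runFollower() // with no heartbeats arriving, this converts to Candidate once
	fmt.Println("state is now Candidate:", n.state == Candidate)
}

The randomized timeout is the whole trick: it makes it unlikely that two followers become candidates in the same instant and split the vote forever.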

The Raft Algorithm (Simplified)

NORMAL OPERATION (Leader exists):
═════════════════════════════════

1. Client sends request to Leader
2. Leader appends to its log
3. Leader replicates to Followers
4. Once majority acknowledge, Leader commits
5. Leader responds to client

   Client                Leader              Followers
      │                    │                    │
      │──── Request ──────►│                    │
      │                    │── AppendEntries ──►│
      │                    │                    │
      │                    │◄── ACK ───────────│
      │                    │  (majority)        │
      │                    │                    │
      │                    │ COMMITTED!         │
      │◄── Response ───────│                    │
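The commit rule in step 4 is just counting: the leader's own copy of the entry plus the successful follower acknowledgements must reach a majority of the cluster. Here is a small Go sketch of that check; the types and cluster layout are made up for illustration.

package main

import "fmt"

// appendOutcome models the result of one AppendEntries RPC to a follower.
type appendOutcome struct {
	follower string
	ok       bool
}

// committed reports whether a log entry can be committed: the leader's own
// copy plus successful follower replications must reach a majority.
func committed(clusterSize int, acks []appendOutcome) bool {
	votes := 1 // the leader has already appended the entry to its own log
	for _, a := range acks {
		if a.ok {
			votes++
		}
	}
	return votes >= clusterSize/2+1
}

func main() {
	// 5-node cluster: leader + 2 follower ACKs = 3, which meets the majority of 3.
	acks := []appendOutcome{
		{"follower-1", true},
		{"follower-2", true},
		{"follower-3", false}, // slow or partitioned follower
		{"follower-4", false},
	}
	fmt.Println("commit?", committed(5, acks)) // true
}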

Common Mistakes (I've Made These)

Mistake #1: "Raft gives us high availability"

Why it's wrong: Raft is CP (Consistency over Availability). During a network partition, the minority side cannot elect a leader and stops accepting writes. This is by design—it prevents split-brain.

Right approach: Understand that Raft sacrifices availability for consistency. If you need high availability during partitions, use an AP system (eventual consistency).

Mistake #2: "More nodes = better performance"

Why it's wrong: More nodes mean higher write latency (need more acknowledgments). A 5-node cluster has higher latency than a 3-node cluster.

Right approach: Use 3 or 5 nodes (typical). Add more only if you need extra fault tolerance (a 5-node cluster tolerates 2 failures).

Mistake #3: "Raft is for all data"

Why it's wrong: Raft is a bottleneck—all writes go through the leader. For large data volumes or high throughput, use sharded databases with Raft per shard.

Right approach: Use Raft for small, critical data (configuration, metadata). Use sharded databases for large data volumes.
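One common way to apply that advice is "multi-Raft": shard the keyspace and run an independent Raft group per shard, so writes spread across many leaders instead of funnelling through one. Below is a hedged Go sketch of the routing idea only; the hash-based shard function and group count are made up for illustration.

package main

import (
	"fmt"
	"hash/fnv"
)

// raftGroup stands in for an independent Raft cluster that owns one shard.
// In real multi-Raft systems each group has its own leader, log, and followers.
type raftGroup struct{ id int }

func (g raftGroup) propose(key, value string) {
	fmt.Printf("group %d: propose %s=%s through its own leader\n", g.id, key, value)
}

// shardFor hashes a key onto one of the Raft groups, so write load is spread
// across many leaders instead of a single bottleneck.
func shardFor(key string, groups []raftGroup) raftGroup {
	h := fnv.New32a()
	h.Write([]byte(key))
	return groups[int(h.Sum32())%len(groups)]
}

func main() {
	groups := []raftGroup{{0}, {1}, {2}, {3}}
	for _, k := range []string{"user:42", "order:7", "cart:9"} {
		shardFor(k, groups).propose(k, "...")
	}
}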


Trade-offs in Raft

Availability vs Consistency

RAFT IS CP (Consistency over Availability):
══════════════════════════════════════════

Network partition:
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  Majority side (A, B, C)      │    Minority side (D, E)        │
│                               │                                 │
│  Can elect leader ✓           │    Cannot elect leader ✗       │
│  Can commit writes ✓          │    Reads may be stale          │
│                               │    Writes rejected             │
│                               │                                 │
└─────────────────────────────────────────────────────────────────┘

Minority partition:
• CAN still serve local reads (possibly stale)
• CANNOT write (no quorum)
• Protects consistency!

Performance Trade-offs

WRITE LATENCY:
══════════════

Client ──► Leader ──► Followers ──► Leader ──► Client
              │           │
              └───────────┘
        One replication round-trip to a majority
        is required before the leader commits!

CLUSTER SIZE:
═════════════

 Nodes │ Majority │ Tolerates │ Write Latency
═══════╪══════════╪═══════════╪══════════════
   3   │    2     │  1 fail   │  Lowest
   5   │    3     │  2 fails  │  Medium
   7   │    4     │  3 fails  │  Higher

More nodes = more fault tolerance
More nodes = higher write latency
Common choice: 3 or 5 nodes
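The majority and fault-tolerance columns follow directly from integer arithmetic: majority = n/2 + 1 and tolerated failures = n - majority. A few lines of Go reproduce the table rows:

package main

import "fmt"

func main() {
	// For a Raft cluster of n voting members:
	//   majority  = n/2 + 1 (integer division)
	//   tolerates = n - majority failed nodes while still committing writes
	for _, n := range []int{3, 5, 7} {
		majority := n/2 + 1
		tolerates := n - majority
		fmt.Printf("%d nodes: majority %d, tolerates %d failure(s)\n", n, majority, tolerates)
	}
}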

Real-World Trade-off Stories

Grafana Cloud: etcd Outage During Upgrade

Situation: A 12-minute partial outage occurred in Grafana Cloud's Hosted Prometheus service when an etcd leader node was abruptly terminated during a Kubernetes version upgrade.

What happened:

  • The issue manifested as stuck TCP connections from Cortex distributors to the old etcd master
  • This caused "context deadline exceeded" errors
  • Some customers couldn't write metrics
  • The problem wasn't detected immediately—SLO-based alerting took 11 minutes to trigger

The fix: Restarting the affected distributors resolved the issue, but the detection delay was problematic.

Root cause: Bad etcd client setup. The distributors were holding connections to the old etcd leader even after it was terminated.

Lesson: etcd client configuration matters. Connections to old leaders can cause timeouts. Always configure proper connection timeouts and retry logic. Monitor etcd cluster health continuously, not just via SLOs.
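As a concrete illustration of that lesson, here is a sketch of the kind of client-side settings involved, assuming etcd's Go client (clientv3). The endpoints, keepalive intervals, and timeout values are illustrative numbers to tune for your own cluster, not recommendations from the incident report.

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Illustrative endpoints and timeouts.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0:2379", "https://etcd-1:2379", "https://etcd-2:2379"},
		DialTimeout: 5 * time.Second,
		// Keepalives let the client notice a dead peer (such as a terminated
		// leader) instead of holding a stuck TCP connection indefinitely.
		DialKeepAliveTime:    10 * time.Second,
		DialKeepAliveTimeout: 3 * time.Second,
	})
	if err != nil {
		log.Fatalf("connect to etcd: %v", err)
	}
	defer cli.Close()

	// Bound every request with a context deadline so a hung connection
	// fails fast and the call can be retried against another member.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	resp, err := cli.Get(ctx, "config/feature-flags")
	if err != nil {
		log.Fatalf("get: %v", err) // e.g. context deadline exceeded
	}
	log.Printf("found %d key(s)", resp.Count)
}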

etcd Upgrade Issues: Zombie Cluster Members

Situation: When upgrading from etcd v3.5 to v3.6, clusters can experience "zombie members"—etcd nodes that were previously removed but reappear and attempt to rejoin consensus.

The problem:

  • v3.5 used v2store as the membership source of truth
  • v3.6 switched to v3store
  • This created inconsistencies between the two stores
  • Removed members could reappear and cause issues

The fix: Upgrade to v3.5.26 or later before moving to v3.6. This ensures proper migration between stores.

Another issue: Upgrading from etcd v3.5.1-v3.5.19 to v3.6.0 can fail with "too many learner member in cluster" because voting members may revert to learner status during promotion.

Lesson: etcd upgrades require careful planning. Always check version compatibility and upgrade paths. Test upgrades in staging first. Even with Raft's consensus guarantees, operational issues can cause problems.

Data Inconsistency Despite Raft

Situation: Even with Raft's consensus guarantees, etcd can experience data inconsistency issues in production clusters.

Causes:

  • Member addition/removal operations
  • Cluster recovery failures
  • Snapshot restoration bugs
  • Operational mistakes during maintenance

Lesson: Raft provides strong guarantees, but operational issues can still cause problems. Always test cluster operations (add/remove members, snapshots) in staging. Monitor cluster health continuously.


Real Systems Using Raft

etcd (Kubernetes)

KUBERNETES ARCHITECTURE:
════════════════════════

┌─────────────────────────────────────────────────────────────────┐
│                         CONTROL PLANE                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐                                                │
│  │ API Server  │ ◄──── kubectl, controllers                     │
│  └──────┬──────┘                                                │
│         │                                                       │
│         ▼                                                       │
│  ┌─────────────────────────────────────────────┐                │
│  │                 etcd cluster                │                │
│  │  ┌───────┐     ┌───────┐     ┌───────┐      │                │
│  │  │ etcd  │◄───►│ etcd  │◄───►│ etcd  │      │                │
│  │  │ (Raft)│     │ (Raft)│     │ (Raft)│      │                │
│  │  └───────┘     └───────┘     └───────┘      │                │
│  └─────────────────────────────────────────────┘                │
│                                                                 │
│  All cluster state stored in etcd with Raft consensus           │
│  • Pod definitions                                              │
│  • Service configurations                                       │
│  • Secrets                                                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Why etcd for Kubernetes:

  • Strong consistency needed for cluster state
  • Small data volume (metadata, not user data)
  • Low write throughput (configuration changes)
  • Perfect fit for Raft

When to Use Consensus

✅ USE CONSENSUS FOR:

• Configuration management (etcd)
• Leader election (sketched in code below)
• Distributed locks
• Service discovery
• Metadata storage
• Small, critical data

❌ DON'T USE FOR:

• Large data volumes (use sharded DB)
• High-throughput writes (bottleneck at leader)
• Cross-datacenter (latency too high)
• Eventually consistent is OK (simpler options exist)
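For the leader election and distributed lock items in the list above, etcd's Go client ships a concurrency package with Election and Mutex helpers built on its Raft-backed store. A hedged sketch, with illustrative endpoints and key prefixes:

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // illustrative endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session is a lease-backed handle; if this process dies, the lease
	// expires and leadership (or the lock) is released automatically.
	session, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Leader election: Campaign blocks until this node becomes leader for
	// the given prefix (the prefix and value here are made up).
	election := concurrency.NewElection(session, "/elections/scheduler")
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("node-1 is now the leader; do leader-only work here")

	// Distributed lock over the same machinery.
	mutex := concurrency.NewMutex(session, "/locks/reindex")
	if err := mutex.Lock(context.Background()); err != nil {
		log.Fatal(err)
	}
	log.Println("holding /locks/reindex")
	_ = mutex.Unlock(context.Background())
}

Because both the election and the lock sit on a lease-backed session, a crashed holder loses its lease and the role or lock frees itself without manual cleanup.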

Decision Framework

□ Do I need strong consistency?
  → Yes: Consensus is appropriate
  → No: Consider eventual consistency (simpler)

□ Can I tolerate minority unavailability?
  → Yes: Consensus works
  → No: Need different approach (AP system)

□ What's my data volume?
  → Small (GB): Consensus is fine
  → Large (TB+): Shard, with consensus per shard

□ What's my write throughput?
  → Low (<10K/s): Single Raft group works
  → High: Multiple Raft groups or different architecture

Key Takeaways

  1. Raft = strong leader consensus - simple to understand (unlike Paxos)
  2. Majority quorum ensures consistency and fault tolerance
  3. CP system - sacrifices availability during partitions (by design!)
  4. Write latency = replication round-trip time (3-5 nodes typical)
  5. Cluster size - 3 or 5 nodes is typical (more = higher latency)
  6. Know when NOT to use - not for large data or high throughput
  7. Operational issues matter - even with Raft, upgrades and client config can cause problems

What's Next

Now that you understand consensus, the next question is: How do distributed systems handle time when there's no global clock?

In the next article, Time, Clocks & Ordering - Simplicity vs Accuracy, you'll learn:

  • Why physical clocks are unreliable (drift, NTP issues)
  • How logical clocks (Lamport, Vector) solve ordering
  • When to use timestamps vs logical clocks
  • Real-world examples of time-based bugs

This builds on what you learned here—consensus needs ordering, and time is how we usually think about order.

Continue to Article 16: Time, Clocks & Ordering


This article is part of the Backend Engineering Mastery series.