8 min read

#5. TCP Deep Dive - Reliability vs Latency

The One Thing to Remember

TCP trades latency for reliability. Every feature of TCP—the three-way handshake, acknowledgments, retransmissions—adds delay but guarantees delivery. Understanding this trade-off helps you choose the right protocol and debug network issues.


Building on Article 4

In Article 4: CPU Scheduling & Context Switches, you learned how the OS switches between processes when they block on I/O. But here's the question: What are those processes actually waiting for when they do network I/O?

Understanding TCP helps you understand those I/O waits—and why network issues are among the hardest to debug.

Previous: Article 4 - CPU Scheduling & Context Switches


Why This Matters (A Production Horror Story)

I once debugged a service that suddenly stopped accepting new connections. The error: "Cannot assign requested address." Investigation showed 60,000 sockets in TIME_WAIT state. The service was creating a new TCP connection for every HTTP request, and each closed connection sat in TIME_WAIT for 60 seconds. Port exhaustion. The fix? HTTP connection pooling. Two lines of code.

This isn't academic knowledge—it's the difference between:

  • Debugging network issues in hours vs days

    • Understanding TCP states = you know what to check (ss -tan)
    • Not understanding = you blame the load balancer, the firewall, everything
  • Choosing the right protocol

    • Understanding TCP vs UDP = you pick the right tool
    • Not understanding = you use TCP for everything, hit latency limits
  • Building high-throughput systems

    • Understanding connection pooling = you avoid port exhaustion
    • Not understanding = your service crashes under load

Quick Win: Check Your TCP Connections

Before we dive deeper, let's see what your system is doing:

# Count connections by state
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

# Expected healthy output:
#   1500 ESTAB
#     50 TIME-WAIT
#      1 LISTEN

# Problematic outputs:
#  60000 TIME-WAIT    → Connection pooling issue!
#    500 CLOSE-WAIT   → App not closing connections!
#    500 SYN-SENT     → Server not responding!

The TCP Mental Model

TCP is a Reliable Pipe

Imagine sending letters through an unreliable postal service:

  • Letters might get lost
  • Letters might arrive out of order
  • Letters might arrive twice

TCP transforms this into a reliable phone call:

  • Everything arrives
  • Everything arrives in order
  • Everything arrives exactly once

The cost? Extra paperwork (headers, ACKs) and waiting (retransmissions).


The Three-Way Handshake

CLIENT                                        SERVER
   │                                             │
   │──────── SYN (seq=100) ───────────────────► │
   │         "Hi, I want to connect"             │
   │         "My sequence starts at 100"         │
   │                                             │
   │◄─────── SYN-ACK (seq=300, ack=101) ─────── │
   │         "Hi back! I acknowledge your 100"   │
   │         "My sequence starts at 300"         │
   │                                             │
   │──────── ACK (seq=101, ack=301) ──────────► │
   │         "Great, I acknowledge your 300"     │
   │                                             │
   │         CONNECTION ESTABLISHED              │
   │◄────────────────────────────────────────►  │

Why three steps?
1. Client proves it can send
2. Server proves it can send AND receive
3. Client proves it can receive

Both sides now know: "We can communicate bidirectionally"

Cost: roughly 1 RTT before the client can send its first byte of data (the final ACK can carry data), plus additional round trips if TLS runs on top. On a local network, this is ~1ms. Across continents, it's 100-200ms. This is why connection pooling matters.
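You can observe this cost directly. Below is a minimal sketch (assuming example.com is reachable on port 443 from your machine) that times socket.create_connection(), which returns once the handshake has completed:

import socket
import time

def measure_connect_latency(host, port, attempts=5):
    """Time connection setup -- roughly one RTT for the 3-way handshake."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass  # connect() returns once the handshake completes
        samples.append((time.perf_counter() - start) * 1000)
    return min(samples)

# A nearby host shows ~1 ms; a cross-continent host often shows 100 ms or more
print(f"Handshake + connect: {measure_connect_latency('example.com', 443):.1f} ms")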


TCP State Machine (The Important States)

                    ┌──────────────┐
                    │    CLOSED    │
                    └──────┬───────┘
                           │
        ┌──────────────────┼──────────────────┐
        │ Server           │           Client │
        │ listen()         │          connect()│
        ▼                  │                  ▼
 ┌──────────────┐          │         ┌──────────────┐
 │   LISTEN     │          │         │   SYN_SENT   │
 └──────┬───────┘          │         └──────┬───────┘
        │ recv SYN         │                │ recv SYN-ACK
        ▼                  │                │ send ACK
 ┌──────────────┐          │                │
 │   SYN_RCVD   │          │                │
 └──────┬───────┘          │                │
        │ recv ACK         │                │
        └──────────────────┼────────────────┘
                           │
                    ┌──────▼───────┐
                    │ ESTABLISHED  │◄─── Normal data transfer
                    └──────┬───────┘
                           │
                           │ close()
                           ▼
                    ┌──────────────┐
                    │   FIN_WAIT   │
                    └──────┬───────┘
                           │ recv ACK + FIN
                           ▼
                    ┌──────────────┐
                    │  TIME_WAIT   │◄─── The famous 2*MSL wait!
                    └──────┬───────┘
                           │ 60-120 seconds
                           ▼
                    ┌──────────────┐
                    │    CLOSED    │
                    └──────────────┘

TCP States: What Each Means

State          What's Happening                   Common Problem
LISTEN         Server waiting for connections     None
SYN_SENT       Client waiting for SYN-ACK         Server not responding
SYN_RCVD       Server waiting for ACK             SYN flood attack
ESTABLISHED    Normal data flow                   None
FIN_WAIT_1/2   Closing, waiting for ACK           Slow close
TIME_WAIT      Waiting 2*MSL (60-120s)            Port exhaustion!
CLOSE_WAIT     Received FIN, app hasn't closed    App bug!
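
If you prefer doing this from code, here is a minimal sketch (Linux only, IPv4 sockets only; it assumes /proc/net/tcp is readable) that produces the same per-state counts as the ss one-liner above:

# Count TCP socket states by reading /proc/net/tcp (Linux, IPv4;
# add /proc/net/tcp6 for IPv6 sockets).
from collections import Counter

STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT",   "03": "SYN_RECV",
    "04": "FIN_WAIT1",   "05": "FIN_WAIT2",  "06": "TIME_WAIT",
    "07": "CLOSE",       "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN",      "0B": "CLOSING",
}

def tcp_state_counts(path="/proc/net/tcp"):
    counts = Counter()
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            state_hex = line.split()[3]  # 4th column is the state code
            counts[STATES.get(state_hex, state_hex)] += 1
    return counts

for state, n in tcp_state_counts().most_common():
    print(f"{n:6d} {state}")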

Common Mistakes (I've Made These)

Mistake #1: "Creating a new connection per request is fine"

Why it's wrong: Each connection requires a 3-way handshake (1-2 RTT), and each closed connection sits in TIME_WAIT for 60-120 seconds. At high throughput, you exhaust ephemeral ports.

Real example: An application sending 710 HTTP POST requests/second without keep-alive accumulated ~28,000 TIME_WAIT sockets and exhausted the ephemeral port range; the full case study is below.

Right approach: Always use connection pooling (a stdlib keep-alive sketch follows this list) for:

  • Database connections
  • HTTP/1.1 (keep-alive) or HTTP/2
  • gRPC channels
  • Redis/Memcached
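
A minimal stdlib-only sketch of connection reuse (example.com, the port, and the request count are placeholders; it assumes the server honors HTTP/1.1 keep-alive):

# Reuse one TCP connection for many HTTP/1.1 requests instead of paying
# a handshake plus a TIME_WAIT socket per request.
import http.client

conn = http.client.HTTPConnection("example.com", 80, timeout=5)
try:
    for _ in range(100):
        conn.request("GET", "/")   # reuses the same socket (keep-alive)
        resp = conn.getresponse()
        resp.read()                # drain the body before the next request
finally:
    conn.close()                   # one close -> one TIME_WAIT, not 100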

Mistake #2: "TIME_WAIT is a problem, I should disable it"

Why it's wrong: TIME_WAIT exists for a reason—it prevents old packets from corrupting new connections. Disabling it can cause data corruption.

Right approach: Fix the root cause—use connection pooling. TIME_WAIT is normal, but you shouldn't have thousands of them.

Mistake #3: "CLOSE_WAIT is normal"

Why it's wrong: CLOSE_WAIT means the remote side closed the connection, but your application hasn't called close() yet. This is a bug—you're leaking connections.

Right approach: Find where connections are opened but not closed (often in error handling paths). Use finally blocks or context managers.
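A minimal sketch of the pattern (host, port, and payload are placeholders): close the socket in a finally block or context manager so the error path can't leak it into CLOSE_WAIT:

import socket

def send_request(host, port, payload):
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(payload)
        return sock.recv(4096)
    finally:
        sock.close()  # runs on success AND on error -> no leaked socket

# Equivalent, since Python sockets are context managers:
def send_request_ctx(host, port, payload):
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
        return sock.recv(4096)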


Trade-offs: TCP Design Decisions

Trade-off #1: Reliability vs Latency

┌─────────────────────────────────────────────────────────────────┐
│                     RELIABLE (TCP)                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Sender ──► [Data 1] ──► [Data 2] ──► [Data 3] ──►             │
│                ◄─── ACK ───┘     ◄─── ACK ───┘                 │
│                                                                 │
│  If packet lost:                                                │
│  ┌────────────────────────────────────────────────────────┐    │
│  │ Sender waits... timeout... retransmit... wait for ACK │    │
│  │ Total delay: 100-500ms for retransmission!            │    │
│  └────────────────────────────────────────────────────────┘    │
│                                                                 │
│  ✓ Every byte delivered                                        │
│  ✗ One lost packet blocks everything behind it (HOL blocking)  │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                    UNRELIABLE (UDP)                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Sender ──► [Data 1] ──► [Data 2] ──► [Data 3] ──►             │
│              (lost!)        ✓            ✓                      │
│                                                                 │
│  ✓ No waiting for lost packets                                 │
│  ✓ No head-of-line blocking                                    │
│  ✗ Some data never arrives                                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

When to accept UDP's trade-off (a minimal UDP sketch follows this list):
- Video streaming (old frame not useful)
- Gaming (current position > old position)
- DNS (will retry if lost)
- VoIP (missing audio < delayed audio)
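
For contrast, here is a minimal sketch of the UDP side (example.com and port 9999 are placeholders): one sendto() call, no handshake, no ACK, and no notification if the datagram is lost:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"position update", ("example.com", 9999))  # no connection setup
sock.close()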

Trade-off #2: Connection Setup Cost vs State Management

NEW CONNECTION EVERY REQUEST:               CONNECTION POOLING:
─────────────────────────────               ───────────────────

For 1000 requests:

1000 × (3-way handshake)                    1 × (3-way handshake)
= 1000 × ~30ms = 30 seconds wasted          + Reuse connection 999 times
                                            = ~30ms total overhead

Plus:                                       Complexity:
- 1000 TIME_WAIT sockets                    - Connection pool management
- Port exhaustion risk                      - Health checking
- TLS renegotiation each time               - Proper cleanup

Always use connection pooling for:
- Database connections
- HTTP/2 connections
- gRPC channels
- Redis/Memcached

Trade-off #3: Nagle's Algorithm vs Latency

The Nagle + Delayed ACK Disaster

This is the most common TCP performance bug I've seen:

CLIENT (Nagle ON)                    SERVER (Delayed ACK ON)
─────────────────                    ───────────────────────

write("GET /")                       
  → Sent immediately (buffer empty)
                              ──────►recv: "GET /"
                                     Wait for more data to ACK together
                                     (delayed ACK: wait up to 40ms)

write(" HTTP/1.1")
  → Nagle says: wait for ACK first
  → Waiting...                      
                                     Timer expires (40ms)
                              ◄──────ACK finally sent!

  → Now send " HTTP/1.1"
                              ──────►Request complete after 40ms delay!

TOTAL ADDED LATENCY: 40ms for a simple HTTP request!

FIX: TCP_NODELAY for request-response protocols

Enable TCP_NODELAY for:

  • Interactive protocols (SSH)
  • Request-response patterns (HTTP, gRPC)
  • Gaming

Keep Nagle (default) for:

  • Bulk transfers
  • Streaming large data
  • File transfers
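
If you're unsure whether a library has already disabled Nagle, here is a minimal sketch for checking and enabling TCP_NODELAY on an existing socket (sock is assumed to be a connected TCP socket):

import socket

def ensure_nodelay(sock):
    # getsockopt returns 0 when Nagle is still active
    if not sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)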

Real-World Trade-off Stories

AppsFlyer: TCP Connections That Refused to Die

Situation: AppsFlyer Engineering documented a production TCP connection leak that required deep Linux networking stack analysis. The service had connections stuck in various states, preventing proper cleanup.

Investigation: Using ss, netstat, and tcpdump, they traced the issue to application-level socket leaks where connections weren't properly closed, leading to CLOSE_WAIT accumulation.

Key insight: Understanding which side enters TIME_WAIT is critical—TIME_WAIT appears on the side that actively closes (sends FIN first), while CLOSE_WAIT appears on the side receiving the FIN. This distinction is often misunderstood in troubleshooting.


Lesson: Use proper debugging tools (ss, netstat, tcpdump) to monitor connection states. CLOSE_WAIT accumulation means your application has a bug—it's not closing connections properly.

TIME_WAIT Port Exhaustion (Real Production Case)

Situation: An application sending 710 HTTP POST requests/second with non-keep-alive connections accumulated ~28,000 TIME_WAIT connections, matching the ephemeral port range (32768-61000) and preventing new connections.

The math:

  • 710 connections/second closed
  • Each sits in TIME_WAIT for ~30-60 seconds
  • 710 × 30 = ~21,300 TIME_WAITs at any given time
  • Ephemeral port range: ~28,000 ports
  • Result: Port exhaustion, service can't accept new connections

Root causes:

  • Not reusing connections (no connection pooling)
  • High connection throughput without pooling
  • Application bugs: Socket leaks where applications don't properly close sockets

Solutions:

  1. Connection pooling (the best solution)
  2. SO_REUSEADDR on the server (lets a restarted listener rebind a port still in TIME_WAIT)
  3. net.ipv4.tcp_tw_reuse=1 (use with care; it only helps outgoing connections)
  4. Increase the ephemeral port range (net.ipv4.ip_local_port_range)


Lesson: Never create a new TCP connection per request in high-throughput scenarios. Connection pooling is not optional—it's essential.


Code Examples

Setting TCP Options

import socket

def create_optimized_connection(host, port):
    """Create a TCP connection with optimal settings"""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    
    # Disable Nagle's algorithm for request-response
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    
    # Enable keepalive to detect dead connections
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    
    # Linux-specific: Tune keepalive timing
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
    
    sock.connect((host, port))
    return sock
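
With the keepalive values above, an unresponsive peer is detected after roughly 60 seconds of idle time plus 5 probes at 10-second intervals, i.e. about 110 seconds, instead of the kernel's 2-hour default. Note that TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT are Linux-specific; on other platforms guard them with hasattr() checks.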

Simple Connection Pool

import socket
import queue

class ConnectionPool:
    """Simple connection pool to avoid TCP overhead"""
    
    def __init__(self, host, port, max_connections=10):
        self.host = host
        self.port = port
        self.pool = queue.Queue(maxsize=max_connections)
        
        # Pre-create connections
        for _ in range(max_connections):
            conn = self._create_connection()
            self.pool.put(conn)
    
    def _create_connection(self):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        sock.connect((self.host, self.port))
        return sock
    
    def get_connection(self, timeout=5):
        """Borrow a connection from the pool"""
        try:
            return self.pool.get(timeout=timeout)
        except queue.Empty:
            return self._create_connection()
    
    def return_connection(self, conn):
        """Return a connection to the pool"""
        try:
            conn.getpeername()  # raises OSError if the socket was never connected (a weak liveness check)
            self.pool.put_nowait(conn)
        except (socket.error, queue.Full):
            try:
                conn.close()
            except OSError:
                pass

# Usage
pool = ConnectionPool('localhost', 8080, max_connections=10)

# Instead of: socket.connect() for every request
conn = pool.get_connection()
try:
    conn.send(b"GET / HTTP/1.1\r\n\r\n")
    response = conn.recv(1024)
finally:
    pool.return_connection(conn)  # Don't close, return to pool!

Debugging TCP Issues

See Connection States

# Count connections by state
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

# Find process using a port
ss -tlnp | grep :8080

# See all connections to a specific port
ss -tan 'dport = :443'

# Watch connections in real-time
watch -n 1 'ss -tan | head -20'

Analyze TCP Settings

# Check buffer sizes
sysctl net.core.rmem_max
sysctl net.core.wmem_max

# Check TIME_WAIT settings
sysctl net.ipv4.tcp_tw_reuse

# Check port range
sysctl net.ipv4.ip_local_port_range

Decision Framework

□ What's my latency requirement?
  → <10ms: TCP_NODELAY, connection pooling
  → <100ms: Standard TCP is fine
  → <1s: Anything works

□ What's my throughput requirement?
  → >10K requests/sec: Connection pooling mandatory
  → >100K: Consider HTTP/2 or gRPC multiplexing
  
□ Is data loss acceptable?
  → No: TCP
  → Yes (real-time): UDP

□ Am I seeing TIME_WAIT accumulation?
  → Fix: Connection pooling
  → Temporary: SO_REUSEADDR, tcp_tw_reuse

□ Am I seeing CLOSE_WAIT accumulation?  
  → Fix: Find and fix connection leak in your code

Key Numbers to Know

Metric                  Typical Value    Notes
Handshake latency       1-2 RTT          3-way handshake
TIME_WAIT duration      60-120s          2 * MSL
Default backlog         128              Increase for high-traffic servers
Ephemeral port range    32768-60999      ~28K ports
TCP keepalive default   2 hours          Too long! Customize it

Memory Trick

"SYN-ACK-DATA-FIN" is like a phone call:

  • SYN: Dialing (ring ring...)
  • SYN-ACK: "Hello?" (picked up)
  • ACK: "Hi, it's me" (confirmed)
  • DATA: The conversation
  • FIN: "Bye!" "Bye!" (mutual hang up)

Self-Assessment

Before moving on:

  • [ ] Can you draw the three-way handshake from memory?
  • [ ] Do you know why TIME_WAIT exists and how to manage it?
  • [ ] Can you diagnose connection problems from ss -tan output?
  • [ ] Do you know when to use TCP_NODELAY?
  • [ ] Do you understand the Nagle + delayed ACK interaction?
  • [ ] Do you know the difference between TIME_WAIT and CLOSE_WAIT?

Key Takeaways

  1. TCP trades latency for reliability - every byte delivered, in order, exactly once
  2. Connection pooling is essential for high-throughput systems (not optional!)
  3. TIME_WAIT is normal but can exhaust ports without pooling
  4. CLOSE_WAIT is a bug in your application (not closing connections)
  5. TCP_NODELAY for request-response protocols (avoids Nagle + Delayed ACK trap)
  6. Always measure - network issues are subtle, use ss, tcpdump, netstat

What's Next

Now that you understand TCP, the next question is: How has HTTP evolved to work better over TCP?

In the next article, HTTP Evolution (1.1→2→3) - Simplicity vs Performance, you'll learn:

  • Why HTTP/2 uses multiplexing (solving TCP head-of-line blocking)
  • Why HTTP/3 uses UDP (QUIC protocol)
  • The trade-offs between simplicity and performance
  • When to use each version in production

This builds on what you learned here—HTTP/2 and HTTP/3 are attempts to work around TCP's limitations while keeping its benefits.

Continue to Article 6: HTTP Evolution


This article is part of the Backend Engineering Mastery series. TCP knowledge is fundamental for debugging network issues.