#5. TCP Deep Dive - Reliability vs Latency
The One Thing to Remember
TCP trades latency for reliability. Every feature of TCP—the three-way handshake, acknowledgments, retransmissions—adds delay but guarantees delivery. Understanding this trade-off helps you choose the right protocol and debug network issues.
Building on Article 4
In Article 4: CPU Scheduling & Context Switches, you learned how the OS switches between processes when they block on I/O. But here's the question: What are those processes actually waiting for when they do network I/O?
Understanding TCP helps you understand those I/O waits—and why network issues are among the hardest to debug.
← Previous: Article 4 - CPU Scheduling & Context Switches
Why This Matters (A Production Horror Story)
I once debugged a service that suddenly stopped accepting new connections. The error: "Cannot assign requested address." Investigation showed 60,000 sockets in TIME_WAIT state. The service was creating a new TCP connection for every HTTP request, and each closed connection sat in TIME_WAIT for 60 seconds. Port exhaustion. The fix? HTTP connection pooling. Two lines of code.
This isn't academic knowledge—it's the difference between:
- Debugging network issues in hours vs days
  - Understanding TCP states = you know what to check (ss -tan)
  - Not understanding = you blame the load balancer, the firewall, everything
- Choosing the right protocol
  - Understanding TCP vs UDP = you pick the right tool
  - Not understanding = you use TCP for everything, hit latency limits
- Building high-throughput systems
  - Understanding connection pooling = you avoid port exhaustion
  - Not understanding = your service crashes under load
Quick Win: Check Your TCP Connections
Before we dive deeper, let's see what your system is doing:
# Count connections by state
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn
# Expected healthy output:
# 1500 ESTAB
# 50 TIME-WAIT
# 1 LISTEN
# Problematic outputs:
# 60000 TIME-WAIT → Connection pooling issue!
# 500 CLOSE-WAIT → App not closing connections!
# 500 SYN-SENT → Server not responding!
The TCP Mental Model
TCP is a Reliable Pipe
Imagine sending letters through an unreliable postal service:
- Letters might get lost
- Letters might arrive out of order
- Letters might arrive twice
TCP transforms this into a reliable phone call:
- Everything arrives
- Everything arrives in order
- Everything arrives exactly once
The cost? Extra paperwork (headers, ACKs) and waiting (retransmissions).
The Three-Way Handshake
CLIENT SERVER
│ │
│──────── SYN (seq=100) ───────────────────► │
│ "Hi, I want to connect" │
│ "My sequence starts at 100" │
│ │
│◄─────── SYN-ACK (seq=300, ack=101) ─────── │
│ "Hi back! I acknowledge your 100" │
│ "My sequence starts at 300" │
│ │
│──────── ACK (seq=101, ack=301) ──────────► │
│ "Great, I acknowledge your 300" │
│ │
│ CONNECTION ESTABLISHED │
│◄────────────────────────────────────────► │
Why three steps?
1. Client proves it can send
2. Server proves it can send AND receive
3. Client proves it can receive
Both sides now know: "We can communicate bidirectionally"
Cost: 1 round-trip time (RTT) before the client can send data—the final ACK can carry the first data bytes. On a local network an RTT is ~1ms; across continents it's 100-200ms. Paying that on every request is why connection pooling matters.
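You can observe this handshake cost directly by timing connect(). A minimal sketch (host and port are placeholders you'd point at your own service):

```python
import socket
import time

def connect_time_ms(host, port):
    """Time a TCP connect() call, which includes the 3-way handshake."""
    start = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=5)
    elapsed_ms = (time.perf_counter() - start) * 1000
    sock.close()
    return elapsed_ms
```

On loopback this reports well under a millisecond; against a server on another continent it reports the full 100-200ms RTT shown above—per connection.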
TCP State Machine (The Important States)
┌──────────────┐
│ CLOSED │
└──────┬───────┘
│
┌──────────────────┼──────────────────┐
│ Server │ Client │
│ listen() │ connect()│
▼ │ ▼
┌──────────────┐ │ ┌──────────────┐
│ LISTEN │ │ │ SYN_SENT │
└──────┬───────┘ │ └──────┬───────┘
│ recv SYN │ │ recv SYN-ACK
▼ │ │ send ACK
┌──────────────┐ │ │
│ SYN_RCVD │ │ │
└──────┬───────┘ │ │
│ recv ACK │ │
└──────────────────┼────────────────┘
│
┌──────▼───────┐
│ ESTABLISHED │◄─── Normal data transfer
└──────┬───────┘
│
│ close()
▼
┌──────────────┐
│ FIN_WAIT │
└──────┬───────┘
│ recv ACK + FIN
▼
┌──────────────┐
│ TIME_WAIT │◄─── The famous 2*MSL wait!
└──────┬───────┘
│ 60-120 seconds
▼
┌──────────────┐
│ CLOSED │
└──────────────┘
TCP States: What Each Means
| State | What's Happening | Common Problem |
|---|---|---|
| LISTEN | Server waiting for connections | None |
| SYN_SENT | Client waiting for SYN-ACK | Server not responding |
| SYN_RCVD | Server waiting for ACK | SYN flood attack |
| ESTABLISHED | Normal data flow | None |
| FIN_WAIT_1/2 | Closing, waiting for ACK | Slow close |
| TIME_WAIT | Waiting 2*MSL (60-120s) | Port exhaustion! |
| CLOSE_WAIT | Received FIN, app hasn't closed | App bug! |
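The state table above maps directly onto the `ss -tan` quick-win from earlier. Here's a small sketch of how you might tally states from captured ss output in Python (the sample text below is made up for illustration):

```python
from collections import Counter

def count_states(ss_output):
    """Count TCP connections by state from `ss -tan` text output."""
    lines = ss_output.strip().splitlines()[1:]  # skip the header row
    return Counter(line.split()[0] for line in lines if line.strip())

sample = """State     Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB     0      0      10.0.0.1:443       10.0.0.2:51000
TIME-WAIT 0      0      10.0.0.1:443       10.0.0.2:51001
ESTAB     0      0      10.0.0.1:443       10.0.0.2:51002
"""
counts = count_states(sample)
# counts: Counter({'ESTAB': 2, 'TIME-WAIT': 1})
```

Feed it `subprocess.run(['ss', '-tan'], ...)` output and alert when TIME-WAIT or CLOSE-WAIT counts cross a threshold.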
Common Mistakes (I've Made These)
Mistake #1: "Creating a new connection per request is fine"
Why it's wrong: Each connection requires a 3-way handshake (1-2 RTT), and each closed connection sits in TIME_WAIT for 60-120 seconds. At high throughput, you exhaust ephemeral ports.
Real example: an application sending 710 HTTP POST requests/second without keep-alive accumulated ~28,000 TIME_WAIT connections, saturating the ephemeral port range (detailed in the case study later in this article).
Right approach: Always use connection pooling for:
- Database connections
- HTTP/1.1 (keep-alive) or HTTP/2
- gRPC channels
- Redis/Memcached
Mistake #2: "TIME_WAIT is a problem, I should disable it"
Why it's wrong: TIME_WAIT exists for a reason—it prevents old packets from corrupting new connections. Disabling it can cause data corruption.
Right approach: Fix the root cause—use connection pooling. TIME_WAIT is normal, but you shouldn't have thousands of them.
Mistake #3: "CLOSE_WAIT is normal"
Why it's wrong: CLOSE_WAIT means the remote side closed the connection, but your application hasn't called close() yet. This is a bug—you're leaking connections.
Right approach: Find where connections are opened but not closed (often in error handling paths). Use finally blocks or context managers.
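In Python, sockets support the context-manager protocol, which guarantees close() runs even on error paths: exactly the fix Mistake #3 calls for. A minimal sketch (the host/port/payload are placeholders):

```python
import socket

def query(host, port, payload):
    """Send a request and read one response.

    Using the socket as a context manager guarantees close() runs
    even if sendall/recv raises, so the FIN gets sent and the
    connection never lingers in CLOSE_WAIT on our side.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.connect((host, port))
        sock.sendall(payload)
        return sock.recv(4096)
```

The equivalent bug is opening the socket, hitting an exception in the request logic, and never reaching an explicit close() call.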
Trade-offs: TCP Design Decisions
Trade-off #1: Reliability vs Latency
┌─────────────────────────────────────────────────────────────────┐
│ RELIABLE (TCP) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Sender ──► [Data 1] ──► [Data 2] ──► [Data 3] ──► │
│ ◄─── ACK ───┘ ◄─── ACK ───┘ │
│ │
│ If packet lost: │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Sender waits... timeout... retransmit... wait for ACK │ │
│ │ Total delay: 100-500ms for retransmission! │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ ✓ Every byte delivered │
│ ✗ One lost packet blocks everything behind it (HOL blocking) │
│ │
├─────────────────────────────────────────────────────────────────┤
│ UNRELIABLE (UDP) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Sender ──► [Data 1] ──► [Data 2] ──► [Data 3] ──► │
│ (lost!) ✓ ✓ │
│ │
│ ✓ No waiting for lost packets │
│ ✓ No head-of-line blocking │
│ ✗ Some data never arrives │
│ │
└─────────────────────────────────────────────────────────────────┘
When to accept UDP's trade-off:
- Video streaming (old frame not useful)
- Gaming (current position > old position)
- DNS (will retry if lost)
- VoIP (missing audio < delayed audio)
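UDP's fire-and-forget model is visible in the socket API itself: no connect, no handshake, just datagrams. A minimal loopback sketch:

```python
import socket

# Receiver: bind to an ephemeral loopback port. No listen()/accept(),
# because there is no connection to establish.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(('127.0.0.1', 0))
rx.settimeout(5)
addr = rx.getsockname()

# Sender: sendto() ships one datagram immediately. No handshake,
# no ACK, and no retransmission if the datagram is lost.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b'position update', addr)

data, _ = rx.recvfrom(1024)
# On loopback this arrives; on a real network it might simply never show up.
tx.close()
rx.close()
```

Compare this with the three-way handshake diagram above: the sender never waits for the receiver to agree to anything, which is exactly why a game or VoIP stream gets lower latency and zero delivery guarantees.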
Trade-off #2: Connection Setup Cost vs State Management
NEW CONNECTION EVERY REQUEST:          CONNECTION POOLING:
─────────────────────────────          ───────────────────
For 1000 requests:                     For 1000 requests:
  1000 × (3-way handshake)               1 × (3-way handshake)
  = 1000 × ~30ms = 30 seconds wasted     + reuse connection 999 times
                                         = ~30ms total overhead
Plus:                                  Complexity:
  - 1000 TIME_WAIT sockets               - Connection pool management
  - Port exhaustion risk                 - Health checking
  - TLS renegotiation each time          - Proper cleanup
Always use connection pooling for:
- Database connections
- HTTP/2 connections
- gRPC channels
- Redis/Memcached
Trade-off #3: Nagle's Algorithm vs Latency
The Nagle + Delayed ACK Disaster
This is the most common TCP performance bug I've seen:
CLIENT (Nagle ON) SERVER (Delayed ACK ON)
───────────────── ───────────────────────
write("GET /")
→ Sent immediately (buffer empty)
──────►recv: "GET /"
Wait for more data to ACK together
(delayed ACK: wait up to 40ms)
write(" HTTP/1.1")
→ Nagle says: wait for ACK first
→ Waiting...
Timer expires (40ms)
◄──────ACK finally sent!
→ Now send " HTTP/1.1"
──────►Request complete after 40ms delay!
TOTAL ADDED LATENCY: 40ms for a simple HTTP request!
FIX: TCP_NODELAY for request-response protocols
Enable TCP_NODELAY for:
- Interactive protocols (SSH)
- Request-response patterns (HTTP, gRPC)
- Gaming
Keep Nagle (default) for:
- Bulk transfers
- Streaming large data
- File transfers
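Disabling Nagle is a single setsockopt call, and you can read the flag back to confirm it stuck. A minimal sketch:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Default on most platforms: Nagle enabled (TCP_NODELAY == 0),
# meaning small writes may be coalesced while waiting for ACKs.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read it back: non-zero means Nagle is off and small writes
# go out immediately, sidestepping the 40ms delayed-ACK trap.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```

Most HTTP and RPC libraries (and SSH clients) already set this for you, but it's worth verifying when you see mysterious ~40ms latency floors on small request-response exchanges.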
Real-World Trade-off Stories
AppsFlyer: TCP Connections That Refused to Die
Situation: AppsFlyer Engineering documented a production TCP connection leak that required deep Linux networking stack analysis. The service had connections stuck in various states, preventing proper cleanup.
Investigation: Using ss, netstat, and tcpdump, they traced the issue to application-level socket leaks where connections weren't properly closed, leading to CLOSE_WAIT accumulation.
Key insight: Understanding which side enters TIME_WAIT is critical—TIME_WAIT appears on the side that actively closes (sends FIN first), while CLOSE_WAIT appears on the side receiving the FIN. This distinction is often misunderstood in troubleshooting.
Lesson: Use proper debugging tools (ss, netstat, tcpdump) to monitor connection states. CLOSE_WAIT accumulation means your application has a bug—it's not closing connections properly.
TIME_WAIT Port Exhaustion (Real Production Case)
Situation: An application sending 710 HTTP POST requests/second with non-keep-alive connections accumulated ~28,000 TIME_WAIT connections, matching the ephemeral port range (32768-61000) and preventing new connections.
The math:
- 710 connections/second closed
- Each sits in TIME_WAIT for ~30-60 seconds
- 710 × 30 = ~21,300 TIME_WAITs at any given time
- Ephemeral port range: ~28,000 ports
- Result: Port exhaustion, service can't accept new connections
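The steady-state count above is just Little's law: sockets in TIME_WAIT = close rate × TIME_WAIT duration. A quick check of the numbers:

```python
def time_wait_count(closes_per_sec, time_wait_secs):
    """Steady-state TIME_WAIT sockets (Little's law: L = lambda * W)."""
    return closes_per_sec * time_wait_secs

EPHEMERAL_PORTS = 61000 - 32768  # ~28K usable ports in the default range

print(time_wait_count(710, 30))  # 21300: dangerously close at a 30s timer
print(time_wait_count(710, 60))  # 42600: exceeds the port range at 60s
```

Run your own close rate through this formula before deciding whether pooling is optional for your service. (Spoiler: above a few hundred closes per second, it isn't.)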
Root causes:
- Not reusing connections (no connection pooling)
- High connection throughput without pooling
- Application bugs: Socket leaks where applications don't properly close sockets
Solutions:
- Connection pooling (best solution)
- SO_REUSEADDR on the server
- tcp_tw_reuse=1 (careful: can cause issues)
- Increase the port range: net.ipv4.ip_local_port_range
Lesson: Never create a new TCP connection per request in high-throughput scenarios. Connection pooling is not optional—it's essential.
Code Examples
Setting TCP Options
import socket

def create_optimized_connection(host, port):
    """Create a TCP connection with optimal settings"""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Disable Nagle's algorithm for request-response
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    # Enable keepalive to detect dead connections
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Linux-specific: tune keepalive timing
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)

    sock.connect((host, port))
    return sock
Simple Connection Pool
import socket
import queue

class ConnectionPool:
    """Simple connection pool to avoid TCP setup overhead"""

    def __init__(self, host, port, max_connections=10):
        self.host = host
        self.port = port
        self.pool = queue.Queue(maxsize=max_connections)
        # Pre-create connections
        for _ in range(max_connections):
            self.pool.put(self._create_connection())

    def _create_connection(self):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        sock.connect((self.host, self.port))
        return sock

    def get_connection(self, timeout=5):
        """Borrow a connection from the pool"""
        try:
            return self.pool.get(timeout=timeout)
        except queue.Empty:
            return self._create_connection()

    def return_connection(self, conn):
        """Return a connection to the pool"""
        try:
            conn.getpeername()  # Raises if disconnected
            self.pool.put_nowait(conn)
        except (OSError, queue.Full):
            try:
                conn.close()
            except OSError:
                pass

# Usage
pool = ConnectionPool('localhost', 8080, max_connections=10)

# Instead of socket.connect() for every request:
conn = pool.get_connection()
try:
    conn.send(b"GET / HTTP/1.1\r\n\r\n")
    response = conn.recv(1024)
finally:
    pool.return_connection(conn)  # Don't close, return to pool!
Debugging TCP Issues
See Connection States
# Count connections by state
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn
# Find process using a port
ss -tlnp | grep :8080
# See all connections to a specific port
ss -tan 'dport = :443'
# Watch connections in real-time
watch -n 1 'ss -tan | head -20'
Analyze TCP Settings
# Check buffer sizes
sysctl net.core.rmem_max
sysctl net.core.wmem_max
# Check TIME_WAIT settings
sysctl net.ipv4.tcp_tw_reuse
# Check port range
sysctl net.ipv4.ip_local_port_range
Decision Framework
□ What's my latency requirement?
→ <10ms: TCP_NODELAY, connection pooling
→ <100ms: Standard TCP is fine
→ <1s: Anything works
□ What's my throughput requirement?
→ >10K requests/sec: Connection pooling mandatory
→ >100K: Consider HTTP/2 or gRPC multiplexing
□ Is data loss acceptable?
→ No: TCP
→ Yes (real-time): UDP
□ Am I seeing TIME_WAIT accumulation?
→ Fix: Connection pooling
→ Temporary: SO_REUSEADDR, tcp_tw_reuse
□ Am I seeing CLOSE_WAIT accumulation?
→ Fix: Find and fix connection leak in your code
Key Numbers to Know
| Metric | Typical Value | Notes |
|---|---|---|
| Handshake latency | 1-2 RTT | 3-way handshake |
| TIME_WAIT duration | 60-120s | 2 * MSL |
| Default backlog | 128 | Increase for high-traffic servers |
| Ephemeral port range | 32768-60999 | ~28K ports |
| TCP keepalive default | 2 hours | Too long! Customize it |
Memory Trick
"SYN-ACK-DATA-FIN" is like a phone call:
- SYN: Dialing (ring ring...)
- SYN-ACK: "Hello?" (picked up)
- ACK: "Hi, it's me" (confirmed)
- DATA: The conversation
- FIN: "Bye!" "Bye!" (mutual hang up)
Self-Assessment
Before moving on:
- [ ] Can you draw the three-way handshake from memory?
- [ ] Do you know why TIME_WAIT exists and how to manage it?
- [ ] Can you diagnose connection problems from ss -tan output?
- [ ] Know when to use TCP_NODELAY?
- [ ] Understand the Nagle + Delayed ACK interaction?
- [ ] Know the difference between TIME_WAIT and CLOSE_WAIT?
Key Takeaways
- TCP trades latency for reliability - every byte delivered, in order, exactly once
- Connection pooling is essential for high-throughput systems (not optional!)
- TIME_WAIT is normal but can exhaust ports without pooling
- CLOSE_WAIT is a bug in your application (not closing connections)
- TCP_NODELAY for request-response protocols (avoids Nagle + Delayed ACK trap)
- Always measure - network issues are subtle; use ss, tcpdump, netstat
What's Next
Now that you understand TCP, the next question is: How has HTTP evolved to work better over TCP?
In the next article, HTTP Evolution (1.1→2→3) - Simplicity vs Performance, you'll learn:
- Why HTTP/2 uses multiplexing (solving TCP head-of-line blocking)
- Why HTTP/3 uses UDP (QUIC protocol)
- The trade-offs between simplicity and performance
- When to use each version in production
This builds on what you learned here—HTTP/2 and HTTP/3 are attempts to work around TCP's limitations while keeping its benefits.
→ Continue to Article 6: HTTP Evolution
This article is part of the Backend Engineering Mastery series. TCP knowledge is fundamental for debugging network issues.