
#3. File I/O & Durability - Why fsync() Is Your Best Friend (And Worst Enemy)


The One Thing to Remember

fsync() is your durability guarantee. Without it, your "written" data might just be sitting in kernel buffers, vulnerable to power loss. With it, you trade performance for safety.

But here's the scary part: Ever had write() return success, only to lose data on a power outage? That's because write() lies—it returns when data is in memory, not on disk.


Building on Article 2

In Article 2: Memory Management, you learned about the page cache—that's where your data goes when you call write(). But here's the critical question: How does data actually get from that page cache to disk?

The journey from your application's memory to physical storage is full of trade-offs. Understanding this path is the difference between losing hours of data and having true durability.

Previous: Article 2 - Memory Management


Why This Matters (A True Horror Story)

I once watched a startup lose 2 hours of user data after a power outage. They thought their database was durable—it wasn't. The database was using default settings that prioritized performance over durability. When the power went out, all those "written" records were still in kernel buffers. Gone.

This isn't academic knowledge—it's the difference between:

  • Losing data vs keeping it safe

    • Understanding fsync() = you know when data is truly durable
    • Not understanding = you think data is safe, but it's not
  • Debugging in hours vs days

    • Knowing the write path = you immediately check fsync() calls
    • Not knowing = you blame the database, the network, everything except the actual problem
  • Choosing the right durability strategy

    • Understanding trade-offs = you pick fsync frequency that matches your needs
    • Not understanding = you either lose data or kill performance

Understanding the write path is the difference between "I think the data is saved" and "I know the data is saved."


Quick Win: Check Your Write Durability

Before we dive deeper, let's see if your system is actually syncing data:

# See how much data is waiting to be written (dirty pages)
cat /proc/meminfo | grep -E "Dirty|Writeback"

# Watch dirty pages in real-time during writes
watch -n 1 'cat /proc/meminfo | grep Dirty'

# Check if a process is calling fsync()
strace -e fsync,fdatasync -p $(pgrep your-app | head -1)

What to look for:

  • High Dirty pages: Lots of data waiting to be written (potential data loss risk)
  • No fsync() calls: Your app might not be syncing (data loss on crash!)
  • Frequent fsync(): Good for durability, but might be slow

The Write Path: From App to Disk

When you call write(), your data takes a long journey:

APPLICATION                    KERNEL                         HARDWARE
┌─────────────────┐        ┌─────────────────┐            ┌─────────────────┐
│                 │        │                 │            │                 │
│  Application    │        │   Page Cache    │            │  Disk           │
│  Buffer         │        │   (RAM)         │            │  (Persistent)   │
│                 │        │                 │            │                 │
│  "Hello World"  │───────►│  "Hello World"  │───────────►│  "Hello World"  │
│                 │ write()│                 │   Later    │                 │
│                 │        │                 │  (maybe)   │                 │
└─────────────────┘        └─────────────────┘            └─────────────────┘
                                   │                              │
                                   │                              │
                            [ VULNERABLE ]                 [ DURABLE ]
                            Power loss here                Power loss here
                            = data LOST                    = data SAFE

The Four Stages of a Write

Stage                    Location              Speed        Durability
1. App buffer            Your process memory   Instant      Lost on crash
2. Page cache            Kernel memory         Fast (~μs)   Lost on power loss
3. Disk cache            Drive's own RAM       N/A          Lost on power loss*
4. Disk platters/cells   Physical storage      Slow (~ms)   Durable

*Enterprise SSDs with capacitors can flush their cache on power loss

Quick Jargon Buster

  • Page Cache: Kernel's memory buffer for file data (from Article 2!)
  • fsync(): System call that forces data from page cache to disk
  • fdatasync(): Like fsync() but only syncs data, not metadata (faster)
  • O_DIRECT: Flag to bypass page cache entirely (advanced, used by databases)
  • Dirty Pages: Modified pages in cache that haven't been written to disk yet
  • Write-Ahead Log (WAL): Write operations to log first, then apply (databases use this)
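
To make the fsync()/fdatasync() distinction concrete, here's a minimal Python sketch (the path and helper name are just illustrative): os.fsync() flushes both data and metadata, while os.fdatasync() skips metadata that isn't needed to read the data back (such as mtime), which can save an extra journal write on some filesystems.

import os

def append_record(path, record, data_only=True):
    # Open for appending; create the file if it doesn't exist yet.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, record)
        if data_only:
            os.fdatasync(fd)  # data is durable; metadata like mtime may lag
        else:
            os.fsync(fd)      # data AND metadata are durable
    finally:
        os.close(fd)

append_record('/tmp/app.log', b'event: user signed up\n')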

Visual: Write Modes Compared

BUFFERED WRITE (Default - DANGEROUS for important data):
═══════════════════════════════════════════════════════

   App ──► write() ──► Page Cache ──► [sometime later...] ──► Disk
                              │
                              └──► write() returns SUCCESS immediately!
   
   Timeline:
   ────────────────────────────────────────────────────────►
   │         │                                      │
   write()   Returns                               Actually on disk
   called    "success"                             (30 sec later??)
   
   ⚠️  Power loss between write() and disk sync = DATA LOST


WRITE + FSYNC (SAFE):
════════════════════

   App ──► write() ──► Page Cache ──► fsync() ──► Disk ──► Returns
                                                              │
                                                              │
   Timeline:                                                  │
   ────────────────────────────────────────────────────────►  │
   │         │                        │                       │
   write()   In page cache           fsync()                  Success!
   called                            waits for disk           Data is safe
   
   ✅ When fsync() returns, data is on physical storage

Common Mistakes (I've Made These)

Mistake #1: "Closing a file flushes it to disk"

Why it's wrong: close() flushes Python/C buffers to OS, not OS buffers to disk. I've lost data because of this assumption.

# WRONG - Not durable!
with open('file.txt', 'w') as f:
    f.write('important data')
# File closed, but data may still be in page cache!

# RIGHT - Durable
with open('file.txt', 'w') as f:
    f.write('important data')
    f.flush()
    os.fsync(f.fileno())

Mistake #2: "rename() is atomic, so I don't need fsync"

Why it's wrong: rename() is atomic for the operation itself, but both the file contents AND the directory entry need to be synced. I've seen this cause data loss.

# WRONG - rename without syncing
with open('file.tmp', 'w') as f:
    f.write(data)
os.rename('file.tmp', 'file.txt')  # Might lose everything!

# RIGHT - sync everything
with open('file.tmp', 'w') as f:
    f.write(data)
    f.flush()
    os.fsync(f.fileno())
os.rename('file.tmp', 'file.txt')
dir_fd = os.open('.', os.O_RDONLY)
os.fsync(dir_fd)  # Sync directory!
os.close(dir_fd)

Mistake #3: "My database handles durability, I don't need to worry"

Why it's wrong: Default settings vary. Many databases default to performance over durability. Always check your database's durability settings.

Check your settings:

-- PostgreSQL
SHOW fsync;           -- Should be 'on'
SHOW synchronous_commit;  -- 'on' for full durability

-- MySQL
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
-- 1 = safest, 2 = flush once per second, 0 = no flush

Trade-offs: The Durability-Performance Spectrum

The Fundamental Trade-off

┌─────────────────────────────────────────────────────────────────────┐
│                DURABILITY ◄───────────────────────► PERFORMANCE     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  fsync every       fsync every       fsync on         No fsync     │
│    write            second          close only        (buffered)   │
│     │                  │                │                │         │
│     ▼                  ▼                ▼                ▼         │
│  ┌───────┐        ┌───────┐        ┌───────┐        ┌───────┐     │
│  │~1,000 │        │~50,000│        │~200,000│       │~500,000│    │
│  │writes │        │writes │        │writes  │       │writes  │    │
│  │/sec   │        │/sec   │        │/sec    │       │/sec    │    │
│  └───────┘        └───────┘        └───────┘        └───────┘     │
│                                                                     │
│  Zero data loss   Lose ~1 sec      Lose data since   Lose all     │
│  on power loss    on power loss    last close        buffered data│
│                                                                     │
│  Use: Databases,  Use: Logs,       Use: Temp files,  Use: Caches, │
│  Financial txns   important data   build artifacts   scratch data │
└─────────────────────────────────────────────────────────────────────┘

Redis AOF Persistence: The Perfect Case Study

Redis demonstrates this trade-off perfectly with its appendfsync options:

┌─────────────────────────────────────────────────────────────────┐
│                 REDIS AOF PERSISTENCE OPTIONS                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  appendfsync always        appendfsync everysec    appendfsync no│
│  ─────────────────        ─────────────────────   ──────────────│
│                                                                 │
│  fsync() after EVERY      fsync() once per        fsync() never │
│  write command            second                  (OS decides)  │
│                                                                 │
│  Performance: ~1K ops/s   Performance: ~100K/s    Performance: Max│
│  Data loss: None          Data loss: ~1 second    Data loss: 30s+│
│                                                                 │
│  Use for:                 Use for:                Use for:      │
│  - Financial data         - Most use cases        - Pure cache  │
│  - Can't lose anything    - Good balance          - Replaceable │
└─────────────────────────────────────────────────────────────────┘

The numbers are shocking: On HDD storage, Redis with appendfsync always achieved only ~49 ops/s, compared to ~97,536 ops/s with everysec - a roughly 2,000x difference! Even on NVMe (Samsung 960 PRO), always mode yielded only ~449 ops/s.

Production recommendation: For most production environments, appendfsync everysec is the standard. It provides adequate durability (accepts up to 1 second of data loss) without the severe performance penalty.
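
If you run your own Redis instance, this trade-off is set in redis.conf. A minimal, illustrative snippet:

# redis.conf (illustrative values)
appendonly yes          # enable the AOF persistence log
appendfsync everysec    # fsync the AOF roughly once per second (the default)
# other options: always (fsync after every write), no (let the OS decide)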



Code Examples

Unsafe vs Safe Write

import os
import time

def unsafe_write(filename, data):
    """
    Data might be lost on power failure!
    write() returns when data is in page cache, NOT on disk.
    """
    with open(filename, 'w') as f:
        f.write(data)
    # File closed, but data might still be in kernel buffers!

def safe_write(filename, data):
    """
    Data survives power failure.
    """
    with open(filename, 'w') as f:
        f.write(data)
        f.flush()        # Flush Python's internal buffer to OS
        os.fsync(f.fileno())  # Flush OS buffer to disk
    
    # IMPORTANT: Also sync the directory for the file metadata!
    dir_fd = os.open(os.path.dirname(os.path.abspath(filename)), os.O_RDONLY)
    os.fsync(dir_fd)
    os.close(dir_fd)

The Atomic Write Pattern (Gold Standard)

import os
import tempfile

def atomic_write(filepath, data):
    """
    The safest write pattern:
    1. Write to temp file
    2. fsync temp file
    3. Rename (atomic on POSIX)
    4. fsync directory
    
    At no point is the file in a partially-written state!
    """
    directory = os.path.dirname(os.path.abspath(filepath))
    
    # Create temp file in same directory (important for rename!)
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    
    try:
        # Write data to temp file
        os.write(fd, data.encode() if isinstance(data, str) else data)
        
        # Ensure data is on disk
        os.fsync(fd)
        os.close(fd)
        
        # Atomic rename - either completes fully or not at all
        os.rename(tmp_path, filepath)
        
        # Sync directory to persist the rename
        dir_fd = os.open(directory, os.O_RDONLY)
        os.fsync(dir_fd)
        os.close(dir_fd)
        
    except:
        # Clean up on error; fd may already be closed by the happy path
        try:
            os.close(fd)
        except OSError:
            pass
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

# Usage
atomic_write('/tmp/important.txt', 'critical data')

# Even if power fails during this:
# - Old file is intact, OR
# - New file is complete
# NEVER a partially-written file!

Real-World Trade-off Stories

PostgreSQL's fsync() Disaster (2018) - "fsyncgate"

Situation: In 2018, the PostgreSQL community discovered a critical vulnerability where kernel I/O error handling could result in silent data loss. This became known as "fsyncgate."

What happened:

  • PostgreSQL assumes that a successful fsync() call means all data written since the last successful call has reached persistent storage
  • When buffered I/O writes fail due to hardware errors, the kernel discards the affected pages and marks them clean
  • When PostgreSQL retried fsync() after an initial failure, the retry succeeded because the kernel had already cleared the error flag—but the data was never actually written to disk
  • This was particularly problematic on XFS (which lacks error-remount behavior), network block devices, thin-provisioned storage, and multipath I/O

The fix: PostgreSQL was patched to PANIC on fsync() failures rather than retry, preventing silent data corruption. This fix was backported to PostgreSQL 11, 10, 9.6, 9.5, and 9.4. Linux kernel 4.13 also improved fsync() error handling with new writeback error reporting infrastructure.

Impact: Similar fixes were adopted by InnoDB/MySQL and MongoDB's WiredTiger. The incident highlighted fundamental issues with how operating systems and applications handle I/O errors that persist across POSIX specifications and multiple operating systems.


Lesson: Even "safe" operations can have edge cases. Defense in depth matters. Don't assume fsync() success means your data is safe—check for errors and handle them appropriately.
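
Here's a hedged sketch of that advice in application code, assuming a crash-and-recover design (the path and the abort strategy are illustrative): treat an fsync() failure as fatal for the data being written instead of retrying.

import os
import sys

def write_durably_or_die(path, payload):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)
        try:
            os.fsync(fd)
        except OSError as e:
            # Don't retry fsync(): after a failure the kernel may have
            # dropped the dirty pages and cleared the error, so a retry
            # can "succeed" without the data ever reaching disk.
            sys.stderr.write(f"fsync failed, aborting: {e}\n")
            os.abort()  # crash and recover from a known-good state
    finally:
        os.close(fd)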

MongoDB's Controversial Default (Data Loss Risk)

Situation: Early MongoDB defaulted to unacknowledged writes and weak read concerns. Multiple companies lost data during crashes because they didn't understand the defaults.

What users expected: "Write complete" = data is safe
What actually happened: "Write complete" = data is in memory (on primary only)

The problem:

  • For years the default write concern was { w: 1 } - the write is acknowledged once it reaches the primary's memory, without waiting for replication
  • If the primary crashes before replication completes, the data is rolled back and lost
  • Default read concern is local - provides no guarantee that read data won't be rolled back in the future
  • This creates a "fire and forget" scenario where data can "just vanish into thin air"

The solution: Set both write concern and read concern to "majority":

// Old dangerous default
db.users.insert({name: "Alice"})  // Returns before durable!

// New safer approach
db.users.insert({name: "Alice"}, {
    writeConcern: { w: "majority", j: true }
})
// w: "majority" = acknowledged by majority of nodes
// j: true = journaled (fsync'd)

Why aren't safer defaults used? MongoDB chose weaker defaults for performance reasons—failovers are infrequent enough that most deployments don't experience significant data loss in practice. However, businesses prioritizing safety should explicitly configure stronger consistency levels.


Lesson: Understand your database's durability guarantees. Don't assume. Always check default settings and configure them for your use case. Test with actual power loss if possible.
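
For Python services, the same fix looks roughly like this with the pymongo driver (connection string, database, and collection names are illustrative):

from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")

# Require acknowledgement from a majority of replica-set members (w)
# and journaling (j), so the write is in the on-disk journal before
# it is acknowledged.
users = client["app"]["users"].with_options(
    write_concern=WriteConcern(w="majority", j=True)
)
users.insert_one({"name": "Alice"})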

Write Amplification and SSD Wear

Situation: Applications that sync user state to disk very frequently can cause write amplification on SSDs, leading to premature wear.

The problem:

  • Each small write triggers a full 4KB page write (SSD write unit)
  • Many small writes = many 4KB writes = "write amplification"
  • SSDs must erase in block units, while host writes occur in smaller page units
  • When valid data density drops, the SSD controller performs garbage collection, creating additional writes
  • This impacts throughput, latency, and accelerates SSD wear

Solutions:

  • Batch small writes into larger ones before syncing
  • Use write-ahead log pattern - write to log, checkpoint periodically
  • Group commit - combine multiple writes into a single batch with one fsync (RocksDB does this)
  • Use O_DIRECT_NO_FSYNC - only fsync when filesystem metadata updates are necessary

For databases: When using innodb_flush_method = O_DIRECT, databases like InnoDB call fsync after each batch of writes rather than per-individual write, reducing fsync frequency.


Lesson: fsync() is expensive, and frequent small fsyncs can wear out SSDs. Batch operations when possible. Use group commit patterns for high-throughput systems.
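
Here's a minimal, illustrative sketch of the group-commit idea (a toy version, not how RocksDB actually implements it): buffer records in memory and pay for one write() and one fsync() per batch instead of one per record.

import os

class BatchedLog:
    """Toy group commit: buffer records, flush with a single fsync."""

    def __init__(self, path, batch_size=100):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.batch_size = batch_size
        self.pending = []

    def append(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        os.write(self.fd, b"".join(self.pending))  # one large write
        os.fsync(self.fd)                          # one fsync per batch
        self.pending.clear()

    def close(self):
        self.flush()
        os.close(self.fd)

# Usage: roughly one fsync per 100 records instead of 100 fsyncs
log = BatchedLog('/tmp/toy.wal')
for i in range(1000):
    log.append(f"event {i}\n".encode())
log.close()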


Common Confusions (Cleared Up)

"write() returns, so data is on disk!"

Reality: write() returns when data is in the page cache (kernel memory), not on disk. It might be written later, or it might be lost on power failure.

"close() flushes data to disk!"

Reality: close() flushes your application's buffer to the OS page cache, but doesn't sync to disk. You still need fsync() for durability.

"rename() is atomic, so I'm safe!"

Reality: rename() is atomic for the operation, but you still need to:

  1. fsync() the file contents before rename
  2. fsync() the directory after rename

Otherwise, the rename might complete but the file contents could be lost.

"My database handles this, I don't need to worry!"

Reality: Database defaults vary. Many default to performance over durability. Always check your database's durability settings and test with actual power loss.


Debugging I/O Issues

See Dirty Pages (Unflushed Data)

# How much data is waiting to be written to disk?
cat /proc/meminfo | grep -E "Dirty|Writeback"

Dirty:           12345 kB   # Data modified, not yet written
Writeback:           0 kB   # Data currently being written

# High Dirty = lots of buffered writes
# Watch it during your application's writes
watch -n 1 'cat /proc/meminfo | grep Dirty'

Monitor Disk I/O

# Real-time I/O statistics
iostat -x 1

Device   r/s     w/s     await  %util
sda      100     500     2.5    45%

# Key metrics:
# w/s = writes per second
# await = average wait time (ms) - should be < 10ms for SSD
# %util = how busy the disk is - 100% = saturated

Trace System Calls

# See what a process is writing
strace -e write,fsync,fdatasync -p <PID>

# Example output:
write(3, "data here...", 100) = 100
fsync(3) = 0   # <-- This is where durability happens!

Decision Framework

□ Is this data replaceable?
  → Yes: Buffered writes are fine
  → No: Need fsync

□ What's the acceptable data loss window?
  → 0 seconds: fsync every write
  → 1 second: fsync every second (batch)
  → 30 seconds: OS default
  → Don't care: Pure buffer

□ What's my write throughput requirement?
  → < 1,000/sec: fsync every write is fine
  → 1,000-100,000/sec: Batch writes, fsync periodically
  → > 100,000/sec: Consider async durability, accept some loss

□ Is this a database?
  → Always enable WAL/journaling
  → Check your durability settings
  → Test with actual power-off!

Performance Numbers to Know

Operation                  Typical Latency   Notes
write() to page cache      ~1 μs             Just memory copy
fsync() to SSD             ~100-500 μs       Actual I/O
fsync() to HDD             ~5-15 ms          Mechanical seek
O_DIRECT write to SSD      ~50-100 μs        Bypass cache

Rule of thumb: fsync() is 100-1000x slower than buffered write.
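
You can sanity-check these orders of magnitude yourself with a rough micro-benchmark. This is just a sketch, and the numbers will vary with your hardware, filesystem, and mount options:

import os
import time

def time_writes(path, n=1000, do_fsync=False):
    # Average microseconds per 256-byte write, with or without fsync.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.perf_counter()
    for _ in range(n):
        os.write(fd, b"x" * 256)
        if do_fsync:
            os.fsync(fd)
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.unlink(path)
    return elapsed / n * 1e6

print(f"buffered:      {time_writes('/tmp/bench.dat'):8.1f} us/write")
print(f"write + fsync: {time_writes('/tmp/bench.dat', do_fsync=True):8.1f} us/write")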


Memory Trick

"WASD" for write durability:

  • Write: Puts data in kernel buffer (not safe!)
  • All the way: fsync() gets it to disk
  • Sync directory: Don't forget metadata
  • Double-check: Verify with your specific hardware

Self-Assessment

Before moving on:

  • [ ] Can you explain what happens between write() returning and data being on disk?
  • [ ] Know when O_DIRECT is beneficial vs harmful?
  • [ ] Could you implement an atomic file write that survives crashes?
  • [ ] Understand why rename() alone isn't enough for durability?
  • [ ] Know your database's default durability settings?
  • [ ] Understand why close() doesn't guarantee durability?

Key Takeaways

  1. write() lies: Returns success when data is in kernel buffer, not on disk
  2. fsync() is truth: Only guarantee of durability, but expensive (100-1000x slower)
  3. Atomic writes: temp file + fsync + rename + fsync directory
  4. Batch for performance: fsync once per second, not per write (Redis everysec pattern)
  5. Verify your database: Check durability settings, test with actual power loss
  6. Even fsync() can fail: PostgreSQL's 2018 bug showed that even "safe" operations have edge cases

What's Next

Now that you understand how data flows from memory to disk, the next question is: How does the OS actually schedule all these operations?

In the next article, CPU Scheduling & Context Switches - Throughput vs Latency, you'll learn:

  • How the OS decides which process runs when
  • Why context switches matter for performance
  • The trade-offs between throughput and latency
  • How to measure and optimize scheduling overhead

This connects directly to what you learned here—when fsync() blocks waiting for disk I/O, the OS switches to another process. Understanding scheduling helps you understand why your app might be slow even when CPU isn't busy.

Continue to Article 4: CPU Scheduling & Context Switches


This article is part of the Backend Engineering Mastery series. Understanding I/O is fundamental to building reliable systems. You learned about memory management in Article 2, and now you understand how data flows from that memory to disk.