
#3. File I/O & Durability - Why fsync() Is Your Best Friend (And Worst Enemy)


The One Thing to Remember

fsync() is your durability guarantee. Without it, your "written" data might just be sitting in kernel buffers, vulnerable to power loss. With it, you trade performance for safety.

But here's the scary part: Ever had write() return success, only to lose data on a power outage? That's because write() lies—it returns when data is in memory, not on disk.


Building on Article 2

In Article 2: Memory Management, you learned about the page cache—that's where your data goes when you call write(). But here's the critical question: How does data actually get from that page cache to disk?

The journey from your application's memory to physical storage is full of trade-offs. Understanding this path is the difference between losing hours of data and having true durability.

Previous: Article 2 - Memory Management


Why This Matters (A True Horror Story)

I once watched a startup lose 2 hours of user data after a power outage. They thought their database was durable—it wasn't. The database was using default settings that prioritized performance over durability. When the power went out, all those "written" records were still in kernel buffers. Gone.

This isn't academic knowledge—it's the difference between:

  • Losing data vs keeping it safe

    • Understanding fsync() = you know when data is truly durable
    • Not understanding = you think data is safe, but it's not
  • Debugging in hours vs days

    • Knowing the write path = you immediately check fsync() calls
    • Not knowing = you blame the database, the network, everything except the actual problem
  • Choosing the right durability strategy

    • Understanding trade-offs = you pick fsync frequency that matches your needs
    • Not understanding = you either lose data or kill performance

Understanding the write path is the difference between "I think the data is saved" and "I know the data is saved."


Quick Win: Check Your Write Durability

Before we dive deeper, let's see if your system is actually syncing data:

# See how much data is waiting to be written (dirty pages)
cat /proc/meminfo | grep -E "Dirty|Writeback"

# Watch dirty pages in real-time during writes
watch -n 1 'cat /proc/meminfo | grep Dirty'

# Check if a process is calling fsync()
strace -e fsync,fdatasync -p $(pgrep your-app | head -1)

What to look for:

  • High Dirty pages: Lots of data waiting to be written (potential data loss risk)
  • No fsync() calls: Your app might not be syncing (data loss on crash!)
  • Frequent fsync(): Good for durability, but might be slow

The Write Path: From App to Disk

When you call write(), your data takes a long journey:

APPLICATION                    KERNEL                         HARDWARE
┌─────────────────┐        ┌─────────────────┐            ┌─────────────────┐
│                 │        │                 │            │                 │
│  Application    │        │   Page Cache    │            │  Disk           │
│  Buffer         │        │   (RAM)         │            │  (Persistent)   │
│                 │        │                 │            │                 │
│  "Hello World"  │───────►│  "Hello World"  │───────────►│  "Hello World"  │
│                 │ write()│                 │   Later    │                 │
│                 │        │                 │  (maybe)   │                 │
└─────────────────┘        └─────────────────┘            └─────────────────┘
                                   │                              │
                                   │                              │
                            [ VULNERABLE ]                 [ DURABLE ]
                            Power loss here                Power loss here
                            = data LOST                    = data SAFE

The Four Stages of a Write

Stage                    Location              Speed        Durability
1. App buffer            Your process memory   Instant      Lost on crash
2. Page cache            Kernel memory         Fast (~μs)   Lost on power loss
3. Disk cache            Drive's own RAM       N/A          Lost on power loss*
4. Disk platters/cells   Physical storage      Slow (~ms)   Durable

*Enterprise SSDs with capacitors can flush their cache on power loss

Quick Jargon Buster

  • Page Cache: Kernel's memory buffer for file data (from Article 2!)
  • fsync(): System call that forces data from page cache to disk
  • fdatasync(): Like fsync() but only syncs data, not metadata (faster)
  • O_DIRECT: Flag to bypass page cache entirely (advanced, used by databases)
  • Dirty Pages: Modified pages in cache that haven't been written to disk yet
  • Write-Ahead Log (WAL): Write operations to log first, then apply (databases use this)
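
To make the fsync()/fdatasync() distinction concrete, here's a minimal Python sketch (the path and helper name are just illustrative): os.fsync() flushes both data and metadata, while os.fdatasync() skips metadata that isn't needed to read the data back (such as mtime), which can save an extra journal write on some filesystems.

import os

def append_record(path, record, data_only=True):
    # Open for appending; create the file if it doesn't exist yet.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, record)
        if data_only:
            os.fdatasync(fd)  # data is durable; metadata like mtime may lag
        else:
            os.fsync(fd)      # data AND metadata are durable
    finally:
        os.close(fd)

append_record('/tmp/app.log', b'event: user signed up\n')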

Visual: Write Modes Compared

BUFFERED WRITE (Default - DANGEROUS for important data):
═══════════════════════════════════════════════════════

   App ──► write() ──► Page Cache ──► [sometime later...] ──► Disk
                              │
                              └──► write() returns SUCCESS immediately!
   
   Timeline:
   ────────────────────────────────────────────────────────►
   │         │                                      │
   write()   Returns                               Actually on disk
   called    "success"                             (30 sec later??)
   
   ⚠️  Power loss between write() and disk sync = DATA LOST


WRITE + FSYNC (SAFE):
════════════════════

   App ──► write() ──► Page Cache ──► fsync() ──► Disk ──► Returns
                                                              │
                                                              │
   Timeline:                                                  │
   ────────────────────────────────────────────────────────►  │
   │         │                        │                       │
   write()   In page cache           fsync()                  Success!
   called                            waits for disk           Data is safe
   
   ✅ When fsync() returns, data is on physical storage

Common Mistakes (I've Made These)

Mistake #1: "Closing a file flushes it to disk"

Why it's wrong: close() flushes Python/C buffers to OS, not OS buffers to disk. I've lost data because of this assumption.

# WRONG - Not durable!
with open('file.txt', 'w') as f:
    f.write('important data')
# File closed, but data may still be in page cache!

# RIGHT - Durable
with open('file.txt', 'w') as f:
    f.write('important data')
    f.flush()
    os.fsync(f.fileno())

Mistake #2: "rename() is atomic, so I don't need fsync"

Why it's wrong: rename() is atomic for the operation itself, but both the file contents AND the directory entry need to be synced. I've seen this cause data loss.

# WRONG - rename without syncing
with open('file.tmp', 'w') as f:
    f.write(data)
os.rename('file.tmp', 'file.txt')  # Might lose everything!

# RIGHT - sync everything
with open('file.tmp', 'w') as f:
    f.write(data)
    f.flush()
    os.fsync(f.fileno())
os.rename('file.tmp', 'file.txt')
dir_fd = os.open('.', os.O_RDONLY)
os.fsync(dir_fd)  # Sync directory!
os.close(dir_fd)

Mistake #3: "My database handles durability, I don't need to worry"

Why it's wrong: Default settings vary. Many databases default to performance over durability. Always check your database's durability settings.

Check your settings:

-- PostgreSQL
SHOW fsync;           -- Should be 'on'
SHOW synchronous_commit;  -- 'on' for full durability

-- MySQL
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
-- 1 = safest, 2 = flush once per second, 0 = no flush

Trade-offs: The Durability-Performance Spectrum

The Fundamental Trade-off

┌─────────────────────────────────────────────────────────────────────┐
│                DURABILITY ◄───────────────────────► PERFORMANCE     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  fsync every       fsync every       fsync on         No fsync     │
│    write            second          close only        (buffered)   │
│     │                  │                │                │         │
│     ▼                  ▼                ▼                ▼         │
│  ┌───────┐        ┌───────┐        ┌───────┐        ┌───────┐     │
│  │~1,000 │        │~50,000│        │~200,000│       │~500,000│    │
│  │writes │        │writes │        │writes  │       │writes  │    │
│  │/sec   │        │/sec   │        │/sec    │       │/sec    │    │
│  └───────┘        └───────┘        └───────┘        └───────┘     │
│                                                                     │
│  Zero data loss   Lose ~1 sec      Lose data since   Lose all     │
│  on power loss    on power loss    last close        buffered data│
│                                                                     │
│  Use: Databases,  Use: Logs,       Use: Temp files,  Use: Caches, │
│  Financial txns   important data   build artifacts   scratch data │
└─────────────────────────────────────────────────────────────────────┘

Redis AOF Persistence: The Perfect Case Study

Redis demonstrates this trade-off perfectly with its appendfsync options:

┌─────────────────────────────────────────────────────────────────┐
│                 REDIS AOF PERSISTENCE OPTIONS                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  appendfsync always        appendfsync everysec    appendfsync no│
│  ─────────────────        ─────────────────────   ──────────────│
│                                                                 │
│  fsync() after EVERY      fsync() once per        fsync() never │
│  write command            second                  (OS decides)  │
│                                                                 │
│  Performance: ~1K ops/s   Performance: ~100K/s    Performance: Max│
│  Data loss: None          Data loss: ~1 second    Data loss: 30s+│
│                                                                 │
│  Use for:                 Use for:                Use for:      │
│  - Financial data         - Most use cases        - Pure cache  │
│  - Can't lose anything    - Good balance          - Replaceable │
└─────────────────────────────────────────────────────────────────┘

The numbers are shocking: On HDD storage, Redis with appendfsync always achieved only ~49 ops/s, compared to ~97,536 ops/s with everysec - a roughly 2,000x difference! Even on NVMe (Samsung 960 PRO), always mode yielded only ~449 ops/s.

Production recommendation: For most production environments, appendfsync everysec is the standard. It provides adequate durability (accepts up to 1 second of data loss) without the severe performance penalty.
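
If you run your own Redis instance, this trade-off is set in redis.conf. A minimal, illustrative snippet:

# redis.conf (illustrative values)
appendonly yes          # enable the AOF persistence log
appendfsync everysec    # fsync the AOF roughly once per second (the default)
# other options: always (fsync after every write), no (let the OS decide)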



Code Examples

Unsafe vs Safe Write

import os
import time

def unsafe_write(filename, data):
    """
    Data might be lost on power failure!
    write() returns when data is in page cache, NOT on disk.
    """
    with open(filename, 'w') as f:
        f.write(data)
    # File closed, but data might still be in kernel buffers!

def safe_write(filename, data):
    """
    Data survives power failure.
    """
    with open(filename, 'w') as f:
        f.write(data)
        f.flush()        # Flush Python's internal buffer to OS
        os.fsync(f.fileno())  # Flush OS buffer to disk
    
    # IMPORTANT: Also sync the directory for the file metadata!
    dir_fd = os.open(os.path.dirname(os.path.abspath(filename)), os.O_RDONLY)
    os.fsync(dir_fd)
    os.close(dir_fd)

The Atomic Write Pattern (Gold Standard)

import os
import tempfile

def atomic_write(filepath, data):
    """
    The safest write pattern:
    1. Write to temp file
    2. fsync temp file
    3. Rename (atomic on POSIX)
    4. fsync directory
    
    At no point is the file in a partially-written state!
    """
    directory = os.path.dirname(os.path.abspath(filepath))
    
    # Create temp file in same directory (important for rename!)
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    
    try:
        # Write data to temp file
        os.write(fd, data.encode() if isinstance(data, str) else data)
        
        # Ensure data is on disk
        os.fsync(fd)
        os.close(fd)
        
        # Atomic rename - either completes fully or not at all
        os.rename(tmp_path, filepath)
        
        # Sync directory to persist the rename
        dir_fd = os.open(directory, os.O_RDONLY)
        os.fsync(dir_fd)
        os.close(dir_fd)
        
    except:
        # Clean up on error; fd may already be closed by the happy path
        try:
            os.close(fd)
        except OSError:
            pass
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

# Usage
atomic_write('/tmp/important.txt', 'critical data')

# Even if power fails during this:
# - Old file is intact, OR
# - New file is complete
# NEVER a partially-written file!

Real-World Trade-off Stories

PostgreSQL's fsync() Disaster (2018) - "fsyncgate"

Situation: In 2018, the PostgreSQL community discovered a critical vulnerability where kernel I/O error handling could result in silent data loss. This became known as "fsyncgate."

What happened:

  • PostgreSQL assumes that a successful fsync() call means all data written since the last successful call has reached persistent storage
  • When buffered I/O writes fail due to hardware errors, the kernel discards the affected pages and marks them clean
  • When PostgreSQL retried fsync() after an initial failure, the retry succeeded because the kernel had already cleared the error flag—but the data was never actually written to disk
  • This was particularly problematic on XFS (which lacks error-remount behavior), network block devices, thin-provisioned storage, and multipath I/O

The fix: PostgreSQL was patched to PANIC on fsync() failures rather than retry, preventing silent data corruption. This fix was backported to PostgreSQL 11, 10, 9.6, 9.5, and 9.4. Linux kernel 4.13 also improved fsync() error handling with new writeback error reporting infrastructure.

Impact: Similar fixes were adopted by InnoDB/MySQL and MongoDB's WiredTiger. The incident highlighted fundamental issues with how operating systems and applications handle I/O errors that persist across POSIX specifications and multiple operating systems.


Lesson: Even "safe" operations can have edge cases. Defense in depth matters. Don't assume fsync() success means your data is safe—check for errors and handle them appropriately.
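
Here's a hedged sketch of that advice in application code, assuming a crash-and-recover design (the path and the abort strategy are illustrative): treat an fsync() failure as fatal for the data being written instead of retrying.

import os
import sys

def write_durably_or_die(path, payload):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)
        try:
            os.fsync(fd)
        except OSError as e:
            # Don't retry fsync(): after a failure the kernel may have
            # dropped the dirty pages and cleared the error, so a retry
            # can "succeed" without the data ever reaching disk.
            sys.stderr.write(f"fsync failed, aborting: {e}\n")
            os.abort()  # crash and recover from a known-good state
    finally:
        os.close(fd)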

MongoDB's Controversial Default (Data Loss Risk)

Situation: Early MongoDB defaulted to unacknowledged writes and weak read concerns. Multiple companies lost data during crashes because they didn't understand the defaults.

What users expected: "Write complete" = data is safe
What actually happened: "Write complete" = data is in memory (on primary only)

The problem:

  • For years the default write concern was { w: 1 } - the write is acknowledged once it reaches the primary's memory, without waiting for replication
  • If the primary crashes before replication completes, the data is rolled back and lost
  • Default read concern is local - provides no guarantee that read data won't be rolled back in the future
  • This creates a "fire and forget" scenario where data can "just vanish into thin air"

The solution: Set both write concern and read concern to "majority":

// Old dangerous default
db.users.insert({name: "Alice"})  // Returns before durable!

// New safer approach
db.users.insert({name: "Alice"}, {
    writeConcern: { w: "majority", j: true }
})
// w: "majority" = acknowledged by majority of nodes
// j: true = journaled (fsync'd)

Why aren't safer defaults used? MongoDB chose weaker defaults for performance reasons—failovers are infrequent enough that most deployments don't experience significant data loss in practice. However, businesses prioritizing safety should explicitly configure stronger consistency levels.


Lesson: Understand your database's durability guarantees. Don't assume. Always check default settings and configure them for your use case. Test with actual power loss if possible.
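
For Python services, the same fix looks roughly like this with the pymongo driver (connection string, database, and collection names are illustrative):

from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")

# Require acknowledgement from a majority of replica-set members (w)
# and journaling (j), so the write is in the on-disk journal before
# it is acknowledged.
users = client["app"]["users"].with_options(
    write_concern=WriteConcern(w="majority", j=True)
)
users.insert_one({"name": "Alice"})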

Write Amplification and SSD Wear

Situation: Applications that sync user state to disk very frequently can cause write amplification on SSDs, leading to premature wear.

The problem:

  • Each small write triggers a full 4KB page write (SSD write unit)
  • Many small writes = many 4KB writes = "write amplification"
  • SSDs must erase in block units, while host writes occur in smaller page units
  • When valid data density drops, the SSD controller performs garbage collection, creating additional writes
  • This impacts throughput, latency, and accelerates SSD wear

Solutions:

  • Batch small writes into larger ones before syncing
  • Use write-ahead log pattern - write to log, checkpoint periodically
  • Group commit - combine multiple writes into a single batch with one fsync (RocksDB does this)
  • Use O_DIRECT_NO_FSYNC - only fsync when filesystem metadata updates are necessary

For databases: When using innodb_flush_method = O_DIRECT, databases like InnoDB call fsync after each batch of writes rather than per-individual write, reducing fsync frequency.


Lesson: fsync() is expensive, and frequent small fsyncs can wear out SSDs. Batch operations when possible. Use group commit patterns for high-throughput systems.
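
Here's a minimal, illustrative sketch of the group-commit idea (a toy version, not how RocksDB actually implements it): buffer records in memory and pay for one write() and one fsync() per batch instead of one per record.

import os

class BatchedLog:
    """Toy group commit: buffer records, flush with a single fsync."""

    def __init__(self, path, batch_size=100):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.batch_size = batch_size
        self.pending = []

    def append(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        os.write(self.fd, b"".join(self.pending))  # one large write
        os.fsync(self.fd)                          # one fsync per batch
        self.pending.clear()

    def close(self):
        self.flush()
        os.close(self.fd)

# Usage: roughly one fsync per 100 records instead of 100 fsyncs
log = BatchedLog('/tmp/toy.wal')
for i in range(1000):
    log.append(f"event {i}\n".encode())
log.close()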


Common Confusions (Cleared Up)

"write() returns, so data is on disk!"

Reality: write() returns when data is in the page cache (kernel memory), not on disk. It might be written later, or it might be lost on power failure.

"close() flushes data to disk!"

Reality: close() flushes your application's buffer to the OS page cache, but doesn't sync to disk. You still need fsync() for durability.

"rename() is atomic, so I'm safe!"

Reality: rename() is atomic for the operation, but you still need to:

  1. fsync() the file contents before rename
  2. fsync() the directory after rename

Otherwise, the rename might complete but the file contents could be lost.

"My database handles this, I don't need to worry!"

Reality: Database defaults vary. Many default to performance over durability. Always check your database's durability settings and test with actual power loss.


Debugging I/O Issues

See Dirty Pages (Unflushed Data)

# How much data is waiting to be written to disk?
cat /proc/meminfo | grep -E "Dirty|Writeback"

Dirty:           12345 kB   # Data modified, not yet written
Writeback:           0 kB   # Data currently being written

# High Dirty = lots of buffered writes
# Watch it during your application's writes
watch -n 1 'cat /proc/meminfo | grep Dirty'

Monitor Disk I/O

# Real-time I/O statistics
iostat -x 1

Device   r/s     w/s     await  %util
sda      100     500     2.5    45%

# Key metrics:
# w/s = writes per second
# await = average wait time (ms) - should be < 10ms for SSD
# %util = how busy the disk is - 100% = saturated

Trace System Calls

# See what a process is writing
strace -e write,fsync,fdatasync -p <PID>

# Example output:
write(3, "data here...", 100) = 100
fsync(3) = 0   # <-- This is where durability happens!

Decision Framework

□ Is this data replaceable?
  → Yes: Buffered writes are fine
  → No: Need fsync

□ What's the acceptable data loss window?
  → 0 seconds: fsync every write
  → 1 second: fsync every second (batch)
  → 30 seconds: OS default
  → Don't care: Pure buffer

□ What's my write throughput requirement?
  → < 1,000/sec: fsync every write is fine
  → 1,000-100,000/sec: Batch writes, fsync periodically
  → > 100,000/sec: Consider async durability, accept some loss

□ Is this a database?
  → Always enable WAL/journaling
  → Check your durability settings
  → Test with actual power-off!

Performance Numbers to Know

Operation                  Typical Latency   Notes
write() to page cache      ~1 μs             Just memory copy
fsync() to SSD             ~100-500 μs       Actual I/O
fsync() to HDD             ~5-15 ms          Mechanical seek
O_DIRECT write to SSD      ~50-100 μs        Bypass cache

Rule of thumb: fsync() is 100-1000x slower than buffered write.
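
You can sanity-check these orders of magnitude yourself with a rough micro-benchmark. This is just a sketch, and the numbers will vary with your hardware, filesystem, and mount options:

import os
import time

def time_writes(path, n=1000, do_fsync=False):
    # Average microseconds per 256-byte write, with or without fsync.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.perf_counter()
    for _ in range(n):
        os.write(fd, b"x" * 256)
        if do_fsync:
            os.fsync(fd)
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.unlink(path)
    return elapsed / n * 1e6

print(f"buffered:      {time_writes('/tmp/bench.dat'):8.1f} us/write")
print(f"write + fsync: {time_writes('/tmp/bench.dat', do_fsync=True):8.1f} us/write")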


Memory Trick

"WASD" for write durability:

  • Write: Puts data in kernel buffer (not safe!)
  • All the way: fsync() gets it to disk
  • Sync directory: Don't forget metadata
  • Double-check: Verify with your specific hardware

Self-Assessment

Before moving on:

  • [ ] Can you explain what happens between write() returning and data being on disk?
  • [ ] Know when O_DIRECT is beneficial vs harmful?
  • [ ] Could you implement an atomic file write that survives crashes?
  • [ ] Understand why rename() alone isn't enough for durability?
  • [ ] Know your database's default durability settings?
  • [ ] Understand why close() doesn't guarantee durability?

Key Takeaways

  1. write() lies: Returns success when data is in kernel buffer, not on disk
  2. fsync() is truth: Only guarantee of durability, but expensive (100-1000x slower)
  3. Atomic writes: temp file + fsync + rename + fsync directory
  4. Batch for performance: fsync once per second, not per write (Redis everysec pattern)
  5. Verify your database: Check durability settings, test with actual power loss
  6. Even fsync() can fail: PostgreSQL's 2018 bug showed that even "safe" operations have edge cases

What's Next

Now that you understand how data flows from memory to disk, the next question is: How does the OS actually schedule all these operations?

In the next article, CPU Scheduling & Context Switches - Throughput vs Latency, you'll learn:

  • How the OS decides which process runs when
  • Why context switches matter for performance
  • The trade-offs between throughput and latency
  • How to measure and optimize scheduling overhead

This connects directly to what you learned here—when fsync() blocks waiting for disk I/O, the OS switches to another process. Understanding scheduling helps you understand why your app might be slow even when CPU isn't busy.

Continue to Article 4: CPU Scheduling & Context Switches


This article is part of the Backend Engineering Mastery series. Understanding I/O is fundamental to building reliable systems. You learned about memory management in Article 2, and now you understand how data flows from that memory to disk.