Backend Engineering Mastery - Complete Article Series
A comprehensive guide for backend engineers, engineering managers, and principal engineers
27 Core Articles | ~50,000 words | 8+ hours of reading
Plus: System Design Mastery Series (4 articles)
Series Philosophy
After 10 years of building and breaking systems in production, I've learned one thing: the best engineers don't memorize facts—they understand trade-offs.
This series isn't about theory. It's about the decisions you'll face at 2 AM when your system is on fire. Each article includes:
- The One Thing to Remember - One insight that changes how you think
- Why This Matters - Real production incidents I've seen (and caused)
- Visual Models - ASCII diagrams you can draw on a whiteboard
- Trade-off Analysis - Decision frameworks, not just definitions
- Code Examples - Runnable snippets that demonstrate the concept
- Real-World Stories - War stories from companies you know
- Self-Assessment - Verify that you actually understand the material, not just that you memorized it
Quick Navigation
Jump to a section:
- Part 1: OS & Systems Foundation (4 articles)
- Part 2: Networking (3 articles)
- Part 3: Storage & Databases (4 articles)
- Part 4: Distributed Systems (5 articles)
- Part 5: Production Engineering (4 articles)
- Part 6: Cloud-Native & Modern Patterns (4 articles)
- Part 7: Engineering Leadership (3 articles)
- System Design Mastery Series (4 articles)
Part 1: OS & Systems Foundation (Articles 1-4)
Understanding the operating system layer that everything runs on. Skip this at your own peril—I've seen too many engineers blame the database when the real problem was in the OS.
| # | Article | Key Trade-off | Time | Link |
|---|---|---|---|---|
| 01 | Process vs Thread | Isolation vs Efficiency | 10 min | Read Article → |
| 02 | Memory Management | Virtual vs Physical, Swap vs OOM | 10 min | Read Article → |
| 03 | File I/O & Durability | Performance vs Durability (fsync) | 10 min | Read Article → |
| 04 | CPU Scheduling & Context Switches | Throughput vs Latency | 10 min | Read Article → |
After Part 1, you'll understand: Why processes are isolated, how memory really works (not what you think), what fsync actually does, and why context switches can kill your performance.
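To make the fsync point concrete, here is a minimal sketch of a durable write in Python. The function name `durable_write`, the file name, and the `demo` helper are made up for illustration and are not taken from Article 03. The sketch assumes a POSIX-like OS, and even after `os.fsync` returns, a drive with its own volatile write cache may still hold the data in that cache.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to stable storage before returning."""
    with open(path, "wb") as f:
        f.write(data)         # data sits in Python's userspace buffer
        f.flush()             # data moves into the OS page cache
        os.fsync(f.fileno())  # ask the kernel to flush the page cache to the device


def demo() -> bytes:
    """Hypothetical helper: write a record durably, then read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "record.bin")
        durable_write(path, b"committed")
        with open(path, "rb") as f:
            return f.read()
```

Without the `os.fsync` call the write usually returns faster, but a crash or power loss can drop data that was still in the page cache. That is the performance vs durability trade-off in one line.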
Next: Part 2: Networking →
Part 2: Networking (Articles 5-7)
How data moves between machines—the foundation of distributed systems. This is where most engineers get confused, and I don't blame them.
| # | Article | Key Trade-off | Time | Link |
|---|---|---|---|---|
| 05 | TCP Deep Dive | Reliability vs Latency | 12 min | Read Article → |
| 06 | HTTP Evolution (1.1→2→3) | Simplicity vs Performance | 11 min | Read Article → |
| 07 | Load Balancing (L4 vs L7) | Speed vs Features | 11 min | Read Article → |
After Part 2, you'll understand: TCP states and how to debug connection issues, why HTTP/3 uses UDP (it's not what you think), and when to use L4 vs L7 load balancers.
Next: Part 3: Storage & Databases →
Part 3: Storage & Databases (Articles 8-11)
How data is stored, indexed, and queried efficiently. I've lost count of how many "slow query" issues were actually index problems.
| # | Article | Key Trade-off | Time | Link |
|---|---|---|---|---|
| 08 | Database Indexes Deep Dive | Read Speed vs Write Speed | 11 min | Read Article → |
| 09 | ACID Transactions Explained | Consistency vs Performance | 10 min | Read Article → |
| 10 | Isolation Levels & Anomalies | Safety vs Concurrency | 10 min | Read Article → |
| 11 | SQL vs NoSQL Decision Guide | Flexibility vs Scale | 10 min | Read Article → |
After Part 3, you'll understand: Why indexes slow writes (and when that's okay), what SERIALIZABLE actually means (hint: it's not what most people think), and when to choose NoSQL (it's rarer than you think).
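The index trade-off can be seen with nothing more than the `sqlite3` module from the standard library. This is a rough sketch, not the example from Article 08: the `users` table, the `idx_users_email` index, and both function names are assumptions chosen for illustration.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's plan for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # the fourth column holds the plan detail


def demo() -> tuple:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO users (email, name) VALUES (?, ?)",
        [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
    )
    lookup = "SELECT name FROM users WHERE email = ?"
    arg = ("user500@example.com",)

    before = query_plan(conn, lookup, arg)  # no index yet: full table scan
    # From here on, every INSERT or UPDATE of email must also maintain this index.
    conn.execute("CREATE INDEX idx_users_email ON users (email)")
    after = query_plan(conn, lookup, arg)   # the plan now names the index
    conn.close()
    return before, after
```

Printing `before` and `after` shows the plan switching from a scan to an index search. The read gets cheaper, and each write now pays for one extra B-tree update per index.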
Next: Part 4: Distributed Systems →
Part 4: Distributed Systems (Articles 12-16)
Scaling beyond a single machine—where things get interesting. This is where I've made the most mistakes, and learned the most.
| # | Article | Key Trade-off | Time | Link |
|---|---|---|---|---|
| 12 | CAP Theorem Demystified | Consistency vs Availability | 10 min | Read Article → |
| 13 | Sharding Strategies | Query Flexibility vs Scale | 11 min | Read Article → |
| 14 | Replication Patterns | Consistency vs Latency | 10 min | Read Article → |
| 15 | Consensus & Raft | Availability vs Strong Consistency | 12 min | Read Article → |
| 16 | Time, Clocks & Ordering | Simplicity vs Accuracy | 10 min | Read Article → |
After Part 4, you'll understand: What CAP really means (most people get it wrong), how to choose shard keys (this decision will haunt you), leader election, and why distributed time is harder than it should be.
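One reason distributed time is hard is that wall clocks on different machines disagree, so ordering is often tracked with logical clocks instead. The sketch below is a Lamport clock, one classic technique, and I'm not claiming it is the one Article 16 uses. The class and method names are assumptions for illustration.

```python
class LamportClock:
    """Logical clock: counts events instead of reading the wall clock."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Record a local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Jump past the sender's timestamp so the receive orders after the send."""
        self.time = max(self.time, message_time) + 1
        return self.time


def demo() -> tuple:
    a, b = LamportClock(), LamportClock()
    a.tick()                     # local event on node A
    sent = a.send()              # A sends a message to B
    received = b.receive(sent)   # B's clock moves past A's timestamp
    return sent, received
```

The guarantee is one-directional: if one event causally precedes another, its timestamp is smaller. A smaller timestamp alone does not prove causality, which is where the simplicity vs accuracy trade-off shows up.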
Next: Part 5: Production Engineering →
Part 5: Production Engineering (Articles 17-20)
Running systems reliably in production. This is where theory meets reality, and reality usually wins.
| # | Article | Key Trade-off | Time | Link |
|---|---|---|---|---|
| 17 | Reliability Patterns | Availability vs Complexity | 11 min | Read Article → |
| 18 | Caching Strategies | Performance vs Consistency | 10 min | Read Article → |
| 19 | Observability (Metrics, Logs, Traces) | Coverage vs Overhead | 11 min | Read Article → |
| 20 | Security Fundamentals | Security vs Convenience | 10 min | Read Article → |
After Part 5, you'll understand: Circuit breakers (and when they backfire), cache invalidation (one of the two hard things in CS), the RED/USE methods, and OAuth2 flows (without the confusion).
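For a feel of what a circuit breaker does, here is a deliberately small, single-threaded sketch. It is not the implementation from Article 17. The names `CircuitBreaker`, `CircuitOpenError`, `max_failures`, and `reset_after` are assumptions, and the clock is injectable only so the behavior is easy to test.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a flaky dependency as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is a hypothetical function) makes callers fail fast while the dependency is down. The backfire case is visible here too: thresholds that are too tight will open the circuit on ordinary noise and cause an outage of their own.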
Next: Part 6: Cloud-Native & Modern Patterns →
Part 6: Cloud-Native & Modern Patterns (Articles 21-24)
Building for the cloud era. Containers, orchestration, and message queues—the tools that make distributed systems manageable.
| # | Article | Key Trade-off | Time | Link |
|---|---|---|---|---|
| 21 | Containers & Docker | Isolation vs Overhead | 10 min | Read Article → |
| 22 | Kubernetes Essentials | Abstraction vs Complexity | 12 min | Read Article → |
| 23 | Message Queues (Kafka vs RabbitMQ) | Throughput vs Latency | 10 min | Read Article → |
| 24 | Event-Driven Architecture | Decoupling vs Complexity | 11 min | Read Article → |
After Part 6, you'll understand: Container best practices (and anti-patterns), K8s core concepts (without the marketing), when to use Kafka vs RabbitMQ (they're not interchangeable), and event-driven architecture (when it helps, when it hurts).
Next: Part 7: Engineering Leadership →
Part 7: Engineering Leadership (Articles 25-27)
Skills for senior engineers, managers, and principal engineers. This is what separates good engineers from great ones.
| # | Article | Audience | Time | Link |
|---|---|---|---|---|
| 25 | Architecture Decision Records | Senior+ | 12 min | Read Article → |
| 26 | Technical Debt Strategy | Manager/Principal | 11 min | Read Article → |
| 27 | Build vs Buy Decisions | Principal/Director | 10 min | Read Article → |
After Part 7, you'll know: How to document decisions (so future you thanks past you), manage tech debt strategically (not reactively), and make build vs buy choices (without regret).
System Design Mastery Series (Separate Series)
Applying everything you've learned to real design problems. This is a separate series because system design deserves its own deep dive.
Note: The System Design Mastery series builds on the Backend Engineering Mastery series. I recommend completing Parts 1-4 before diving into system design.
| # | Article | Focus | Time | Link |
|---|---|---|---|---|
| SD-01 | System Design Framework | 5-step approach for any problem | 13 min | Read Article → |
| SD-02 | Design: URL Shortener | Simple, scalable system | 10 min | Read Article → |
| SD-03 | Design: Distributed Cache | High-performance caching | 11 min | Read Article → |
| SD-04 | Design: Real-Time Chat | WebSockets, ordering, fan-out | 12 min | Read Article → |
After the System Design series, you'll have: A repeatable framework for any system design problem, practice with common patterns, and the confidence to design systems that scale.
→ View System Design Mastery Series Index
Reading Paths by Role
Junior Engineer (0-2 years)
Focus: Foundations first. Don't skip the basics—I've seen too many engineers try to learn distributed systems without understanding processes and threads.
Week 1-2: OS Foundation (CRITICAL)
├── 01. Process vs Thread
├── 02. Memory Management
├── 03. File I/O & Durability
└── 04. CPU Scheduling
Week 3-4: Networking
├── 05. TCP Deep Dive
├── 06. HTTP Evolution
└── 07. Load Balancing
Week 5-6: Database Fundamentals
├── 08. Database Indexes
├── 09. ACID Transactions
└── 10. Isolation Levels
Week 7-8: Production Patterns
├── 17. Reliability Patterns
├── 18. Caching Strategies
└── 19. Observability
Mid-Level Engineer (2-5 years)
Focus: Distributed systems and system design. This is where you level up.
Week 1: Foundation Review (skim if familiar)
├── 01-04 (OS & Systems)
└── 05-07 (Networking)
Week 2-3: Distributed Systems (MUST DO)
├── 12. CAP Theorem
├── 13. Sharding Strategies
├── 14. Replication Patterns
├── 15. Consensus & Raft
└── 16. Time & Ordering
Week 4: System Design Practice
├── SD-01. System Design Framework
├── SD-02. URL Shortener
├── SD-03. Distributed Cache
└── SD-04. Chat System
Week 5: Production & Cloud
├── 17-20 (Production Engineering)
└── 21-24 (Cloud-Native)
Senior Engineer (5+ years)
Focus: Depth, leadership, and system design mastery. You know the basics—now master the trade-offs.
Week 1: Distributed Systems Mastery
├── 12-16 (all distributed systems)
└── Focus on trade-off analysis
Week 2: System Design Excellence
├── SD-01 to SD-04 (all system design)
└── Practice explaining out loud
Week 3: Leadership Skills
├── 25. Architecture Decision Records
├── 26. Technical Debt Strategy
└── 27. Build vs Buy Decisions
Engineering Manager
Focus: Leadership articles + enough technical depth to guide teams. You don't need to code, but you need to understand the decisions.
Priority 1: Leadership Track
├── 25. Architecture Decision Records
├── 26. Technical Debt Strategy
└── 27. Build vs Buy Decisions
Priority 2: Key Technical Concepts
├── 12. CAP Theorem (for data decisions)
├── 17. Reliability Patterns (for SRE work)
└── SD-01. System Design Framework (for reviews)
Quick Reference: All Trade-offs
| Topic | Trade-off |
|---|---|
| Process vs Thread | Isolation vs Efficiency |
| Virtual Memory | Flexibility vs Page Fault Cost |
| fsync() | Durability vs Performance |
| Context Switches | Throughput vs Latency |
| TCP | Reliability vs Latency |
| HTTP versions | Simplicity vs Performance |
| L4 vs L7 LB | Speed vs Features |
| Indexes | Read Speed vs Write Speed |
| ACID | Consistency vs Performance |
| Isolation Levels | Safety vs Concurrency |
| SQL vs NoSQL | Flexibility vs Scale |
| CAP | Consistency vs Availability |
| Sharding | Query Flexibility vs Scale |
| Replication | Consistency vs Latency |
| Consensus | Availability vs Strong Consistency |
| Time/Clocks | Simplicity vs Accuracy |
| Circuit Breaker | Availability vs Complexity |
| Caching | Performance vs Consistency |
| Observability | Coverage vs Overhead |
| Security | Security vs Convenience |
| Containers | Isolation vs Overhead |
| Kubernetes | Abstraction vs Complexity |
| Kafka vs RabbitMQ | Throughput vs Latency |
| Event-Driven | Decoupling vs Complexity |
| Build vs Buy | Control vs Speed |
How to Use This Series
For Self-Study
- Read one article per day (or per sitting—don't rush)
- Run all "Try It Yourself" commands (actually do them, don't just read)
- Complete self-assessment checkboxes (be honest with yourself)
- Revisit after one week to reinforce (spaced repetition works)
- Teach concepts to someone else (best way to learn)
For Interview Prep
- Focus on System Design series (SD-01 to SD-04)
- Learn the trade-off tables in each article well enough to reason from them (interviewers love these)
- Practice drawing diagrams from memory (whiteboard skills matter)
- Explain concepts out loud (rubber duck method)
- Do 2-3 mock system design sessions (get feedback)
For Team Education
- Use as reading group material (1 article/week)
- Discuss trade-offs as a team (apply to your systems)
- Create team-specific examples (make it relevant)
- Build team ADR practice (Article 25)
Files Reference
| Article # | File Name | URL Slug |
|---|---|---|
| 01 | 01-process-vs-thread.md | process-vs-thread-the-foundation-every-backend-engineer |
| 02 | 02-memory-management.md | memory-management-demystified-virtual-memory-page-faults-performance |
| 03 | 03-file-io-durability.md | file-io-durability-why-fsync-is-your-best-friend-and-worst-enemy |
| 04 | 04-cpu-scheduling.md | cpu-scheduling-context-switches-throughput-vs-latency |
| 05 | 05-tcp-deep-dive.md | tcp-deep-dive-reliability-vs-latency |
| 06 | 06-http-evolution.md | http-evolution-1-1-2-3-simplicity-vs-performance |
| 07 | 07-load-balancing.md | load-balancing-l4-vs-l7-speed-vs-features |
| 08 | 08-database-indexes.md | database-indexes-deep-dive-read-speed-vs-write-speed |
| 09 | 09-acid-transactions.md | acid-transactions-explained-consistency-vs-performance |
| 10 | 10-isolation-levels.md | isolation-levels-anomalies-safety-vs-concurrency |
| 11 | 11-sql-vs-nosql.md | sql-vs-nosql-decision-guide-flexibility-vs-scale |
| 12 | 12-cap-theorem.md | cap-theorem-demystified-consistency-vs-availability |
| 13 | 13-sharding-strategies.md | sharding-strategies-query-flexibility-vs-scale |
| 14 | 14-replication-patterns.md | replication-patterns-consistency-vs-latency |
| 15 | 15-consensus-raft.md | consensus-raft-availability-vs-strong-consistency |
| 16 | 16-time-clocks-ordering.md | time-clocks-ordering-simplicity-vs-accuracy |
| 17 | 17-reliability-patterns.md | reliability-patterns-availability-vs-complexity |
| 18 | 18-caching-strategies.md | caching-strategies-performance-vs-consistency |
| 19 | 19-observability.md | observability-metrics-logs-traces-coverage-vs-overhead |
| 20 | 20-security-fundamentals.md | security-fundamentals-security-vs-convenience |
| 21 | 21-containers-docker.md | containers-docker-isolation-vs-overhead |
| 22 | 22-kubernetes-essentials.md | kubernetes-essentials-abstraction-vs-complexity |
| 23 | 23-message-queues.md | message-queues-kafka-vs-rabbitmq-throughput-vs-latency |
| 24 | 24-event-driven-architecture.md | event-driven-architecture-decoupling-vs-complexity |
| 25 | 25-architecture-decision-records.md | architecture-decision-records-making-decisions-legible |
| 26 | 26-technical-debt-strategy.md | technical-debt-strategy-intentional-debt-not-accidental |
| 27 | 27-build-vs-buy.md | build-vs-buy-decisions-control-vs-speed |
| SD-01 | ../system-design/01-system-design-framework.md | system-design-framework-5-step-approach-any-problem |
| SD-02 | ../system-design/02-design-url-shortener.md | design-url-shortener-simple-scalable-system |
| SD-03 | ../system-design/03-design-distributed-cache.md | design-distributed-cache-high-performance-caching |
| SD-04 | ../system-design/04-design-chat-system.md | design-real-time-chat-websockets-ordering-fan-out |
Contributing
Found an error? Have a better example? This series is continuously improved based on feedback from engineers who use it in production.
Congratulations on exploring the Backend Engineering Mastery series! It runs from OS fundamentals to engineering leadership. Bookmark it, share it with your team, and return to it throughout your career.
Remember: understanding trade-offs beats memorizing facts every time.