What is a distributed systems engineer?

The longer answer

Distributed-systems engineering sits at the intersection of computer-science theory and operational pragmatism. The theory matters because the failure modes of distributed systems are non-obvious: a system that works on one machine, once spread across ten, fails in ways that cannot occur on a single machine (partial failures, network partitions, delayed or reordered messages). The operational side matters because the theory only tells you which tradeoffs exist, not which ones to make for your specific workload.

Where distributed-systems engineers earn their billing rate

Three categories of work.

Choosing the right primitives: when to use Kafka vs RabbitMQ vs NATS for message queueing, when to use etcd vs Consul vs ZooKeeper for coordination, when Postgres replication is enough vs when you need a real distributed database (CockroachDB, YugabyteDB, Spanner). Most teams pick wrong because they're evaluating against the wrong criteria.

Designing for the failure modes that matter: knowing what happens when one node fails, what happens when the network partitions, and what happens when a node comes back after being unreachable for 30 minutes, then choosing consistency, availability, and partition-tolerance tradeoffs that fit the actual business requirements.

Operating the result: observability across N machines is fundamentally different from observability on one. It means distributed tracing, structured logging across services, and debugging across process boundaries.

When you need a distributed-systems engineer

Three honest tests. First, is your traffic actually too large for one machine? Modern hardware handles tens of thousands of requests per second on a single beefy box; many teams adopt distributed architectures before their traffic justifies it and pay a substantial engineering tax. Second, do you actually need cross-region or cross-data-center availability? If a single-region outage is tolerable, you don't need multi-region distributed coordination. Third, does your data model genuinely require distributed-database semantics? Most business applications can run on a single Postgres instance for years; the cases where they can't are usually high-volume real-time analytics, multi-region regulatory data residency, or specific financial and trading workloads.
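The three tests can be written down as a small checklist. The sketch below is illustrative only: the Workload fields, the function name, and the 50% headroom factor are assumptions made for this example, not figures from any benchmark. Measure the capacity of your own single machine before trusting any number here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    """Hypothetical description of a workload; all fields are illustrative."""
    peak_requests_per_sec: float         # measured peak traffic
    single_node_capacity_rps: float      # measured capacity of one well-provisioned machine
    region_outage_tolerable: bool        # can the business survive a single-region outage?
    needs_distributed_db_semantics: bool # e.g. data residency or multi-region writes


def reasons_to_distribute(w: Workload, headroom: float = 0.5) -> List[str]:
    """Return the tests this workload fails; an empty list means one machine may be enough."""
    reasons = []
    # Test 1: does peak traffic exceed what one machine serves with headroom to spare?
    if w.peak_requests_per_sec > w.single_node_capacity_rps * headroom:
        reasons.append("traffic exceeds single-machine capacity")
    # Test 2: is cross-region availability actually required?
    if not w.region_outage_tolerable:
        reasons.append("single-region outage is not tolerable")
    # Test 3: does the data model genuinely need distributed-database semantics?
    if w.needs_distributed_db_semantics:
        reasons.append("data model requires distributed-database semantics")
    return reasons
```

An empty result is the common case: a workload peaking at 2,000 requests per second on a machine measured at 20,000, with a tolerable regional outage and no special data requirements, fails none of the tests.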

The most common mistake

Adopting microservices because Netflix did. Netflix serves hundreds of millions of users with a multi-thousand-engineer organization; their architectural choices solve their problems, not yours. Most mid-market businesses are better served by a well-modularized monolith than by microservices. A senior distributed-systems engineer will tell you that before designing your distributed architecture.

Common follow-up questions

When do I need a distributed system?

When a single machine genuinely can't handle your traffic / data volume, when you need cross-region availability, or when your data model has specific distributed-database requirements (financial transactions across regions, regulatory data residency). Most teams adopt distributed architectures before they need them; the engineering tax is real.

Kafka vs RabbitMQ vs NATS — which queue?

Kafka for high-throughput event streaming and event-sourcing workloads where you need replayable logs. RabbitMQ for traditional message-queue workloads with complex routing. NATS for low-latency request/reply or pub/sub at the edge. Most business applications start with RabbitMQ or NATS and only move to Kafka when the use case actually demands it.
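As a rough sketch, the guidance above can be encoded as a simple decision function. The function name, the parameter names, and the order in which the requirements are checked are assumptions made for illustration; real selection also depends on operational factors this sketch ignores.

```python
def suggest_broker(
    needs_replayable_log: bool,
    needs_complex_routing: bool,
    needs_low_latency_request_reply: bool,
) -> str:
    """Map coarse requirements to a starting point, following the guidance in the text."""
    # Replayable logs (event streaming, event sourcing) are the case that points to Kafka.
    if needs_replayable_log:
        return "Kafka"
    # Traditional queueing with complex routing points to RabbitMQ.
    if needs_complex_routing:
        return "RabbitMQ"
    # Low-latency request/reply or pub/sub at the edge points to NATS.
    if needs_low_latency_request_reply:
        return "NATS"
    # No special requirement: start with one of the simpler options.
    return "RabbitMQ or NATS"
```

Checking the replay requirement first is a deliberate choice in this sketch: of the three requirements, it is the one the text treats as the reason to move to Kafka.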

Should I use a single Postgres or a distributed database?

Default to a single Postgres for as long as it scales. It scales further than most teams think: sustained workloads of 10,000+ writes/sec on commodity hardware are achievable with reasonable tuning. Move to a distributed database (CockroachDB, YugabyteDB, Spanner) only when you hit specific limits Postgres can't address: true multi-region writes, horizontal scale beyond what vertical scale handles, or specific regulatory requirements.
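A back-of-envelope calculation helps put the write-rate figure in perspective. The sketch below converts a daily write volume into a peak writes/sec estimate and compares it with the 10,000 writes/sec figure from the text. The function names, the peak-to-average ratio of 5, and the example volume of 50 million writes per day are assumptions chosen for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def peak_writes_per_sec(writes_per_day: float, peak_to_average: float) -> float:
    """Estimate peak write rate from daily volume and an assumed peak-to-average ratio."""
    return writes_per_day / SECONDS_PER_DAY * peak_to_average


def fits_single_postgres(
    writes_per_day: float,
    peak_to_average: float = 5.0,   # assumed burstiness; measure your own traffic
    ceiling_wps: float = 10_000.0,  # the sustained figure quoted in the text
) -> bool:
    """True if the estimated peak stays at or below the assumed single-node ceiling."""
    return peak_writes_per_sec(writes_per_day, peak_to_average) <= ceiling_wps


# 50 million writes/day is about 579 writes/sec on average,
# or about 2,894 writes/sec at an assumed 5x peak.
estimated_peak = peak_writes_per_sec(50_000_000, 5.0)
```

Under these assumptions, 50 million writes per day stays well under the ceiling, while a workload ten times larger (roughly 28,900 writes/sec at peak) would not.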