When Things Break: Reliability and Failure Handling in Distributed Systems

Let me tell you something nobody tells you when you're learning backend development: most of the code you write handles the case where everything goes right. The happy path. Clean inputs, responsive services, databases that respond in milliseconds. You build something, it works in staging, you ship it, and for a while, it's fine.

Then production happens.

A third-party payment gateway starts responding in 8 seconds instead of 200ms. An internal service starts dropping 5% of requests. A database connection pool exhausts under load. Suddenly your "working system" is a cascading pile of timeouts, angry users, and Slack pings at 2 AM.

A system that works in ideal conditions is not a production system.

I've been building distributed systems long enough to know that the interesting engineering isn't in the happy path. It's in what happens when things go wrong — because in production, they always do. This post is about what I've learned building systems like dispatchGO, flowpay, and holdUp: how failure happens, why it spreads, and the specific patterns that stop it from taking everything down.

The Mental Shift That Changes Everything

Early in your career, you think of failures as bugs. Something went wrong, you find it, you fix it. Failures are aberrations — things that shouldn't happen.

That framing breaks entirely once you're running services at any meaningful scale.

Network packets get dropped. DNS lookups fail intermittently. Third-party APIs go down, slow down, or return garbage. Databases get overloaded. Disks fill up. A noisy neighbor on shared infrastructure chews through your CPU quota. All of this is normal. Not in the sense that it's acceptable, but in the sense that it is expected and unavoidable.

The mental shift is this: reliability is not about preventing failure, but about designing systems that survive it.

Once you internalize that, your entire approach to system design changes. You stop asking "what happens when this works?" and start asking "what happens when this breaks?" You plan for partial failures. You assume every external call might fail. You design your systems so that a failure in one place doesn't automatically become a failure everywhere.

This isn't pessimism. It's engineering.

Retry (With Exponential Backoff)

The most obvious response to a failing call is to try again. If a service didn't respond, maybe it'll respond if you ask once more. This intuition is correct — transient failures are real, and retrying them works.

But naive retry is dangerous. Here's why.

Imagine your service is under load and starts failing 10% of requests. If every caller retries immediately, three, four, five times, each failed request turns into several more, and they all land on a service that is already at capacity. More requests fail, which triggers more retries. Once most requests are failing and each one is retried a few times, a service that was struggling at 100% capacity is seeing 300% or more of its normal traffic. You've turned a degraded service into a dead one. This is called a retry storm, and it's one of the most common ways engineers accidentally take down a service they're trying to recover.

The solution is exponential backoff with jitter.

Exponential backoff means you wait longer between each retry attempt — first 100ms, then 200ms, then 400ms, then 800ms. Each failure doubles the wait. The idea is simple: if the service is struggling, give it time to breathe before you try again.

Jitter is the part people forget. If every client in your system is retrying with the same backoff schedule, they'll all retry at the same time. Synchronized retries still create spikes. Jitter adds randomness — instead of waiting exactly 400ms, you wait somewhere between 300ms and 500ms. It distributes the load across time.
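Here's a minimal sketch of that schedule in Go. The constants (100ms base, 5s cap, 25% jitter) and the function names are assumptions picked to match the numbers above, not values from any real service.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Illustrative values only; tune them for your own dependency.
const (
	baseDelay  = 100 * time.Millisecond // wait before the first retry
	maxDelay   = 5 * time.Second        // cap so the wait never grows unbounded
	jitterFrac = 0.25                   // spread each wait by +/-25%
)

// backoff returns the un-jittered wait before retry number attempt (0-based):
// 100ms, 200ms, 400ms, 800ms, ... capped at maxDelay.
func backoff(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// withJitter picks a random wait within +/-jitterFrac of d,
// so a 400ms wait becomes something between 300ms and 500ms.
func withJitter(d time.Duration) time.Duration {
	spread := float64(d) * jitterFrac
	offset := (rand.Float64()*2 - 1) * spread // uniform in [-spread, +spread)
	return d + time.Duration(offset)
}

func main() {
	for attempt := 0; attempt < 4; attempt++ {
		d := backoff(attempt)
		fmt.Printf("retry %d: base %v, jittered %v\n", attempt+1, d, withJitter(d))
	}
}
```

Some teams prefer "full jitter", where the wait is a random value between zero and the backoff. The narrower band here is just the variant that matches the 300ms to 500ms example.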

In dispatchGO, when the service calls an external API and that call fails, we don't retry immediately. We back off, add jitter, and try again. The system stays stable even when the external service is flapping.

In flowpay, retrying failed payment processor calls is essential — network hiccups between services are common, and a dropped connection shouldn't mean a failed transaction. But those retries need to be controlled and spaced out.

One thing worth saying clearly: not every failure should be retried. A 400 Bad Request means you sent something wrong — retrying the same bad request will just fail again. A 503 Service Unavailable is worth retrying. Knowing the difference matters. Retries are for transient failures, not for logic errors.
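In code, that distinction can be a small classifier the retry loop consults before trying again. This is a sketch: only 400 and 503 come from the paragraph above, and treating 429, 502, and 504 as retryable is my assumption. The right set depends on the API you're calling.

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetryable reports whether an HTTP status suggests a transient failure.
// The set below is an assumption; check it against the API you call.
func isRetryable(status int) bool {
	switch status {
	case http.StatusTooManyRequests, // 429: the server asked us to slow down
		http.StatusBadGateway,         // 502
		http.StatusServiceUnavailable, // 503
		http.StatusGatewayTimeout:     // 504
		return true
	}
	// Everything else, including 400 Bad Request, fails the same way on
	// every attempt, so retrying only adds load.
	return false
}

func main() {
	for _, status := range []int{400, 404, 429, 503} {
		fmt.Printf("%d retryable: %v\n", status, isRetryable(status))
	}
}
```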

Idempotency

Here's where a lot of systems develop subtle, nasty bugs.

You add retry logic to your payment service. It works great. Then one day you discover some users got charged twice. What happened? The payment processor received the request, processed the payment successfully, but the network dropped before it could send back the response. Your service, seeing a timeout, assumed failure and retried. The processor, having no way to know this was a retry, processed it again.

Retry without idempotency is a data corruption bug.

That's not an exaggeration. This class of bug is hard to catch in testing because it only appears when a failure lands mid-operation, which is hard to reproduce without deliberate fault injection. But in production, it will happen.

Idempotency means that calling the same operation multiple times produces the same result as calling it once. For a payment, that means the money moves exactly once, no matter how many times you try. For a database write, it means the record gets created once, not duplicated.

The standard approach is idempotency keys. When flowpay initiates a payment, it generates a unique key for that operation and sends it with the request. If the processor receives a request with a key it's already seen and processed, it returns the previous result instead of executing again. On our side, we store the key with the transaction state — so if we're the ones retrying, we check whether we already recorded a success before firing the request again.
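To make the processor side of that concrete, here's a toy in-memory version. It is not flowpay's code or any real processor's API. The `Processor` type, the map-based store, and the key format are all hypothetical, and a real system would persist keys in a durable store with an expiry.

```go
package main

import (
	"fmt"
	"sync"
)

// Result is what the toy processor returns for a charge.
type Result struct {
	ChargeID string
	Amount   int64 // minor units, e.g. cents
}

// Processor is an in-memory stand-in for a payment processor.
type Processor struct {
	mu      sync.Mutex
	seen    map[string]Result // idempotency key -> result of the first attempt
	charges int               // how many times money actually moved
}

func NewProcessor() *Processor {
	return &Processor{seen: make(map[string]Result)}
}

// Charge moves money at most once per key. A repeated key returns the
// stored result instead of charging again.
func (p *Processor) Charge(key string, amount int64) Result {
	p.mu.Lock()
	defer p.mu.Unlock()
	if r, ok := p.seen[key]; ok {
		return r
	}
	p.charges++
	r := Result{ChargeID: fmt.Sprintf("ch_%d", p.charges), Amount: amount}
	p.seen[key] = r
	return r
}

func main() {
	p := NewProcessor()
	key := "order-1234" // hypothetical; real keys are usually random, e.g. a UUID
	first := p.Charge(key, 2500)
	retry := p.Charge(key, 2500) // simulated retry after a lost response
	fmt.Println(first == retry, p.charges) // prints: true 1
}
```

The important detail is that the key is generated once per logical operation and reused across every retry of it. Generating a fresh key per attempt defeats the whole mechanism.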

This is one of those things that sounds simple and is mildly annoying to implement correctly. But it's not optional. Any system where retries and money (or state changes that must happen exactly once) coexist needs idempotency.

Timeouts

This one feels obvious, but the implications run deeper than most people realize.

The instinct to "wait a bit longer" when a service is slow seems harmless. But in a distributed system, waiting has a cost. Every request that's waiting is holding a goroutine, a thread, a connection from a pool. Your server has a finite number of those.

Here's a concrete scenario. Service A calls Service B. Service B is slow — maybe it's under load, maybe it's hitting a slow database query. Service A waits. While waiting, new requests come in to Service A. They also call Service B and start waiting. Your thread pool starts to fill up. New incoming requests to Service A start queuing. Eventually, Service A runs out of threads and begins rejecting requests. Service A is now effectively down — not because it's broken, but because it's waiting on Service B.

This is cascading failure, and it's how one slow service takes down your entire system.

The fix is straightforward: set timeouts everywhere. If a call to an external service doesn't complete in 2 seconds, cancel it. Fail fast, free the resources, return an error to the caller. A controlled failure is infinitely better than a hung system.

In dispatchGO, every call to an external service goes out with a timeout. We use Go's context.WithTimeout to enforce this — the context carries a deadline, and if the external call doesn't finish in time, the context cancels it automatically. I won't go deep into Go's context system here (that's its own post), but the important point is: the timeout is always set before the call is made, never assumed.
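As a rough illustration of that shape (not dispatchGO's actual code), the sketch below wraps a simulated slow dependency. `callExternal`, `callWithTimeout`, and `timedOut` are hypothetical names; `context.WithTimeout` and `context.DeadlineExceeded` are the standard library pieces doing the real work.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callExternal is a hypothetical stand-in for a slow dependency. It finishes
// after latency unless the context is cancelled first.
func callExternal(ctx context.Context, latency time.Duration) error {
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// callWithTimeout attaches the deadline before the call is made.
func callWithTimeout(parent context.Context, timeout, latency time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel() // release the timer even when the call returns early
	return callExternal(ctx, latency)
}

// timedOut reports whether a call with the given latency hits the deadline.
func timedOut(timeout, latency time.Duration) bool {
	err := callWithTimeout(context.Background(), timeout, latency)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(timedOut(50*time.Millisecond, 500*time.Millisecond)) // slow dependency
	fmt.Println(timedOut(500*time.Millisecond, 5*time.Millisecond))  // healthy dependency
}
```

The timeout only helps if the callee actually watches `ctx.Done()`. Most of the standard library's network and database calls do when you pass them the context.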

Circuit Breaker

Timeouts and retries help, but they have a failure mode: if Service B is completely down and you're retrying every failed request with backoff, you're still generating load against a broken service and still tying up resources waiting on calls that will never succeed.

The circuit breaker pattern solves this.

Think of it like an electrical circuit breaker in your house. When current gets too high — when something's wrong — the breaker trips and cuts the circuit. It doesn't keep trying. It stops the flow and waits for things to stabilize.

The pattern has three states:

Closed is the normal state. Requests flow through. Failures get counted. If failures stay below a threshold, nothing changes.

Open means the breaker has tripped. Too many failures happened in too short a window. In this state, the circuit breaker stops making calls to the broken service entirely. Requests fail immediately, without even attempting the network call. This is crucial — it frees resources and stops the cascade. From the outside, it looks like a fast failure rather than a slow one.

Half-Open is the recovery state. After some time in the Open state, the breaker allows a small number of requests through as a probe. If they succeed, the breaker closes and normal operation resumes. If they fail, it opens again and waits longer.
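The three states map onto a fairly small state machine. The sketch below is deliberately simplified and all of its names are hypothetical: it counts consecutive failures rather than failures within a time window, it uses a fixed cooldown instead of waiting longer after each failed probe, and it doesn't limit how many probes run at once in Half-Open. A production breaker, or an existing library, would handle those.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

var (
	ErrOpen = errors.New("circuit breaker is open")
	errDown = errors.New("dependency down") // fake failure for the demo
)

type Breaker struct {
	mu        sync.Mutex
	state     State
	failures  int           // consecutive failures
	threshold int           // failures needed to trip
	cooldown  time.Duration // how long to stay Open before probing
	openedAt  time.Time
}

func NewBreaker(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

// Call runs fn unless the breaker is Open, then records the outcome.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.state == Open {
		if time.Since(b.openedAt) < b.cooldown {
			b.mu.Unlock()
			return ErrOpen // fail fast: fn is never invoked
		}
		b.state = HalfOpen // cooldown elapsed: let a probe through
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.state, b.failures = Closed, 0
		return nil
	}
	b.failures++
	if b.state == HalfOpen || b.failures >= b.threshold {
		b.state, b.openedAt = Open, time.Now()
	}
	return err
}

func main() {
	b := NewBreaker(2, 50*time.Millisecond)
	down := func() error { return errDown }
	up := func() error { return nil }

	fmt.Println(b.Call(down)) // dependency down
	fmt.Println(b.Call(down)) // dependency down (breaker trips here)
	fmt.Println(b.Call(up))   // circuit breaker is open
	time.Sleep(60 * time.Millisecond)
	fmt.Println(b.Call(up)) // <nil> (probe succeeded, breaker closes)
}
```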

The power of this pattern is that it makes failure fast and deliberate. Instead of your service slowly drowning in timeouts and retries against a dead dependency, it fails instantly, recovers gracefully, and doesn't burn resources in the meantime.

This is especially important when Service B is something you don't control — a third-party API, a payment processor, a maps service. You can't fix their problems, but you can stop their problems from becoming your problems.

Rate Limiting

This one comes at the problem from a different angle. All the patterns above are about what you do when something you depend on fails. Rate limiting is about protecting yourself from being overwhelmed.

Your service has finite capacity. Under normal load, everything's fine. But what happens when you get a traffic spike? A thundering herd from a mobile app update? A misbehaving client that's hammering your API in a loop? Without rate limiting, your service either handles it (unlikely) or falls over.

Rate limiting says: I will serve requests up to a certain rate, and requests beyond that get rejected or queued.

In holdUp, we use token bucket middleware for rate limiting. The token bucket model works like this: there's a bucket with a maximum capacity of tokens. Every request consumes a token. Tokens are added back to the bucket at a steady rate. If the bucket is empty when a request comes in, the request is rejected.

This model handles two distinct concerns well. The steady rate — how many tokens are added per second — caps your sustained throughput. But the burst capacity — the size of the bucket — lets you handle short spikes. If traffic is normally low but occasionally spikes, a client can "spend" accumulated tokens on the burst. Once the bucket's empty, excess requests get dropped.
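Here's a small sketch of a token bucket with an HTTP middleware wrapper, to show how the two knobs (refill rate and bucket size) appear in code. This is not holdUp's implementation. The type and function names are hypothetical, and it uses a single global bucket where a real service would more likely keep one per client.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

type TokenBucket struct {
	mu       sync.Mutex
	capacity float64 // bucket size: the largest burst allowed
	rate     float64 // tokens added per second: the sustained throughput
	tokens   float64
	last     time.Time // when tokens was last updated
}

// NewTokenBucket starts with a full bucket.
func NewTokenBucket(capacity, rate float64) *TokenBucket {
	return &TokenBucket{capacity: capacity, rate: rate, tokens: capacity, last: time.Now()}
}

// Allow refills based on elapsed time, then tries to spend one token.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// RateLimit rejects requests with 429 when the bucket is empty.
func RateLimit(b *TokenBucket, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !b.Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	b := NewTokenBucket(3, 1) // burst of 3, refill 1 token per second
	for i := 1; i <= 5; i++ {
		fmt.Printf("request %d allowed: %v\n", i, b.Allow())
	}
}
```

Refilling lazily inside `Allow`, based on elapsed time, avoids running a background ticker per bucket.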

The key insight is that rate limiting isn't about being hostile to your users. It's about staying alive under load so you can serve all your users, not just the ones who happened to arrive before the service fell over.

How These Fit Together

Here's something I want to be clear about: these are not independent tools you can pick from a menu. They work together, and they fail dangerously if you use them in isolation.

Retry without idempotency is how you corrupt your data. Retry without backoff is how you create a retry storm. Retry without a circuit breaker is how you keep hammering a dead service and exhaust your resource pool. Timeouts without proper error handling mean you fail fast, but the caller still gets a raw error instead of a controlled response.

The relationship looks like this: when a call fails, you retry — but with backoff so you don't create a spike. Before retrying, you check the circuit breaker — if it's open, you fail immediately and don't bother the broken service. When you do retry, you use idempotency keys so the operation only happens once. And every call has a timeout so no single slow dependency can hold your threads hostage.
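That flow can be sketched as a single wrapper around the remote call. Everything here is hypothetical scaffolding: the `Breaker` interface, `Policy`, `Do`, and the fake dependency in `simulate` exist only to show the ordering. Jitter and the retryable-error check are left out to keep it short.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var ErrBreakerOpen = errors.New("breaker open")

// Breaker is the minimal surface this sketch needs from a circuit breaker.
type Breaker interface {
	Allow() bool
	Record(err error)
}

// Op is one attempt at the remote call. Every attempt gets the same key.
type Op func(ctx context.Context, idempotencyKey string) error

type Policy struct {
	MaxAttempts int
	Timeout     time.Duration // per attempt
	BaseDelay   time.Duration // doubles after each failed attempt
}

func Do(ctx context.Context, b Breaker, p Policy, key string, op Op) error {
	var err error
	delay := p.BaseDelay
	for attempt := 0; attempt < p.MaxAttempts; attempt++ {
		if !b.Allow() {
			return ErrBreakerOpen // fail fast, don't touch the dependency
		}
		attemptCtx, cancel := context.WithTimeout(ctx, p.Timeout)
		err = op(attemptCtx, key)
		cancel()
		b.Record(err)
		if err == nil {
			return nil
		}
		if attempt == p.MaxAttempts-1 {
			break // out of attempts, no point sleeping
		}
		select {
		case <-time.After(delay): // back off before the next attempt
		case <-ctx.Done():
			return ctx.Err()
		}
		delay *= 2
	}
	return err
}

// fixedBreaker is a placeholder that is either always open or always closed.
type fixedBreaker struct{ open bool }

func (b *fixedBreaker) Allow() bool  { return !b.open }
func (b *fixedBreaker) Record(error) {}

// simulate runs Do against a fake dependency that fails failFirst times.
func simulate(failFirst int, breakerOpen bool) (calls int, err error) {
	op := func(ctx context.Context, key string) error {
		calls++
		if calls <= failFirst {
			return errors.New("transient failure")
		}
		return nil
	}
	p := Policy{MaxAttempts: 4, Timeout: time.Second, BaseDelay: time.Millisecond}
	err = Do(context.Background(), &fixedBreaker{open: breakerOpen}, p, "key-123", op)
	return calls, err
}

func main() {
	fmt.Println(simulate(2, false)) // succeeds on the third attempt
	fmt.Println(simulate(0, true))  // breaker open: dependency never called
}
```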

Timeouts protect your resources. Backoff distributes load over time. Idempotency makes retries safe. Circuit breakers make failure fast. Rate limiting keeps the whole system within its operational envelope.

Pull any one of these out and the others become less effective, or actively harmful.

Conclusion

Anyone can build a system that works when everything goes right. That's the easy part. You write the code, it passes tests, it works in staging. Done.

The hard part — and the part that separates systems you'd trust with your users from systems you'd be embarrassed to deploy — is what happens under failure. When the payment processor goes down. When an external API starts timing out. When traffic spikes 10x and your service has to decide who it can still serve.

The difference between a demo and a production system is how it behaves when things go wrong.

Reliability isn't a feature you add at the end. It's a design philosophy you carry through every decision — every call site, every retry, every timeout value. It's the discipline of assuming things will break and building so that when they do, they break in controlled, recoverable ways.

Build systems that survive. That's the job.
