Skip to main content

Command Palette

Search for a command to run...

Why I Chose Token Bucket for HoldUp (And Why the Others Didn't Make the Cut)

Updated
9 min read
B
Backend engineer building reliable distributed systems in Go. Event-driven architecture, gRPC services, and systems that hold up under pressure.

Most engineers add rate limiting because someone told them to, or because they got hit with an abuse spike and panicked. They slap on a library, set a number, and call it done. The question of which algorithm — and more importantly, why — barely comes up.

I've been in that position. But when I built HoldUp, a production-oriented rate limiter in Go, I had to actually sit down and think about it. Because the algorithm you choose isn't a cosmetic decision. It determines how your system behaves under pressure, whether your legitimate users get a good experience, and whether your service survives a bad one.

Rate limiting is not about blocking requests. It's about shaping traffic.

That framing changed how I approached the whole problem.


The Real Problem With Traffic

Here's what nobody tells you: systems don't usually fail because they get too much traffic. They fail because they get uncontrolled traffic. There's a difference.

Steady load is manageable. You can provision for it, scale for it, observe it. The thing that kills services is spikes — sudden bursts where 10x normal traffic arrives in two seconds. Your connection pool exhausts. Your database gets hammered. Latency spikes. A slow service becomes a dead one.

But here's the nuance that most naive rate limiters miss: not all spikes are abuse. A legitimate user might open your app after being offline for a while and generate a burst of requests. A mobile client might retry aggressively after recovering from a network drop. These are real usage patterns, and if your rate limiter can't distinguish between a legitimate burst and an abusive one, you end up punishing real users.

So the real requirements, as I saw them when building HoldUp, were:

  • Handle sustained load without letting any single client overwhelm the system

  • Allow short bursts because real users need them

  • Be fair without being rigid

  • Stay simple enough to reason about under production pressure

That last one matters more than people think. A rate limiter you can't debug at 2 AM is a liability.


What HoldUp Actually Does

HoldUp is a middleware-based rate limiter written in Go. The design goals were correctness under concurrency, predictable behavior, and simplicity. No background goroutines spinning in the dark, no complex state machines, no distributed coordination (at least for the initial version).

Each client gets a bucket. The bucket has a capacity and a refill rate, pulled from environment-based config. When a request comes in, the middleware checks the bucket, tries to consume a token, and either lets the request through or rejects it. The whole thing is mutex-protected — straightforward, easy to audit.

The refill is lazy. There's no ticker goroutine adding tokens in the background. Instead, when a request arrives, we calculate how many tokens should have accumulated since the last request based on elapsed time and add them then. Fewer moving parts, less to go wrong.

With a config like capacity = 10, refill = 1 token/sec, a new client can immediately make 10 requests in a burst, then sustain 1 request per second indefinitely. That's the behavior I wanted. The question was whether the algorithm matched the intent.


Choosing the Algorithm

There are four algorithms that come up every time rate limiting is discussed: Fixed Window, Sliding Window, Leaky Bucket, and Token Bucket. They all limit traffic, but they have meaningfully different behavior. Choosing between them is a trade-off decision, not a correctness decision.

Let me walk through how I thought about each one.


Fixed Window — Simple, But Has a Sharp Edge

Fixed Window is the obvious starting point. You pick a window — say, one minute — and allow N requests per window. Simple to implement, simple to explain.

The problem is what happens at the boundary.

Imagine your limit is 100 requests per minute. A client sends 100 requests at 11:59 and another 100 at 12:00. Both windows are fresh, both are valid. The client just sent 200 requests in two seconds and your limiter was fine with it. That boundary spike is a real problem for any system where the limit is supposed to mean something.

For HoldUp, this was a dealbreaker. I needed the limit to reflect actual sustained behavior, not just per-window accounting.


Sliding Window — More Accurate, More Expensive

Sliding Window fixes the boundary problem. Instead of resetting at fixed intervals, you look at the last N seconds from now and count requests in that rolling window. No boundary spikes, more accurate enforcement.

But accuracy has a cost. To implement sliding window properly, you need to track the timestamp of every recent request. As request volume grows, memory and computation grow with it. Implementations often approximate this with sliding window counters — bucketing requests into smaller sub-windows — which trades some accuracy back for performance.

For a distributed system with Redis, sliding window is often the right call. But HoldUp's initial scope was in-process, per-service rate limiting. The added complexity wasn't buying me enough to justify it. Sliding window is the right answer to a slightly different problem.


Leaky Bucket — Great Smoothing, Wrong Trade-off

Leaky Bucket is elegant in its own way. Requests go into a queue (the bucket) and are processed at a fixed, constant rate — like water dripping out of a hole. The output is perfectly smooth. No spikes, no bursts, just steady throughput.

If you're building something like a billing system or a queue processor where you need to enforce a strict output rate, Leaky Bucket is excellent. But for a user-facing HTTP API, it's the wrong mental model.

Real users aren't machines. They don't send requests at perfectly metered intervals. They open a page, load a bunch of resources at once, then go quiet for a while. Leaky Bucket would throttle that burst even though the user's average request rate is well within limits. You end up degrading the experience of legitimate users to protect against abuse patterns they're not exhibiting.

That's not the trade-off I wanted to make.


Token Bucket — Why It Won

Token Bucket is the one that matched my mental model of how real traffic works.

The idea is simple: there's a bucket that holds tokens. Each request consumes one. Tokens are added back at a steady rate, up to the bucket's capacity. If the bucket is empty, the request is rejected.

The key insight is that this naturally handles two separate concerns at once. The refill rate enforces the long-term average — a client can only sustain as many requests per second as the refill rate allows. The capacity allows short bursts — if a client has been quiet, tokens accumulate, and they can spend them all at once.

With HoldUp's config of capacity = 10, refill = 1/sec, a client who's been idle for 10 seconds walks in with a full bucket and can fire 10 requests immediately. After that burst, they're back to 1 request per second. That's the behavior I wanted: burst-tolerant but abuse-resistant.

It also maps cleanly onto HTTP semantics. When a bucket is empty, you return a 429 Too Many Requests. The client knows to back off. The Retry-After header tells them when their tokens will replenish. It's a complete, communicable contract.

And critically for HoldUp, it was implementable correctly in Go without fighting the concurrency model. A mutex-protected struct with a token count and a last-refill timestamp. The lazy refill logic is a handful of lines. There's nothing there that's hard to debug or reason about.


Where Token Bucket Is The Right Call

Token Bucket works best when you're building user-facing APIs where bursts are a feature, not a bug. Public APIs, mobile backends, anything where clients have natural bursty patterns. It gives you a clean lever to allow legitimate spikes while still enforcing sustainable long-term rates. It's also easy to tune — adjusting capacity and refill rate gives you direct, predictable control over behavior.

For HoldUp's use case — a Go middleware layer protecting HTTP services — it was the obvious fit once I thought through the alternatives.


Where It Might Not Be

I want to be honest about the limits, because ignoring them would be bad engineering.

Token Bucket as I've implemented it in HoldUp is in-process. Each instance of your service has its own set of buckets, with no coordination between instances. If you're running five replicas, a client could hit five times their rate limit by spreading requests across instances. For strict per-user enforcement at scale, you need a shared store — typically Redis — and that changes the implementation meaningfully.

For systems with very strict fairness requirements — where every user must get exactly their allocation and no more — the bursty nature of Token Bucket can also be a problem. If fairness at the microsecond level matters, Leaky Bucket or a more precise sliding window might be worth the complexity.

Token Bucket is not the universally correct answer. It's the right answer for a specific set of problems, which happen to be very common ones.


The Actual Lesson

When I set out to build HoldUp, I didn't start with "I'll use Token Bucket." I started with the behavior I needed — burst tolerance, sustained rate enforcement, simplicity — and worked backwards to the algorithm that delivered it.

That's the way to think about any algorithm selection. Not "which one is most sophisticated" but "which one behaves the way my system needs to behave." Fixed Window is not inferior to Token Bucket. Leaky Bucket is not worse than Sliding Window. They solve different problems, and reaching for the wrong one because it sounds more impressive is how you build systems that behave surprisingly in production.

Good engineering is not about using advanced algorithms. It's about choosing the right one for the problem.

HoldUp uses Token Bucket because the problem called for it. If the problem changes — distributed deployment, strict per-user fairness, metered output — the algorithm might need to change too. That's not a failure. That's the system evolving with its requirements.

Know what your system needs. Then choose accordingly.