Master API rate limiting using the token bucket algorithm to protect your multi-tenant SaaS. Learn to handle distributed traffic shaping with zero downtime.
During a recent on-call rotation, I watched a single tenant’s misconfigured cron job saturate our entire ingestion service. It wasn't a malicious attack, just a "noisy neighbor" scenario that effectively brought our platform to its knees for about 45 minutes. We had basic load balancing in place, but we lacked the granular control necessary to isolate usage at the tenant level.
If you’re running a multi-tenant SaaS, you know that "fairness" isn't just a policy—it’s an architectural requirement. Implementing robust API rate limiting is the only way to ensure one user's bursty traffic doesn't degrade the experience for everyone else.
We first tried a simple fixed-window counter. It was easy to implement in Redis, but it suffered from the "boundary problem." If a user sent all their allowed requests at the very end of one minute and the start of the next, they essentially doubled their quota in a tiny window. It caused massive spikes that our downstream services couldn't handle.
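To make the boundary problem concrete, here is a minimal in-memory sketch of a fixed-window counter. It is not our Redis implementation; the function name `createFixedWindowLimiter` and the quota of 100 requests per minute are assumptions chosen for illustration.

```javascript
// Minimal in-memory fixed-window counter (illustrative only, not the Redis version).
function createFixedWindowLimiter(limit, windowMs) {
  const counts = new Map(); // window index -> number of accepted requests
  return function allow(nowMs) {
    const windowIndex = Math.floor(nowMs / windowMs);
    const used = counts.get(windowIndex) || 0;
    if (used >= limit) return false;
    counts.set(windowIndex, used + 1);
    return true;
  };
}

// Hypothetical quota: 100 requests per 60-second window.
const allow = createFixedWindowLimiter(100, 60_000);

let accepted = 0;
// 100 requests in the last second of the first window...
for (let i = 0; i < 100; i++) if (allow(59_000 + i)) accepted++;
// ...and 100 more in the first tenth of a second of the next window.
for (let i = 0; i < 100; i++) if (allow(60_000 + i)) accepted++;

console.log(accepted); // 200: twice the quota, accepted within about 1.1 seconds
```

Each window is honored in isolation, yet the downstream service sees double the intended rate across the boundary.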
We switched to the token bucket algorithm. Unlike fixed windows, token buckets allow for controlled bursts. Imagine a bucket that holds a maximum number of tokens; every request consumes one. Tokens are added at a fixed rate. If the bucket is empty, the request is rejected. This provides a natural form of traffic shaping that smooths out spikes while allowing legitimate users some breathing room.
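The description above can be sketched as a single-process class before any Redis is involved. This is a simplified illustration, not our production code; the class name `TokenBucket` and the explicit `nowMs` parameter (used so the behavior is easy to follow with fixed timestamps) are assumptions.

```javascript
// Single-process token bucket sketch. Time is passed in explicitly.
class TokenBucket {
  constructor(maxTokens, refillPerSecond, nowMs = 0) {
    this.maxTokens = maxTokens;
    this.refillPerSecond = refillPerSecond;
    this.tokens = maxTokens; // a new bucket starts full
    this.lastRefillMs = nowMs;
  }

  tryConsume(nowMs) {
    // Add the tokens earned since the last call, capped at the bucket size.
    const elapsedSec = Math.max(0, (nowMs - this.lastRefillMs) / 1000);
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // bucket empty: reject
  }
}

// Bucket of 5 tokens, refilled at 1 token per second.
const bucket = new TokenBucket(5, 1);

// A burst of 7 requests at t=0: the first 5 pass, the last 2 are rejected.
const burst = Array.from({ length: 7 }, () => bucket.tryConsume(0));
console.log(burst.filter(Boolean).length); // 5

// Two seconds later, 2 tokens have been refilled.
const later = [bucket.tryConsume(2000), bucket.tryConsume(2000), bucket.tryConsume(2000)];
console.log(later); // [ true, true, false ]
```

The bucket size controls how large a burst is tolerated, while the refill rate sets the sustained throughput.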
In a multi-tenant architecture, you can't just apply a global limit. You need to bucketize your traffic based on a tenant ID. When a request hits our gateway, we extract the tenant_id from the JWT and perform a lookup.
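As a rough sketch of that gateway step, the middleware below assumes an earlier layer has already verified the JWT and attached its claims to `req.auth`. That property name, the injected `allowRequest` and `configFor` functions, and the 401 response for a missing claim are all assumptions for illustration, not a description of our actual gateway.

```javascript
// Express-style middleware sketch. Assumes `req.auth` holds already-verified JWT claims.
// `allowRequest(tenantId, config)` and `configFor(tenantId)` are injected by the caller.
function rateLimitMiddleware({ allowRequest, configFor }) {
  return async function rateLimit(req, res, next) {
    const tenantId = req.auth && req.auth.tenant_id;
    if (!tenantId) {
      // Without a tenant we cannot pick a bucket, so refuse the request.
      res.statusCode = 401;
      res.end('missing tenant_id claim');
      return;
    }

    // The limiter may return 1/0 or true/false; both work with the check below.
    const allowed = await allowRequest(tenantId, configFor(tenantId));
    if (!allowed) {
      res.statusCode = 429;
      res.end('Too Many Requests');
      return;
    }
    next();
  };
}
```

Injecting the limiter keeps the middleware independent of where the bucket state lives.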
Here is how we handle this in a distributed Node.js environment using Redis:
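```javascript
// `redis` is assumed to be an already-connected client whose eval() takes
// (script, numKeys, ...keys, ...args), as ioredis does.
async function allowRequest(tenantId, bucketConfig) {
  const key = `ratelimit:${tenantId}`;
  const now = Date.now();

  // Lua script ensures atomicity in Redis
  // ARGV[1] = max tokens, ARGV[2] = current time in ms, ARGV[3] = refill rate (tokens/sec)
  const script = `
    local bucket = redis.call('hmget', KEYS[1], 'tokens', 'last_refill')
    -- A missing bucket starts full, with last_refill set to now
    local tokens = tonumber(bucket[1] or ARGV[1])
    local last_refill = tonumber(bucket[2] or ARGV[2])
    local refill_rate = tonumber(ARGV[3])
    local max_tokens = tonumber(ARGV[1])

    -- Refill based on elapsed seconds, capped at the bucket size
    local elapsed = math.max(0, (ARGV[2] - last_refill) / 1000)
    tokens = math.min(max_tokens, tokens + (elapsed * refill_rate))

    if tokens >= 1 then
      redis.call('hmset', KEYS[1], 'tokens', tokens - 1, 'last_refill', ARGV[2])
      return 1
    else
      return 0
    end
  `;

  // Resolves to 1 if the request is allowed, 0 if it should be rejected
  return await redis.eval(script, 1, key, bucketConfig.max, now, bucketConfig.rate);
}
```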
By using a Lua script, we ensure the read-modify-write cycle is atomic. If you try to do this with separate GET and SET commands, you'll run into race conditions where two concurrent requests both think there's one token left.
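The race is easy to reproduce without Redis. The sketch below uses a fake asynchronous store (an assumption standing in for separate GET and SET round trips) and a deliberately naive limiter; both names, `createFakeStore` and `naiveAllow`, are hypothetical.

```javascript
// Fake async key-value store standing in for Redis. Each call yields to the
// event loop once, which is enough to interleave two concurrent requests.
function createFakeStore(initialTokens) {
  let tokens = initialTokens;
  return {
    get: async () => { await Promise.resolve(); return tokens; },
    set: async (value) => { await Promise.resolve(); tokens = value; },
  };
}

// Deliberately broken: the read and the write are two separate operations.
async function naiveAllow(store) {
  const tokens = await store.get();   // read
  if (tokens < 1) return false;
  await store.set(tokens - 1);        // write, based on a possibly stale read
  return true;
}

async function demo() {
  const store = createFakeStore(1);   // exactly one token left
  const results = await Promise.all([naiveAllow(store), naiveAllow(store)]);
  return { results, remaining: await store.get() };
}

demo().then(({ results, remaining }) => {
  console.log(results, remaining); // [ true, true ] 0: two requests spent one token
});
```

Both calls read the single remaining token before either write lands, which is exactly the interleaving the Lua script rules out.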
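Managing state across multiple nodes is the hardest part. If you have 20 pods running, you can't keep local in-memory buckets for every tenant: each pod would only see its own share of the traffic, so a tenant whose requests are spread across all of them could get up to 20 times the intended limit.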
We centralized our state in a dedicated Redis cluster. While this introduces a network hop, the latency is usually around 1-2ms, which is negligible compared to the downstream database calls. If you find your Redis cluster becoming the bottleneck, you might consider rate limiting at the edge as well, so that obvious abuse is filtered out before it even touches your primary application logic.
If you're dealing with high-volume microservices, consider these trade-offs:
Once you have your buckets in place, don't just return a 429 Too Many Requests. Be a good citizen. Return a Retry-After header so clients know when to back off. We've found that pairing this with adaptive backoff strategies helps well-behaved clients recover gracefully without overwhelming the system again.
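One way to derive the Retry-After value is from the bucket state itself. The sketch below assumes the limiter can report the current (possibly fractional) token count, which a script returning only 1 or 0 would need to be extended to do; the function names and the response shape are hypothetical.

```javascript
// Seconds until the bucket holds at least one whole token again.
// `tokens` is the current fractional token count, `refillPerSecond` the refill rate.
function retryAfterSeconds(tokens, refillPerSecond) {
  if (tokens >= 1) return 0;
  // Retry-After takes a whole number of seconds, so round up.
  return Math.ceil((1 - tokens) / refillPerSecond);
}

// Hypothetical response shape for a throttled request.
function buildThrottledResponse(tokens, refillPerSecond) {
  return {
    status: 429,
    headers: { 'Retry-After': String(retryAfterSeconds(tokens, refillPerSecond)) },
    body: { error: 'rate_limited' },
  };
}

console.log(retryAfterSeconds(0.25, 0.5)); // 2 (0.75 tokens missing at 0.5 tokens/sec)
console.log(buildThrottledResponse(0, 10).headers); // { 'Retry-After': '1' }
```

Rounding up avoids telling a client to retry slightly before a token is actually available.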
Also, remember that not all requests are created equal. A GET /health is vastly cheaper than a POST /reports/generate. We currently use different buckets for different route categories. It’s more complex to maintain, but it prevents a heavy report generation request from blocking a simple metadata fetch.
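A minimal sketch of per-category buckets might look like the following. The category names, the limits, and the routing rule are made up for illustration; the key format simply extends the `ratelimit:${tenantId}` convention with a category suffix.

```javascript
// Hypothetical categories and limits; the numbers are illustrative only.
const CATEGORY_LIMITS = {
  cheap: { max: 100, rate: 50 },     // e.g. metadata fetches, health checks
  expensive: { max: 5, rate: 0.1 },  // e.g. report generation
};

// Hypothetical routing rules; anything unmatched falls into "cheap".
const ROUTE_RULES = [
  { method: 'POST', prefix: '/reports/generate', category: 'expensive' },
];

function categorize(method, path) {
  const rule = ROUTE_RULES.find((r) => r.method === method && path.startsWith(r.prefix));
  return rule ? rule.category : 'cheap';
}

// Each (tenant, category) pair gets its own bucket key and its own limits.
function bucketFor(tenantId, method, path) {
  const category = categorize(method, path);
  return {
    key: `ratelimit:${tenantId}:${category}`,
    config: CATEGORY_LIMITS[category],
  };
}

console.log(bucketFor('acme', 'POST', '/reports/generate').key); // ratelimit:acme:expensive
console.log(bucketFor('acme', 'GET', '/health').key);            // ratelimit:acme:cheap
```

Because the buckets are separate keys, exhausting the expensive bucket leaves the cheap one untouched.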
Looking back, we spent too much time trying to build a "one size fits all" configuration. In reality, some tenants have higher tiers and deserve higher burst capacity. We ended up hardcoding limits in our middleware, which meant a redeploy every time we wanted to change a tier.
Next time, I'd move the configuration into a dynamic store like etcd or even a cached database table. If you're currently struggling with spikes, start with a simple token bucket implementation, but keep your configuration decoupled from your code. It’s the difference between a 5-minute fix and a 2-hour deployment cycle.
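To illustrate what decoupled configuration could look like, here is a sketch of a resolver that caches per-tenant limits for a short TTL. `fetchTierLimits` is a hypothetical async loader (it could be backed by etcd or a database table), and the fallback behavior on failure is an assumption rather than something we have run in production.

```javascript
// Resolves per-tenant limits from an external store, cached for `ttlMs`.
// `fetchTierLimits(tenantId)` is a hypothetical async loader supplied by the caller.
function createLimitResolver(fetchTierLimits, ttlMs, defaults, now = Date.now) {
  const cache = new Map(); // tenantId -> { value, expiresAt }

  return async function limitsFor(tenantId) {
    const hit = cache.get(tenantId);
    if (hit && hit.expiresAt > now()) return hit.value;

    let value;
    try {
      // Tenants without an explicit entry get the defaults.
      value = (await fetchTierLimits(tenantId)) || defaults;
    } catch (err) {
      // If the store is unavailable, keep serving the last known value.
      value = hit ? hit.value : defaults;
    }
    cache.set(tenantId, { value, expiresAt: now() + ttlMs });
    return value;
  };
}

// Usage sketch with an in-memory stand-in for the config store.
const tiers = { acme: { max: 500, rate: 100 } };
const limitsFor = createLimitResolver(
  async (tenantId) => tiers[tenantId],
  30_000,                 // re-read the store at most every 30 seconds per tenant
  { max: 50, rate: 10 }   // default tier
);

limitsFor('acme').then((limits) => console.log(limits)); // { max: 500, rate: 100 }
```

With something like this in place, changing a tier becomes a write to the store that takes effect once the cached entry expires, with no redeploy.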