Architecture | June 23, 2026 | 4 min read

API Rate Limiting with Token Bucket Algorithms for Multi-Tenant SaaS

Master API rate limiting using the token bucket algorithm to protect your multi-tenant SaaS. Learn to handle distributed traffic shaping with zero downtime.

API Design, Rate Limiting, Distributed Systems, Redis, SaaS, Backend Engineering, API, Architecture, Backend, System Design

During a recent on-call rotation, I watched a single tenant’s misconfigured cron job saturate our entire ingestion service. It wasn't a malicious attack, just a "noisy neighbor" scenario that effectively brought our platform to its knees for about 45 minutes. We had basic load balancing in place, but we lacked the granular control necessary to isolate usage at the tenant level.

If you're running a multi-tenant SaaS, you know that "fairness" isn't just a policy; it's an architectural requirement. Per-tenant API rate limiting is how you keep one tenant's bursty traffic from degrading the experience for everyone else.

Why the Token Bucket Algorithm?

We first tried a simple fixed-window counter. It was easy to implement in Redis, but it suffered from the "boundary problem." If a user sent their full quota at the very end of one minute and another full quota at the start of the next, they effectively doubled their allowance within a few seconds. Those spikes were more than our downstream services could handle.
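To make the boundary problem concrete, here is a minimal in-memory sketch of a fixed-window counter. The function name, the limit of 100 requests, and the 60-second window are illustrative assumptions, not the Redis implementation we ran.

```javascript
// Hypothetical fixed-window counter: at most `limit` requests per `windowMs`.
function createFixedWindowLimiter(limit, windowMs) {
  const counts = new Map(); // window index -> requests seen in that window
  return function allow(nowMs) {
    const windowIndex = Math.floor(nowMs / windowMs);
    const used = counts.get(windowIndex) || 0;
    if (used >= limit) return false;
    counts.set(windowIndex, used + 1);
    return true;
  };
}

const allowFixed = createFixedWindowLimiter(100, 60_000);

// 100 requests in the last second of minute 0,
// then 100 more in the first second of minute 1.
let accepted = 0;
for (let i = 0; i < 100; i++) if (allowFixed(59_000 + i)) accepted++;
for (let i = 0; i < 100; i++) if (allowFixed(60_000 + i)) accepted++;

console.log(accepted); // 200
```

Every request is accepted, because each batch lands in a different window, even though all 200 arrive within roughly one second of each other.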

We switched to the token bucket algorithm. Unlike fixed windows, token buckets allow for controlled bursts. Imagine a bucket that holds a maximum number of tokens; every request consumes one. Tokens are added at a fixed rate, up to the bucket's capacity. If the bucket is empty, the request is rejected. A burst can never exceed the capacity, and sustained throughput can never exceed the refill rate, so this provides a natural form of traffic shaping that smooths out spikes while allowing legitimate users some breathing room.
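The mechanics can be sketched in a few lines of single-process JavaScript. This is an illustration only: the class name, the capacity of 5, and the refill rate of 1 token per second are assumptions, and the clock is passed in explicitly so the behavior is easy to follow.

```javascript
// Hypothetical in-memory token bucket with an injectable clock (milliseconds).
class TokenBucket {
  constructor(capacity, refillPerSecond, nowMs = 0) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // a new bucket starts full
    this.lastRefillMs = nowMs;
  }

  tryConsume(nowMs) {
    // Add tokens for the time that has passed, capped at capacity.
    const elapsedSec = Math.max(0, (nowMs - this.lastRefillMs) / 1000);
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const bucket = new TokenBucket(5, 1); // burst of 5, then 1 request per second

// Seven requests at t=0: the first five drain the bucket, the rest are rejected.
const burst = [];
for (let i = 0; i < 7; i++) burst.push(bucket.tryConsume(0));
console.log(burst); // [true, true, true, true, true, false, false]

// Two seconds later, two tokens have been refilled.
const later = [bucket.tryConsume(2000), bucket.tryConsume(2000), bucket.tryConsume(2000)];
console.log(later); // [true, true, false]
```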

Implementing Multi-Tenant Architecture

In a multi-tenant architecture, you can't just apply a global limit. You need to bucketize your traffic based on a tenant ID. When a request hits our gateway, we extract the tenant_id from the JWT and look up that tenant's bucket.
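As a rough sketch of that extraction step, the snippet below decodes the JWT payload and derives a bucket key from the tenant_id claim. It deliberately does not verify the signature; it assumes the gateway has already done that upstream. The helper names and the demo token are hypothetical.

```javascript
// Decodes the payload segment of a JWT WITHOUT verifying the signature.
// Assumption: signature verification has already happened upstream.
function tenantIdFromJwt(token) {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("malformed JWT");
  const payload = JSON.parse(Buffer.from(parts[1], "base64url").toString("utf8"));
  if (typeof payload.tenant_id !== "string" || payload.tenant_id === "") {
    throw new Error("JWT has no tenant_id claim");
  }
  return payload.tenant_id;
}

// One bucket per tenant, keyed by tenant ID.
function bucketKeyFor(token) {
  return `ratelimit:${tenantIdFromJwt(token)}`;
}

// Build an unsigned demo token purely for illustration.
const encodeSegment = (obj) => Buffer.from(JSON.stringify(obj)).toString("base64url");
const demoToken = [
  encodeSegment({ alg: "none" }),
  encodeSegment({ tenant_id: "acme" }),
  "signature-placeholder",
].join(".");

console.log(bucketKeyFor(demoToken)); // ratelimit:acme
```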

Here is how we handle this in a distributed Node.js environment using Redis:


JAVASCRIPT
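// Assumption: the ioredis client, whose eval(script, numKeys, ...keysAndArgs)
// signature matches the call below. Connection settings are left at defaults.
const Redis = require("ioredis");
const redis = new Redis();

// Lua script ensures atomicity in Redis.
// KEYS[1] = bucket key
// ARGV[1] = max tokens, ARGV[2] = current time (ms), ARGV[3] = refill rate (tokens/sec)
const TOKEN_BUCKET_SCRIPT = `
  local bucket = redis.call('hmget', KEYS[1], 'tokens', 'last_refill')
  -- A bucket that does not exist yet starts full, with "now" as its last refill
  local tokens = tonumber(bucket[1] or ARGV[1])
  local last_refill = tonumber(bucket[2] or ARGV[2])

  local max_tokens = tonumber(ARGV[1])
  local now = tonumber(ARGV[2])
  local refill_rate = tonumber(ARGV[3])

  -- Refill for the elapsed time, capped at the bucket's capacity
  local elapsed = math.max(0, (now - last_refill) / 1000)
  tokens = math.min(max_tokens, tokens + (elapsed * refill_rate))

  if tokens >= 1 then
    redis.call('hmset', KEYS[1], 'tokens', tokens - 1, 'last_refill', ARGV[2])
    return 1
  else
    return 0
  end
`;

async function allowRequest(tenantId, bucketConfig) {
  const key = `ratelimit:${tenantId}`;
  const now = Date.now();

  const allowed = await redis.eval(
    TOKEN_BUCKET_SCRIPT, 1, key, bucketConfig.max, now, bucketConfig.rate
  );
  return allowed === 1; // true if a token was consumed
}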

By using a Lua script, we ensure the read-modify-write cycle is atomic. If you try to do this with separate GET and SET commands, you'll run into race conditions where two concurrent requests both think there's one token left.
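The race is easy to reproduce without Redis. The sketch below uses a plain Map as a stand-in for the store and interleaves two requests by hand, the way two pods might interleave in practice. All names here are illustrative.

```javascript
// In-memory stand-in for the shared store; one token left.
const store = new Map([["tokens", 1]]);

// Non-atomic access: the read and the write are separate steps.
const readTokens = () => store.get("tokens");
const writeTokens = (n) => store.set("tokens", n);

// Requests A and B both read before either one writes.
const seenByA = readTokens(); // 1
const seenByB = readTokens(); // 1, because A has not written yet
let admitted = 0;
if (seenByA >= 1) { writeTokens(seenByA - 1); admitted++; }
if (seenByB >= 1) { writeTokens(seenByB - 1); admitted++; }
console.log(admitted); // 2, although only one token existed

// Atomic version: check and decrement happen as one indivisible step,
// which is the guarantee a Lua script gives you inside Redis.
function takeTokenAtomically() {
  const tokens = store.get("tokens");
  if (tokens < 1) return false;
  store.set("tokens", tokens - 1);
  return true;
}

store.set("tokens", 1);
const atomicAdmitted = [takeTokenAtomically(), takeTokenAtomically()].filter(Boolean).length;
console.log(atomicAdmitted); // 1
```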

Distributed Systems Challenges

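Managing state across multiple nodes is the hardest part. If you have 20 pods running, you can't keep independent in-memory buckets for every tenant on each pod. Every pod would enforce the limit on its own, so a tenant whose traffic is spread across all 20 could get up to 20 times its quota.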
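A small simulation shows how quickly per-pod state drifts from the intended limit. It uses a simple per-pod counter rather than a full token bucket to keep the arithmetic obvious. The pod count comes from the example above; the limit of 100 and the round-robin load balancing are assumptions.

```javascript
const PODS = 20;
const LIMIT_PER_TENANT = 100; // the limit we *intend* to enforce globally

// Each pod keeps its own private counter per tenant.
const podCounters = Array.from({ length: PODS }, () => new Map());

function allowOnPod(podIndex, tenantId) {
  const counter = podCounters[podIndex];
  const used = counter.get(tenantId) || 0;
  if (used >= LIMIT_PER_TENANT) return false;
  counter.set(tenantId, used + 1);
  return true;
}

// 5000 requests from one tenant, spread round-robin across the pods.
let acceptedTotal = 0;
for (let i = 0; i < 5000; i++) {
  if (allowOnPod(i % PODS, "tenant-a")) acceptedTotal++;
}

console.log(acceptedTotal); // 2000, i.e. 20x the intended limit of 100
```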

We centralized our state in a dedicated Redis cluster. While this introduces a network hop, the latency is usually around 1-2ms, which is negligible compared to the downstream database calls. If you find your Redis cluster becoming the bottleneck, you might consider rate limiting at the edge, which protects your downstream services by filtering out obvious abuse before it even touches your primary application logic.

If you're dealing with high-volume microservices, consider these trade-offs:

Centralized Redis: Easy to manage, consistent, but creates a single point of failure.
Local Buckets with Gossip: Higher performance, but eventually consistent. Tenants might occasionally exceed limits during a partition.
Hybrid: Use local limits for quick filtering and a centralized Redis store for the final enforcement.
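The hybrid option could look roughly like the sketch below: a cheap per-pod window counter rejects obvious floods, and only the remaining traffic reaches the central check. This is one possible shape, not something we ran in production. `centralCheck` is a hypothetical stand-in for a call such as the Redis-backed limiter; in a real system it would return a Promise, so callers should `await` the result.

```javascript
// Hypothetical two-stage limiter: local coarse filter, then central enforcement.
function createHybridLimiter({ localMaxPerWindow, windowMs, centralCheck }) {
  const local = new Map(); // tenantId -> { windowIndex, count }

  return function allow(tenantId, nowMs) {
    const windowIndex = Math.floor(nowMs / windowMs);
    let entry = local.get(tenantId);
    if (!entry || entry.windowIndex !== windowIndex) {
      entry = { windowIndex, count: 0 };
      local.set(tenantId, entry);
    }
    entry.count++;

    // Stage 1: reject locally without touching the central store.
    if (entry.count > localMaxPerWindow) return false;

    // Stage 2: the central store makes the final decision.
    return centralCheck(tenantId);
  };
}

// Synchronous stand-in for the central store, holding 3 tokens.
let centralCalls = 0;
let centralTokens = 3;
const allowHybrid = createHybridLimiter({
  localMaxPerWindow: 5,
  windowMs: 1000,
  centralCheck: () => {
    centralCalls++;
    if (centralTokens > 0) { centralTokens--; return true; }
    return false;
  },
});

const results = [];
for (let i = 0; i < 8; i++) results.push(allowHybrid("tenant-a", 0));

console.log(results);      // [true, true, true, false, false, false, false, false]
console.log(centralCalls); // 5, because the last three requests never left the pod
```

The local threshold should be looser than the central limit; its job is to shed floods cheaply, not to enforce the quota.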

Beyond Simple Throttling

Once you have your buckets in place, don't just return a 429 Too Many Requests. Be a good citizen. Return a Retry-After header so clients know when to back off. We've found that pairing this with the approach described in "API Throttling: Adaptive Backoff Strategies for Resilient Systems" helps well-behaved clients recover gracefully without overwhelming the system again.
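With a token bucket, a reasonable Retry-After value is the time until one whole token has been refilled. The sketch below assumes the limiter can report the current fractional token count, which would mean returning it from the Lua script alongside the allow/deny flag. The function names and response shape are illustrative and not tied to any framework.

```javascript
// Whole seconds until at least one token is available.
function retryAfterSeconds(tokens, refillPerSecond) {
  if (tokens >= 1) return 0;
  return Math.ceil((1 - tokens) / refillPerSecond);
}

// Framework-agnostic description of the rejection response.
function rateLimitedResponse(tokens, refillPerSecond) {
  return {
    status: 429,
    headers: { "Retry-After": String(retryAfterSeconds(tokens, refillPerSecond)) },
    body: { error: "rate_limited" },
  };
}

// 0.25 tokens left, refilling at 0.5 tokens/sec: 0.75 tokens short, so 1.5s, rounded up.
const response = rateLimitedResponse(0.25, 0.5);
console.log(response.headers["Retry-After"]); // "2"
```

Rounding up matters here: Retry-After takes whole seconds, and rounding down would invite clients back before a token exists.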

Also, remember that not all requests are created equal. A GET /health is vastly cheaper than a POST /reports/generate. We currently use different buckets for different route categories. It’s more complex to maintain, but it prevents a heavy report generation request from blocking a simple metadata fetch.
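One way to express route categories is a small ordered table that maps a request to a category, which then becomes part of the bucket key. The category names, matching rules, and numbers below are made up for illustration; only the two example routes come from the text above.

```javascript
// Hypothetical route categories, checked in order; the last entry is the default.
const ROUTE_CATEGORIES = [
  {
    name: "heavy",
    match: (method, path) => method === "POST" && path.startsWith("/reports/"),
    max: 5,    // small burst
    rate: 0.1, // one token every 10 seconds
  },
  {
    name: "light",
    match: () => true,
    max: 100,
    rate: 20,
  },
];

// Each tenant gets a separate bucket per category.
function bucketFor(tenantId, method, path) {
  const category = ROUTE_CATEGORIES.find((c) => c.match(method, path));
  return {
    key: `ratelimit:${tenantId}:${category.name}`,
    config: { max: category.max, rate: category.rate },
  };
}

console.log(bucketFor("acme", "POST", "/reports/generate").key); // ratelimit:acme:heavy
console.log(bucketFor("acme", "GET", "/health").key);            // ratelimit:acme:light
```

Because the keys differ, draining the heavy bucket has no effect on the tenant's light bucket.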

What I’d Do Differently

Looking back, we spent too much time trying to build a "one size fits all" configuration. In reality, some tenants have higher tiers and deserve higher burst capacity. We ended up hardcoding limits in our middleware, which meant a redeploy every time we wanted to change a tier.

Next time, I'd move the configuration into a dynamic store like etcd or even a cached database table. If you're currently struggling with spikes, start with a simple token bucket implementation, but keep your configuration decoupled from your code. It’s the difference between a 5-minute fix and a 2-hour deployment cycle.
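Decoupled configuration might look something like this sketch: limits are loaded per tenant from an external source and cached for a short TTL, so a tier change takes effect without a redeploy. `loadConfig` is a hypothetical stand-in for an etcd or database lookup and is synchronous here for brevity; a real one would be async. The TTL and tier values are assumptions.

```javascript
// Hypothetical TTL cache in front of a dynamic configuration source.
function createTierConfigCache({ loadConfig, ttlMs, defaults }) {
  const cache = new Map(); // tenantId -> { config, expiresAt }

  return function getConfig(tenantId, nowMs = Date.now()) {
    const hit = cache.get(tenantId);
    if (hit && hit.expiresAt > nowMs) return hit.config;

    // Cache miss or expired entry: reload, falling back to defaults.
    const config = loadConfig(tenantId) || defaults;
    cache.set(tenantId, { config, expiresAt: nowMs + ttlMs });
    return config;
  };
}

// Stand-in for the external store.
const tierStore = new Map([["acme", { max: 100, rate: 10 }]]);
let loads = 0;

const getConfig = createTierConfigCache({
  loadConfig: (tenantId) => { loads++; return tierStore.get(tenantId); },
  ttlMs: 30_000,
  defaults: { max: 20, rate: 2 },
});

const first = getConfig("acme", 0);            // loaded from the store
tierStore.set("acme", { max: 500, rate: 50 }); // tier upgrade, no redeploy
const cached = getConfig("acme", 10_000);      // still within TTL
const fresh = getConfig("acme", 31_000);       // TTL expired, reloaded

console.log(first.max, cached.max, fresh.max); // 100 100 500
```

The TTL bounds how long a tier change takes to propagate, which is the trade-off for not hitting the configuration store on every request.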

