When Horizontal Scaling Hits Rate Limits

More instances can make 429s worse when upstream quotas are global. How we reframed it as a coordination problem — queue, global limiter, and throughput shaping instead of scaling alone.

March 23, 2026

Last week, I was working on an API-heavy service that sat between our frontend and a mix of third-party providers plus internal microservices. On paper, everything looked fine — latencies were acceptable, dependencies were stable, and our infra could scale horizontally.

Then traffic ramped up.

Not dramatically. Just enough to expose something we had not truly designed for. Suddenly, we started seeing intermittent 429 Too Many Requests errors across multiple integrations. What made it worse was that our system was not actually overloaded. CPU was fine. Memory was fine. But throughput plateaued, and error rates climbed.

This was not a scaling problem in the traditional sense. It was a coordination problem.

Two competing forces

At a high level, we had two competing forces:

  • We wanted to maximize throughput (requests per second).
  • Our dependencies enforced strict rate limits (per minute).

The naive approach we initially took was simple: if we need more throughput, just scale horizontally.

That brute force worked for a while, but I knew it was not the right approach.

Each new instance increased the total outgoing request rate, and suddenly we were exceeding upstream limits faster. More servers did not help — they made it worse.

Constraints we were dealing with:

  • Third-party APIs with per-minute limits
  • No centralized coordination between instances
  • Bursty traffic patterns, especially from retries and batch jobs
  • Strict latency expectations from the frontend

Failure modes we observed:

  • Cascading retries amplifying traffic spikes
  • Thundering herd on recovery after brief outages
  • Uneven distribution of requests across instances
  • Silent throughput caps despite "healthy" infrastructure

Initial thinking

My first instinct was to handle errors better: add exponential backoff, retry on 429s, cache more aggressively.
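A minimal sketch of that first instinct, as a standalone delay function (the name `backoff_delay` and its parameters are mine, not from our actual codebase): exponential backoff with full jitter, deferring to an explicit `Retry-After` header when the provider sends one.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, retry_after=None):
    """Seconds to sleep before retry `attempt` (0-based).

    Honors an explicit Retry-After value if the provider supplied one;
    otherwise uses exponential backoff with full jitter, capped at `cap`.
    """
    if retry_after is not None:
        return float(retry_after)
    # Full jitter: pick uniformly in [0, min(cap, base * 2^attempt)]
    # so retries from many clients do not synchronize into a burst.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

The jitter matters as much as the exponent: without it, every client that failed at the same moment retries at the same moment too.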

These helped, but they did not solve the core issue: we were reacting to rate limits instead of respecting them upfront.

Another idea was per-instance rate limiting. Each service instance would enforce its own cap.

That failed too.

Rate limits were global (for example, 100 requests per second across all instances). Each instance had no awareness of others. We ended up overshooting limits anyway.

Why coordination matters

Simplify the system for a moment:

Clients → Our service (N instances) → External API (rate-limited)

Each instance independently sends requests downstream.

If the external API allows 100 req/s and we have 10 instances, then each instance must average 10 req/s — in theory.

Reality is messier:

  • Traffic is not evenly distributed
  • Instances scale dynamically
  • Retries skew the numbers

The real issue is lack of coordination. It is like multiple workers pulling from the same shared quota without talking to each other.

A useful mental model is a token bucket: tokens represent allowed requests, tokens refill at a fixed rate, and each request consumes a token. Without a shared bucket, every instance behaves as if it has the full quota.
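That mental model fits in a few lines. This is a single-process sketch (class name and interface are mine): tokens refill continuously at `rate` per second, bursts are bounded by `capacity`, and a request only proceeds if it can take a token.

```python
import time


class TokenBucket:
    """Single-process token bucket: refills at `rate` tokens/second,
    allows bursts up to `capacity` tokens."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n=1):
        now = time.monotonic()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

The catch is exactly the one above: if each of 10 instances runs its own bucket sized to the full quota, the fleet behaves as if it has 10x the allowance.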

Where throughput breaks

Throughput collapses in a predictable loop:

  • We exceed rate limits and get 429s
  • We retry, increasing load
  • Bursts trigger stricter limiting
  • Latency rises as queues build
  • The system becomes unstable

So paradoxically: trying to maximize throughput without coordination reduces actual throughput.

Solution: central limiter, queue, backpressure

We moved to a centralized rate limiting model with controlled concurrency.

Ideas that mattered:

  • Global rate limit enforcement
  • Queueing instead of immediate execution
  • Backpressure instead of piling on retries
  • Throughput shaping, not just scaling out

Architecture (conceptual):

Clients → Queue → Global rate limiter → Workers → External API (rate-limited)

What changed:

  • Requests are queued instead of executed immediately on arrival.
  • A shared rate limiter controls dispatch into the downstream calls.
  • Workers pull work only when allowed.
  • Retries drop because we avoid hitting limits in the first place.

This pattern trades a bit of latency for predictability. For our case, that was the right trade: the frontend could still meet its strict SLAs as long as we sized the queue and worker pool honestly, and we stopped treating global upstream quotas as something each instance could ignore.

If your bottleneck is not CPU or memory but shared external capacity, scaling out without a single place that owns the budget will keep rediscovering the same 429s.

That should keep us under the limit, and within budget.

(In the meantime, I'll be quietly negotiating with my wallet for a higher vendor quota.)