HPA vs Rate-limit

INTRO

Strange… we are using HPA to increase availability and introducing rate limiting to reduce it?

Well, let’s set the context.

This analysis is based on specific assumptions:

  • Cloud environment
  • Dynamic infrastructure
  • Minimum resources available

HPA

In Kubernetes, a HorizontalPodAutoscaler automatically updates a workload resource (Deployment, StatefulSet) to match demand.
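
As a rough illustration of how the replica count follows demand, here is a minimal sketch of the ratio formula given in the Kubernetes HPA documentation. The function name and the example numbers are illustrative; the 0.1 tolerance mirrors the documented default, and min/max replica bounds and stabilization windows are deliberately left out.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Replica count suggested by the HPA ratio formula (simplified)."""
    ratio = current_metric / target_metric
    # When the metric is within tolerance of the target, no scaling happens.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical example: 2 pods averaging 40 rps each, target 24 rps per pod.
print(desired_replicas(2, 40.0, 24.0))  # prints 4
```

Because the result is rounded up, the HPA tends to overshoot slightly rather than undershoot, which matters for the headroom discussed below.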

Patterns

Type                  | Behaviour
Slow and temporary    | Daily fluctuations, peaking during the day and troughing at night
Rapid and temporary   | Short bursts from poorly-behaved downstream services
Slow and persistent   | Request volume slowly increases as the product sees adoption
Rapid and persistent  | Abrupt shift from low to high volumes, e.g. when called by batch jobs

Ideal Practice

Type                  | Ideal Practice
Slow and temporary    | HPA should add and remove pods as necessary
Rapid and temporary   | HPA should NOT modify pod count; leave headroom for brief spikes
Slow and persistent   | HPA should add and remove pods as necessary
Rapid and persistent  | Leave headroom; HPA adds pods quickly to restore target utilization

Rate Limit

A rate limit is the number of API calls an app or user can make within a given time period. If this limit is exceeded, or if CPU or time limits are exceeded, the app may be throttled. Throttled requests fail, typically with an HTTP 429 (Too Many Requests) response.
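
To make the definition concrete, here is a minimal token-bucket sketch. It is one common way to enforce a limit, not necessarily the algorithm used in the tests that follow. The class name, the `status_for` helper and the injectable `clock` parameter are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allow up to `rate_per_sec` requests per second, with bursts up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)   # start with a full bucket
        self.clock = clock           # injectable so the limiter can be tested
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def status_for(bucket: TokenBucket) -> int:
    """HTTP status a handler could return: 200 if admitted, 429 if throttled."""
    return 200 if bucket.allow() else 429
```

With `TokenBucket(27, 27)`, the 28th request inside the same second gets a 429 instead of reaching the CPU-bound work.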

Keep the HPA traffic patterns above as a reference when designing rate limiting as well.

GOALS

  • Understand how much we can optimize application performance during autoscaling using rate limiting
  • Understand how to handle unexpected traffic with rate limiting

Limits

Legitimate traffic spikes do occur. Search engine crawlers (Googlebot and others) generate significant traffic that shouldn’t trigger errors. This is the unexpected traffic case.

Hands-on

Simulation

Test setup:

  • Python app calculating Fibonacci sequences
  • Fibonacci number: 18500 (CPU-bound)
  • Load testing with Gatling
  • App CPU-capped: 10mcpu per request, 300mcpu max per pod
  • HPA trigger: KEDA, on the flask_http_request_duration_seconds_count metric
$ curl http://xxx/api/fib/18500
8353329688443562486779853158514... etc

Real computation behind each request — no mock responses.
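
The application source is not shown here, so the following is only a sketch of the kind of CPU-bound work each request performs. The function name and the indexing convention (fib(0) = 0, fib(1) = 1) are assumptions.

```python
def fib(n: int) -> int:
    """Return the n-th Fibonacci number, computed iteratively."""
    a, b = 0, 1
    for _ in range(n):
        # Python integers are arbitrary precision, so the additions get
        # progressively more expensive as the numbers grow to thousands of digits.
        a, b = b, a + b
    return a


# One request in the test corresponds to roughly this much work.
result = fib(18500)
print(result.bit_length())  # size of the result in bits
```

An iterative loop keeps memory flat and makes the cost per request predictable, which is what a load test needs.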

Note on Gatling’s “active users”:

(users alive at previous second) 
+ (users started during this second) 
- (users terminated during previous second)

This metric is more representative of application stress than simple concurrent requests.
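
The definition above can be sketched as a running count. The function and argument names are illustrative: each list holds one value per second of the test.

```python
def active_users(started, terminated):
    """Per-second active users, following the definition above.

    started[i]    = users started during second i
    terminated[i] = users terminated during second i
    """
    active = []
    prev_active = 0
    prev_terminated = 0
    for s, t in zip(started, terminated):
        current = prev_active + s - prev_terminated
        active.append(current)
        prev_active, prev_terminated = current, t
    return active


# 2 users start, then 3 more; 1 finishes in second 1 and 2 finish in second 2.
print(active_users([2, 3, 0], [0, 1, 2]))  # prints [2, 5, 4]
```

When responses slow down, users terminate later, so this count climbs even if the arrival rate stays flat. That is why it tracks stress so closely.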


NO AUTOSCALING

First run — 40rps max

rampUsers(10).during(60.seconds),
constantUsersPerSec(10).during(60.seconds),
rampUsersPerSec(10).to(40).during(5.minutes)

/images/hpa-vs-rate-limit/Screenshot-2023-02-16-at-20.28.38.png

Result: Pod becomes unresponsive at ~33 rps. CPU hit ~28% of the 300mcpu limit. Active users spike after 33 rps.

Second run — 33rps max

rampUsers(10).during(60.seconds),
constantUsersPerSec(10).during(60.seconds),
rampUsersPerSec(10).to(30).during(4.minutes)

Better stability. Maximum sustainable: 33 rps on a single pod.

/images/hpa-vs-rate-limit/Screenshot-2023-02-16-at-21.15.19.png


AUTOSCALING

rampUsers(10).during(60.seconds),
constantUsersPerSec(10).during(60.seconds),
rampUsersPerSec(10).to(99).during(15.minutes)
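
For a sense of scale, the profile above can be approximated with simple arithmetic. This is a sketch assuming a perfectly linear ramp; Gatling injects whole users, so real totals differ slightly. The helper names are illustrative.

```python
def constant_rate_users(rate: float, seconds: float) -> float:
    """Users injected at a constant arrival rate."""
    return rate * seconds


def ramp_rate_users(start: float, end: float, seconds: float) -> float:
    """Users injected while the arrival rate ramps linearly from start to end."""
    return (start + end) / 2 * seconds


def seconds_until_rate(start: float, end: float, seconds: float, rate: float) -> float:
    """Time into a linear ramp at which the arrival rate reaches `rate`."""
    return (rate - start) / (end - start) * seconds


total = (
    10                                    # rampUsers(10).during(60.seconds)
    + constant_rate_users(10, 60)         # constantUsersPerSec(10)
    + ramp_rate_users(10, 99, 15 * 60)    # rampUsersPerSec(10).to(99)
)
print(total)  # prints 49660.0

# When the ramp reaches the 33 rps single-pod ceiling measured earlier.
print(seconds_until_rate(10, 99, 15 * 60, 33))
```

The ramp reaches the single-pod ceiling after roughly four minutes, so most of the 15-minute test depends on the autoscaler keeping up.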

1 pod start - hpa 33rps - 15 min - 99 max requests

---- Errors --------------------------------------------------------------------
> found 503     8358 (45.83%)
> found 502     5529 (30.32%)
> found 504     4273 (23.43%)
> Request timeout  73 ( 0.40%)

/images/hpa-vs-rate-limit/Screenshot-2023-02-16-at-21.16.02.png

System unresponsive at 72% of test completion.

2 pod start - hpa 33rps - 15 min - 99 max requests

/images/hpa-vs-rate-limit/Screenshot-2023-02-16-at-21.17.54.png

Autoscaler cannot support traffic even with 2 pods. Scaling curve stresses the namespace unpredictably.

1 pod start - hpa 30rps - 15 min - 99 max requests

/images/hpa-vs-rate-limit/Screenshot-2023-02-16-at-21.19.22.png

Scaling works but system crashes at end.

1 pod start - hpa 27rps - 15 min - 99 max requests

/images/hpa-vs-rate-limit/Screenshot-2023-02-16-at-21.20.38.png

Worse than 30rps. Over-stress causes unpredictable failures.

[optimal] 1 pod start - hpa 24rps - 15 min - 99 max requests

/images/hpa-vs-rate-limit/Screenshot-2023-02-16-at-21.24.14.png

Successful test. The safe autoscaling threshold is about 73% of maximum single-pod capacity (24/33).
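
A back-of-the-envelope sketch of what this threshold means for pod count at the 99 rps peak. It assumes load is spread evenly across pods and ignores scale-up delay; the function name is illustrative.

```python
import math


def pods_needed(peak_rps: float, per_pod_target_rps: float) -> int:
    """Pods required to keep every pod at or below its target rate."""
    return math.ceil(peak_rps / per_pod_target_rps)


print(pods_needed(99, 24))  # prints 5: scaling at the 24 rps threshold
print(pods_needed(99, 33))  # prints 3: running every pod at its 33 rps ceiling
```

The two extra pods are the price of headroom: they are what keeps existing pods responsive while new ones start.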

1 pod start - hpa 25rps - 15 min - 99 max requests

One rps above optimal. System becomes unstable.


AUTOSCALING + Internal rate limit

See: Application Rate Limiting

[optimal] 1 pod start - hpa 25rps - 15 min - 99 max requests + rate-limit 27

/images/hpa-vs-rate-limit/Screenshot-2023-02-16-at-21.30.24.png

Starting at 25rps (above the 24rps optimal without rate limiting). Errors are 429s, not 5xxs. App doesn’t crash. Slight queue buildup but stable.

1 pod start - hpa 26rps - 15 min - 99 max requests + rate-limit 28

/images/hpa-vs-rate-limit/Screenshot-2023-02-16-at-21.41.18.png

One pod restart observed. Near the edge.

1 pod start - hpa 27rps - 15 min - 99 max requests + rate-limit 29

Fails catastrophically.


AUTOSCALING + Envoy rate limit

1 pod start - hpa 27rps - 15 min - 99 max requests + rate-limit 29 - envoy

Fails.

[optimal] 1 pod start - hpa 26rps - 15 min - 99 max requests + rate-limit 29 - envoy

/images/hpa-vs-rate-limit/Screenshot-2023-02-16-at-21.44.45.png

Better than application-embedded rate limiting at the same rps. Envoy manages traffic externally, preventing internal saturation.


Unexpected traffic

Simulating crawler spikes mid-test:

rampUsers(30).during(60.seconds),
constantUsersPerSec(30).during(60.seconds).randomized,
rampUsersPerSec(30).to(82).during(4.minutes),
constantUsersPerSec(82).during(120.seconds).randomized,
rampUsersPerSec(82).to(170).during(10.seconds),  // spike
constantUsersPerSec(82).during(120.seconds).randomized,
rampUsersPerSec(82).to(130).during(20.seconds),
constantUsersPerSec(333).during(2.seconds),       // spike
constantUsersPerSec(333).during(10.seconds),      // sustained spike

/images/hpa-vs-rate-limit/logs-crawl-spikes-three-c-sm-border.png

3 pod start - hpa 26rps - rate-limit 27 internal

/images/hpa-vs-rate-limit/Screenshot-2023-02-16-at-22.08.07.png

Mix of 429s and 5xxs. System unstable during spikes.

[optimal] 3 pod start - hpa 26rps - rate-limit 27 - envoy

/images/hpa-vs-rate-limit/Screenshot-2023-02-16-at-22.30.45.png

No issues. Envoy absorbs the spikes externally. Python app never sees the overload.


Conclusions

HPA

While a single pod can sustain 33 rps in isolation, autoscaling scenarios reduce this threshold. Applications should trigger scale events at ~73% of maximum single-pod capacity (24 of 33 rps in these tests) rather than running pods up to their limit.

The Gatling “active users” metric is more representative of application stress than the traditional flask_http_request_duration_seconds_count request counter.

Rate Limit

Can rate limiting increase capacity during autoscaling? No — or minimally. It allows ~5% more rps but increases sensitivity. Rate limiting and HPA fight each other if not carefully tuned.

Can rate limiting handle unexpected traffic? Yes. This is the ideal use case. External rate limiting (Envoy, API Gateway, WAF) outperforms application-embedded rate limiting because it manages traffic before it enters the application’s resource pool.

Both

  • Do not use the same metric to drive both HPA and rate limiting
  • Both operate on rps thresholds but with different profiling approaches
  • Treat them as complementary, not competing

Costs

Unexpected traffic sources:

  • ~5%: Internal deployments, batch jobs
  • ~95%: External calls via public/private endpoints → WAF or API Gateway rate limiting

As resource optimization increases, corner cases multiply. Evaluate rate limiting concepts carefully against cost-saving goals. A system at 100% capacity has no margin for legitimate traffic spikes.