HPA vs Rate-limit


Strange... we use HPA to increase availability, and now we are introducing a rate limit to reduce it?

Well, let's set the context...

INTRO

This story is not a true-or-false statement; it's a sort of analysis based on a few assumptions:

  • cloud environment
  • dynamic infrastructure
  • minimum resources available 

HPA

In Kubernetes, a HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload to match demand.
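
For context, the HPA's core rule (as documented by Kubernetes) is desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). A tiny Python sketch just to restate that arithmetic, with made-up numbers:

import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    # Kubernetes HPA rule: desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
    return math.ceil(current_replicas * current_metric / target_metric)

# Hypothetical example: 2 pods, each observing 40 rps against a 24 rps target -> scale out to 4 pods
print(desired_replicas(2, 40, 24))  # 4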

Patterns

Meaning

Type                 | Behaviour
Slow and temporary   | it might have daily fluctuations in request volume, peaking during the day and troughing at night
Rapid and temporary  | it might be subject to short bursts of high request volume from poorly-behaved downstream services
Slow and persistent  | it might see its request volume slowly increase over time, as the product sees greater adoption
Rapid and persistent | it might see an abrupt shift from low to high volume, such as when it's called by batch jobs

 

Ideal Practice

Type                 | Behaviour
Slow and temporary   | The HPA should add and remove pods as necessary
Rapid and temporary  | The HPA should not modify the pod count; instead, the service should leave enough headroom to deal with these brief spikes with only existing pods
Slow and persistent  | The HPA should add and remove pods as necessary
Rapid and persistent | The service should leave enough headroom to deal with the rapid change, and the HPA should add pods soon after to bring the service back to target utilization

Rate Limit

A rate limit is the number of API calls an app or user can make within a given time period. If this limit is exceeded or if CPU or total time limits are exceeded, the app or user may be throttled. API requests made by a throttled user or app will fail. All API requests are subject to rate limits.

However, keep the HPA traffic patterns above as a reference for rate limiting as well.
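
To make the mechanism concrete, here is a minimal token-bucket sketch in Python; the class name and the numbers are purely illustrative, not the limiter used later in the tests:

import time

class TokenBucket:
    # Naive token bucket: allows roughly `rate` requests per second with a burst of `capacity`
    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to the elapsed time, capped at the bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # the caller should reply with HTTP 429 / throttle

bucket = TokenBucket(rate=27, capacity=27)  # e.g. ~27 requests per second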

 

GOALS

  • Understand how close we can get to application exhaustion during an autoscaling event when using a rate limit
  • Understand how to handle unexpected traffic with a rate limit

 

Limits

Sometimes we can have a spike of requests that is perfectly legit... for example, consider the Google crawlers (image from... somewhere on Google Images).

 

Hands-on

Simulation

  • Python app that generates the Fibonacci sequence (a minimal sketch follows right after this list)
  • Fibonacci number 18500
  • load test done using Gatling
  • application capped on CPU (each request costs ~10m CPU) with a max of 300m CPU per pod (CPU bound), which suggests a ceiling of roughly 300/10 = 30 requests per second per pod
  • HPA based on the KEDA metric flask_http_request_duration_seconds_count
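
The real application code is not shown here, but a minimal sketch of such an endpoint could look like the following; I'm assuming Flask plus prometheus-flask-exporter (which exposes the flask_http_request_duration_seconds histogram that the KEDA trigger reads), so treat it as an approximation of the setup, not the actual code:

from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics  # exposes flask_http_request_duration_seconds*

app = Flask(__name__)
metrics = PrometheusMetrics(app)  # default HTTP metrics served on /metrics

def fib(n: int) -> int:
    # Iterative Fibonacci: deliberately CPU-bound for large n (e.g. 18500), no caching
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

@app.route("/api/fib/<int:n>")
def fib_endpoint(n: int):
    return str(fib(n))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)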

 

Results Example 

$ curl http://xxx/api/fib/18500
8353329688443562486779853158514... etc etc

This proves that it is not just an echo response and that there is real work going on behind the scenes.

 

Note

Gatling uses an interesting concept that it calls "active users":

“Active users” is neither “concurrent users” or “users arrival rate”. It’s a kind of mixed metric that serves for both open and closed workload models and that represents “users who were active on the system under load at a given second”.

It’s computed as:

 (number of alive users at previous second)
+ (number of users that were started during this second)
- (number of users that were terminated during previous second)
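
Or, as a throwaway helper (purely restating the formula above, nothing Gatling-specific):

def active_users(alive_prev: int, started_this_second: int, terminated_prev: int) -> int:
    # active(t) = alive(t-1) + started(t) - terminated(t-1)
    return alive_prev + started_this_second - terminated_prev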

NO AUTOSCALING

 

First run 40rps max

rampUsers(10).during(60.seconds), // ramp up 10 users over 60 sec
constantUsersPerSec(10).during(60.seconds), // inject 10 users/sec for 60 sec
rampUsersPerSec(10).to(40).during(5.minutes), // ramp the injection rate from 10 to 40 users/sec over 5 min

 

As soon as we reach the limit, the pod starts to become unresponsive.


No metrics after 33 rps, pod crashed


No metrics after >28% CPU (consider ~30% as the max, since the pod limit is 300m CPU)

Failure rate (low) when rps > 35, and high response time


Active users increase dramatically after 33 rps

Second run 33rps max

rampUsers(10).during(60.seconds), // ramp up 10 users over 60 sec
constantUsersPerSec(10).during(60.seconds), // inject 10 users/sec for 60 sec
rampUsersPerSec(10).to(30).during(4.minutes), // ramp the injection rate from 10 to 30 users/sec over 4 min

Much better; we can say that the MAX rps is 33... at least as a one-shot value in a single run.

 

AUTOSCALING

rampUsers(10).during(60.seconds), // ramp up 10 users over 60 sec
constantUsersPerSec(10).during(60.seconds), // inject 10 users/sec for 60 sec
rampUsersPerSec(10).to(99).during(15.minutes), // ramp the injection rate from 10 to 99 users/sec over 15 min

 

1 pod start - hpa 33rps - 15 min - 99 max requests



DEAD !!! 

Test stopped as per ...

================================================================================
2023-02-09 10:01:38                                         885s elapsed
---- Requests ------------------------------------------------------------------
> Global                                                   (OK=17944  KO=18236 )
> request_1                                                (OK=17944  KO=18236 )
---- Errors --------------------------------------------------------------------
> status.find.in(200,201,202,203,204,205,206,207,208,209,304), but actually found 503    8358 (45.83%)
> status.find.in(200,201,202,203,204,205,206,207,208,209,304), but actually found 502    5529 (30.32%)
> status.find.in(200,201,202,203,204,205,206,207,208,209,304), but actually found 504    4273 (23.43%)
> i.g.h.c.i.RequestTimeoutException: Request timeout to oracolo.     73 ( 0.40%)
> j.i.IOException: Premature close                                    3 ( 0.02%)
---- BasicSimulation -----------------------------------------------------------
[#####################################################--                   ] 72%
waiting: 12549  / active: 932    / done: 36180
================================================================================

 

2 pod start - hpa 33rps - 15 min - 99 max requests

The autoscaler is not able to support the traffic shown in the picture above; the curve stresses the namespace in a way that makes scaling and reliability impossible. This may be due to starting with a single pod and the ratio of requests it has to absorb, so let's try starting with 2 pods; that way each one should only have to absorb half of the requests.






Naaaa... still not able to scale 



1 pod start - hpa 30rps - 15 min - 99 max requests






OK, we are able to scale, but at the end we have a complete crash.


1 pod start - hpa 27rps - 15 min - 99 max requests






Oh no... worse than before with fewer rps... since the app is over-stressed, once it starts to crash it can show weird behaviour.

 

 [optimal] 1 pod start - hpa 24rps - 15 min - 99 max requests






Here we are !!!

 

1 pod start - hpa 25rps - 15 min - 99 max requests

Let's try with one rps more.






Naaaa .... no more than 24 rps

 

AUTOSCALING Internal rate limit

Here is how it works: the rate limit is enforced inside the application container itself.
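
A minimal, hypothetical sketch of such an in-app limiter (a fixed-window cousin of the token bucket sketched earlier, wired into Flask via a before_request hook; the 27 rps value and the names are illustrative, not necessarily the real configuration):

import threading
import time

from flask import Flask, jsonify

app = Flask(__name__)

RATE = 27                      # requests per second this pod will accept (example value)
_lock = threading.Lock()
_window = int(time.time())     # current one-second window
_count = 0                     # requests counted in the current window

@app.before_request
def rate_limit():
    global _window, _count
    now = int(time.time())
    with _lock:
        if now != _window:     # a new second started: reset the counter
            _window, _count = now, 0
        _count += 1
        if _count > RATE:      # over budget: short-circuit the request with a 429
            return jsonify(error="rate limit exceeded"), 429

# ... the /api/fib/<n> route from the earlier sketch would be registered below as usual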

[optimal] 1 pod start - hpa 25rps - 15 min - 99 max requests rate-limit 27

I assume that the previous optimum (24 rps) also works well with the rate limit, so let's start with 25.





Better than the same rps with no rate limit.

The errors are just the 429s triggered by the rate limit... so the good answers are fine, there is no issue or restart on the application during the autoscaling... just a bit of queueing.

Not so bad, considering the errors in red are just rate-limited calls.

 

1 pod start - hpa 26rps - 15 min - 99 max requests rate-limit 28





Just one error, caused by an unresponsive pod.

 

1 pod start - hpa 27rps - 15 min - 99 max requests rate-limit 29






Broken!!!

 

AUTOSCALING envoy rate limit

1 pod start - hpa 27rps - 15 min - 99 max requests rate-limit 29 - envoy






Broken!!!

 

[optimal] 1 pod start - hpa 26rps - 15 min - 99 max requests rate-limit 29 - envoy





Better than expected compared to the solution with the rate limit embedded in the application.

 

Unexpected traffic

rampUsers(30).during(60.seconds),                        // ramp up 30 users over 60 sec
constantUsersPerSec(30).during(60.seconds).randomized,   // 30 users/sec for 60 sec (randomized arrivals)
rampUsersPerSec(30).to(82).during(4.minutes),            // ramp the rate from 30 to 82 users/sec over 4 min
constantUsersPerSec(82).during(120.seconds).randomized,  // 82 users/sec baseline
rampUsersPerSec(82).to(170).during(10.seconds),          // 10-sec spike towards 170 users/sec
constantUsersPerSec(82).during(120.seconds).randomized,  // back to the 82 users/sec baseline
rampUsersPerSec(82).to(130).during(20.seconds),          // 20-sec spike towards 130 users/sec
constantUsersPerSec(82).during(120.seconds).randomized,
rampUsersPerSec(82).to(210).during(30.seconds),          // 30-sec spike towards 210 users/sec
constantUsersPerSec(82).during(120.seconds).randomized,
constantUsersPerSec(333).during(2.seconds),              // 2-sec burst at 333 users/sec
constantUsersPerSec(82).during(120.seconds).randomized,
constantUsersPerSec(333).during(10.seconds),             // 10-sec burst at 333 users/sec
constantUsersPerSec(82).during(80.seconds)               // final 82 users/sec stretch

 

3 pod start - hpa 26rps - 15 min - 82 avg requests with spikes -  rate-limit 27 internal





Since the incoming requests are absorbed by the app itself, we hit its limit.
The errors are not only 429s from the rate limit but also 5xx.

Broken!!!

 

[optimal] 3 pod start - hpa 26rps - 15 min - 82 avg requests with spikes -  rate-limit 27 - envoy





No issue at all; since the rate limiting is performed by Envoy, the Python app sees no "spikes".

 

 

Conclusions

HPA

This application is CPU bound... so I used req/s even though the right autoscaling metric would be CPU usage; still, I would say that every application's purpose is to serve requests.
Creating a dedicated HPA configuration, application by application, is the best option, but it requires HUGE know-how and time...
Considering the sentence above, I was impressed by "active users", since it gives the right high-level clue about the status of the application; it comes really close to measuring the exhaustion of the app.

I would say that the concept of active users as an HPA metric is better than flask_http_request_duration_seconds_count.

Even if we can reach 33 rps on a single pod, we should consider that in an autoscaling scenario this effectively drops, because for a certain amount of time there are more requests than capacity while we are waiting for new pods/nodes, etc.

I would say it is better to have the application running at about 65% of its max capacity before it scales.

 

Rate Limit


  • Understand how close we can get to application exhaustion during an autoscaling event when using a rate limit

No... or better... we can raise the limit a bit to exploit the app as much as possible; however, it is just a risk for maybe 5% more rps.
The application becomes more sensitive, and the rate limit ends up really close to the HPA threshold.

 

  • Understand how to handle unexpected traffic with a rate limit

Yes... this is the right pattern for a rate limit.
It is better to enforce it outside the application container, since doing it inside unfortunately impacts the incoming requests, the saturation, etc., and you end up with a rate limit and an HPA that can fight each other.

 

Both

  • Do not force the HPA to collaborate with the rate limit... they should be 2 different parameters

In this scenario both are expressed in rps... but they are profiled differently, so do not treat them as the same metric or give them the same meaning.

 

Costs

 

In some way, trying to save money by pushing towards the maximum possible rps is risky; we should leverage Kubernetes pod scheduling rather than try to squeeze 100% out of the application.

Highlights

Unexpected traffic can be generated by a wrong deployment/batch (5% of cases) or by external calls (95%).

The external calls, coming from public or private endpoints, can be managed by:

  • WAF rate limit
  • API gateway rate limit

Those two solutions are better, because they keep the infrastructure safe as the first point of contact, but... a big BUT.

In a dynamic infrastructure where we have autoscaling and downscaling, a legit peak like a Google crawl cannot be handled, because the rate limit sits at the service level,
AND...
during the night we are supposed to have the minimum number of pods available, as there is less user traffic, for example.

 

So... I'll remark on the drawback of cost savings following the matrix based on the HPA patterns (cit. Brewster's Millions).
The closer we get to using as few resources as possible, the more we increase corner cases and unexpected situations.

Be careful when evaluating the rate limit concept.

Cheers :-)