Insights Experiences / HMS-1782

AWS client does not keep up with job queue concurrency

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Component: Provisioning
    • Sprints: EnVision Sprint 28, EnVision Sprint 29, EnVision Sprint 30, EnVision Sprint 31

      We were aware of the AWS API throttling but we did not have time to test it. Until now.

      rhn-engineering-pablomh conducted performance testing of our Launch service and it appears that AWS API throttling is causing serious problems in the job queue. Reservations time out and not all reservations are processed. We currently see the following bugs or issues:

      • AWS returned nil for the public IP address and we dereference it without a check.
      • Context cancellations in the EC2, Redis, and Postgres clients.

      What likely happens: our job queue is supposed to cap in-flight jobs at the number of CPUs + 1 per pod, with a maximum of 64 (a configurable value). Since we are running three pods, the total number of goroutines executing jobs was meant to be at most 3 * 64 = 192, but monitoring revealed almost 350 in-flight jobs during the test. The capping code does not work: instead of hard-capping the number of goroutines, it only sets the Redis connection pool size.
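      For illustration, a hard cap can be enforced with a counting semaphore around the worker loop instead of relying on the Redis connection pool size. This is only a minimal sketch; maxInFlight, dequeue and process are hypothetical names, not our actual queue code.

```go
package worker

import "context"

// Hard ceiling of concurrently executing jobs per pod. Hypothetical constant;
// in the real service this would come from configuration (CPUs+1, capped at 64
// today, or the proposed 33).
const maxInFlight = 33

// Job stands in for the real job type.
type Job struct{ ID string }

// runWorkers dequeues jobs and runs each in its own goroutine, but never more
// than maxInFlight at a time: acquiring a semaphore slot blocks once the cap
// is reached, so the goroutine count cannot grow with queue depth.
func runWorkers(ctx context.Context, dequeue func(context.Context) (Job, error), process func(context.Context, Job)) {
	sem := make(chan struct{}, maxInFlight)
	for {
		job, err := dequeue(ctx)
		if err != nil {
			return // queue closed or context cancelled
		}
		select {
		case sem <- struct{}{}: // acquire a slot (blocks at the cap)
		case <-ctx.Done():
			return
		}
		go func(j Job) {
			defer func() { <-sem }() // release the slot when the job finishes
			process(ctx, j)
		}(job)
	}
}
```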

      The AWS client has a built-in backoff algorithm that waits longer and longer between retried requests, which eventually pushes jobs past the maximum job duration (30 minutes by default), at which point they are automatically cancelled.
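      Assuming the EC2 client is built with the AWS SDK for Go v2 (an assumption on my part), the retry behaviour can be bounded when the client is constructed so that a single throttled call cannot back off all the way to the 30-minute job deadline. A sketch with illustrative values, not a recommendation:

```go
package clients

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// newEC2Client returns an EC2 client whose standard retryer is capped both in
// attempts and in per-attempt backoff, so throttled calls fail fast (and can
// be re-queued) instead of silently consuming the job's time budget.
func newEC2Client(ctx context.Context) (*ec2.Client, error) {
	cfg, err := config.LoadDefaultConfig(ctx,
		config.WithRetryer(func() aws.Retryer {
			return retry.NewStandard(func(o *retry.StandardOptions) {
				o.MaxAttempts = 5               // illustrative cap on retries
				o.MaxBackoff = 30 * time.Second // cap the exponential backoff
			})
		}),
	)
	if err != nil {
		return nil, err
	}
	return ec2.NewFromConfig(cfg), nil
}
```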

      A fix for this should mitigate the throttling problem at the cost of slower reservation times (jobs will have to wait in the queue longer). The likely offender is RunInstances, which AWS throttles at a refill rate of 2 calls per second once the initial bucket of 1000 requests is exhausted.

      https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html
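      One way to stay under that documented rate on the client side (a sketch only, and per pod unless the limiter is somehow shared) is a token bucket that mirrors the RunInstances bucket, e.g. with golang.org/x/time/rate:

```go
package clients

import (
	"context"

	"golang.org/x/time/rate"
)

// Mirrors the documented RunInstances throttle: a burst bucket of 1000
// requests refilled at 2 requests per second. Note this limiter is per
// process/pod; with three pods the effective call rate is three times higher
// unless the limit is shared, which is the "global state" problem mentioned
// below.
var runInstancesLimiter = rate.NewLimiter(rate.Limit(2), 1000)

// withRunInstancesBudget is a hypothetical wrapper: it blocks until a token is
// available (or the context is cancelled), so throttling turns into local
// queueing instead of API errors and long SDK backoffs.
func withRunInstancesBudget(ctx context.Context, call func(context.Context) error) error {
	if err := runInstancesLimiter.Wait(ctx); err != nil {
		return err
	}
	return call(ctx)
}
```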

      So next steps:

      As part of this issue, let’s fix the in-flight ceiling properly and set it to some reasonable value, e.g. 33 goroutines per pod = 99 concurrent workers across three pods. They can still empty the 1000-request bucket in about 100 seconds if I do the math right, but then the backoff algorithm of the EC2 client will kick in. Some workers will still starve (and could be cancelled).

      The better solution (not part of this patch) will be to redesign how we deploy the application. Instead of three generic worker pods, we might need three different configurations: one pod running AWS jobs, one running Azure jobs, and one for GCP. The total number of pods stays the same, but we can then scale API requests per provider without any sort of global state. The solution can be complex (doing the math per provider) or simple (allow X concurrent jobs and, when the client hits throttling, stop accepting new jobs until the in-flight ones are dispatched so no worker starves). Let’s talk about this at the tech discussion.

      Also, I will file a separate issue for the nil public IP address dereference; that is an unrelated problem.

              Reporter/Assignee: Lukáš Zapletal (rhn-engineering-lzapletal)