Insights Experiences / HMS-1782

AWS client does not keep up with job queue concurrency

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Component: Provisioning
    • Sprints: EnVision Sprint 28, EnVision Sprint 29, EnVision Sprint 30, EnVision Sprint 31

      We were aware of the AWS API throttling but we did not have time to test it. Until now.

      rhn-engineering-pablomh conducted performance testing of our Launch service and it appears that AWS API throttling is causing serious problems in the job queue. Reservations time out and not all reservations are processed. We currently see the following bugs or issues:

      • AWS returned nil for the public IP address and we dereference it without a check.
      • Context cancellations in the EC2, Redis, and Postgres clients.

      What likely happens: our job queue is supposed to cap in-flight jobs at the number of CPUs + 1 per pod, with a maximum of 64 (a configurable value). Since we are running three pods, the total number of goroutines executing jobs was meant to be at most 3 * 64 = 192, but monitoring revealed almost 350 in-flight jobs during the test. The capping code does not work: instead of hard-capping the number of goroutines, it only sets the Redis connection pool size.
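      For illustration, a hard cap can be enforced with a counting semaphore around the worker loop instead of relying on the Redis connection pool size. This is only a minimal sketch; maxInFlight, dequeue and process are hypothetical names, not our actual queue code.

```go
package worker

import "context"

// Hard ceiling of concurrently executing jobs per pod. Hypothetical constant;
// in the real service this would come from configuration (CPUs+1, capped at 64
// today, or the proposed 33).
const maxInFlight = 33

// Job stands in for the real job type.
type Job struct{ ID string }

// runWorkers dequeues jobs and runs each in its own goroutine, but never more
// than maxInFlight at a time: acquiring a semaphore slot blocks once the cap
// is reached, so the goroutine count cannot grow with queue depth.
func runWorkers(ctx context.Context, dequeue func(context.Context) (Job, error), process func(context.Context, Job)) {
	sem := make(chan struct{}, maxInFlight)
	for {
		job, err := dequeue(ctx)
		if err != nil {
			return // queue closed or context cancelled
		}
		select {
		case sem <- struct{}{}: // acquire a slot (blocks at the cap)
		case <-ctx.Done():
			return
		}
		go func(j Job) {
			defer func() { <-sem }() // release the slot when the job finishes
			process(ctx, j)
		}(job)
	}
}
```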

      The AWS client has a built-in backoff algorithm that waits longer and longer between retried requests, which eventually pushes jobs past the maximum job duration (30 minutes by default), at which point they are automatically cancelled.
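      Assuming the EC2 client is built with the AWS SDK for Go v2 (an assumption on my part), the retry behaviour can be bounded when the client is constructed so that a single throttled call cannot back off all the way to the 30-minute job deadline. A sketch with illustrative values, not a recommendation:

```go
package clients

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// newEC2Client returns an EC2 client whose standard retryer is capped both in
// attempts and in per-attempt backoff, so throttled calls fail fast (and can
// be re-queued) instead of silently consuming the job's time budget.
func newEC2Client(ctx context.Context) (*ec2.Client, error) {
	cfg, err := config.LoadDefaultConfig(ctx,
		config.WithRetryer(func() aws.Retryer {
			return retry.NewStandard(func(o *retry.StandardOptions) {
				o.MaxAttempts = 5               // illustrative cap on retries
				o.MaxBackoff = 30 * time.Second // cap the exponential backoff
			})
		}),
	)
	if err != nil {
		return nil, err
	}
	return ec2.NewFromConfig(cfg), nil
}
```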

      A fix for this should mitigate the throttling problem at the cost of slower reservation times (jobs will have to wait in the queue longer). The likely offender is RunInstances, which AWS throttles at a refill rate of 2 calls per second once the initial bucket of 1000 requests is exhausted.

      https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html
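      One way to stay under that documented rate on the client side (a sketch only, and per pod unless the limiter is somehow shared) is a token bucket that mirrors the RunInstances bucket, e.g. with golang.org/x/time/rate:

```go
package clients

import (
	"context"

	"golang.org/x/time/rate"
)

// Mirrors the documented RunInstances throttle: a burst bucket of 1000
// requests refilled at 2 requests per second. Note this limiter is per
// process/pod; with three pods the effective call rate is three times higher
// unless the limit is shared, which is the "global state" problem mentioned
// below.
var runInstancesLimiter = rate.NewLimiter(rate.Limit(2), 1000)

// withRunInstancesBudget is a hypothetical wrapper: it blocks until a token is
// available (or the context is cancelled), so throttling turns into local
// queueing instead of API errors and long SDK backoffs.
func withRunInstancesBudget(ctx context.Context, call func(context.Context) error) error {
	if err := runInstancesLimiter.Wait(ctx); err != nil {
		return err
	}
	return call(ctx)
}
```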

      So next steps:

      As part of this issue, let’s fix the in-flight ceiling properly and set it to some reasonable value, e.g. 33 goroutines per pod = 99 concurrent workers across three pods. They can still empty the 1000-request bucket in about 100 seconds if I do the math right, but then the backoff algorithm of the EC2 client will kick in. Some workers will still starve (and could be cancelled).

      The better solution (not part of this patch) will be to redesign how we deploy the application. Instead of three generic worker pods, we might need three different configurations: one pod running AWS jobs, one running Azure jobs, and one for GCP. The total number of pods stays the same, but we can then scale API requests per provider without any sort of global state. The solution can be complex (doing the math per provider) or simple (allow X concurrent jobs and, when the client hits throttling, stop accepting new jobs until the in-flight ones are dispatched so no worker starves). Let’s talk about this at the tech discussion.

      Also, I will file a separate issue for the nil public IP address dereference; that is an unrelated problem.

              Reporter/Assignee: Lukáš Zapletal (rhn-engineering-lzapletal)