Ansible Automation Platform RFEs / AAPRFE-1865

Add ability to scale down resources that are scaled up on self-hosted AAP on ARO


    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Versions: 2.4, 2.5
    • Component: controller

      Business Impact:
      This initiative is critical to the success and satisfaction of our strategic partner, Kyndryl. The platform's current inability to autoscale down is causing them to incur significant and unnecessary cloud costs, is creating a poor experience for their end users, and is now a direct impediment to their growth. Resolving this is essential to strengthening our partnership and ensuring their success on our platform.

      Business Requirements:
      • Reduce Operational Costs: Kyndryl's cloud spending is unnecessarily high because they are paying for peak capacity 24/7. They need to align costs directly with real-time usage, a core value proposition they feel they are "missing out on."
      • Eliminate End-User Frustration: To manage costs, Kyndryl is forced to limit their infrastructure, which causes job backlogs and delays for their clients. This is leading to what they have reported as "significant end user frustration."
      • Support Business Growth: The problem worsens as Kyndryl adds more clients. They have stated, "Our current situation continues to get worse as we add additional clients and workload." We must provide a platform that enables, not penalizes, their growth.

      1. What is the nature and description of the request?
      Today, automation jobs are not backed by controller objects[1], which prevents scale-down events[2] in the cluster autoscaler.

      2. Why does the customer need this? (List the business requirements here)
      The customer would like to leverage the cluster autoscaler to scale the worker nodes needed to run large numbers of Ansible automation jobs, rather than running a fixed number of worker nodes and having jobs sit in a pending state when the cluster is at full capacity. Today this works fine for scale-up events, but an issue arises when the cluster autoscaler attempts to scale down. Because of the way the cluster autoscaler works, it will not taint a node if a pod on it is not backed by a controller object, allowing new job pods to be placed on that node even when other nodes have capacity to run them.

      3. How would you like to achieve this? (List the functional requirements here)
      Add the ability to run automation jobs as controller-backed objects (such as a Kubernetes Job) instead of bare pod specs[1]. Then, in combination with the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation, jobs would not be evicted and would continue to run, while still allowing the cluster autoscaler to taint the worker node for removal and prevent new job pods from being placed on it.
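      The requested shape of such a job could look roughly like the following. This is a minimal sketch, not the controller's actual implementation; the helper function, job name, and image are illustrative placeholders. It builds a batch/v1 Job whose pod template carries the safe-to-evict annotation described above.

```python
def automation_job_manifest(name: str, image: str) -> dict:
    """Sketch: build a controller-backed Job spec instead of a bare Pod spec.

    The name and image arguments are hypothetical placeholders.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Automation jobs should not be auto-retried by the Job controller.
            "backoffLimit": 0,
            "template": {
                "metadata": {
                    "annotations": {
                        # Tells the cluster autoscaler this pod must not be
                        # evicted during scale-down; the node can still be
                        # tainted so no new job pods land on it.
                        "cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
                    }
                },
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "runner", "image": image}],
                },
            },
        },
    }


manifest = automation_job_manifest("demo-job", "quay.io/example/ee:latest")
```

      Because the pod is owned by a Job, the cluster autoscaler's controller-backed check[1] is satisfied, and the annotation handles the "do not evict the running job" half of the requirement.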

      List any affected known dependencies: Doc, UI, etc.
      Unknown

      Github Link if any
      N/A

      [1] https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node

      [2] https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-scale-down-work

              Brian Coursen (bcoursen@redhat.com)
              Michael Tipton (rh-ee-mtipton)
              Votes: 4
              Watchers: 4
