  1. OpenShift Request For Enhancement
  2. RFE-8811

Feature Request: Graceful Pause/Resume Capability for GPU Node Upgrades


    • Feature Request
    • Resolution: Unresolved
    • Normal
    • all
    • MCO
    • Product / Portfolio Work

      1. Proposed title of this feature request
      During an OCP upgrade, provide the ability to:

      • display the next worker node that will be picked for upgrade
      • pause between the upgrades of two consecutive worker nodes until an operator manually resumes
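
      The two requested behaviours can be sketched as a small model. This is only an illustration of the semantics we are asking for, not existing MCO behaviour or API: the function name `gated_rollout`, the `upgrade` and `approve` callbacks, and the node names are all hypothetical.

```python
from typing import Callable, Iterable, List


def gated_rollout(
    nodes: Iterable[str],
    upgrade: Callable[[str], None],
    approve: Callable[[str], bool],
) -> List[str]:
    """Upgrade nodes one at a time behind a manual approval gate.

    Before each node is touched, its name is announced (the requested
    visibility) and `approve` is consulted (the requested manual pause).
    The rollout stops at the first node that is not approved.
    Returns the list of nodes that were upgraded.
    """
    upgraded: List[str] = []
    for node in nodes:
        print(f"next worker node for upgrade: {node}")
        if not approve(node):
            print(f"rollout paused before {node}")
            break
        upgrade(node)
        upgraded.append(node)
    return upgraded


# Hypothetical example: buffer capacity has been arranged only for the
# first two nodes, so the operator withholds approval for the third.
ready = {"gpu-node-0", "gpu-node-1"}
done = gated_rollout(
    ["gpu-node-0", "gpu-node-1", "gpu-node-2"],
    upgrade=lambda n: print(f"upgrading {n}"),
    approve=lambda n: n in ready,
)
print("upgraded so far:", done)
```

      In a real workflow the `approve` step would be an operator action (for example, a resume command) rather than a lookup in a set. The point is that the upgrade waits at each node boundary instead of proceeding on its own.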

      2. What is the nature and description of the request?
      Currently, worker nodes appear to be picked at random while an OCP upgrade is in progress.
      We have a limited number of GPU nodes and little to no buffer capacity. Because we cannot see which worker node will be picked next, it is difficult to arrange buffer capacity in advance for the GPU profile of that node.
      When no buffer capacity is available, the next GPU node picked fails to drain its workloads. The OCP upgrade then gets stuck and causes an outage for our clients, which requires manual intervention and significantly reduces the availability of the platform for AI/ML workloads.
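
      The failure mode above can be illustrated with a simplified capacity check. It assumes, purely for illustration, that a node can only be drained if at least one spare node with the same GPU profile exists. The function name `blocked_nodes`, the profile labels, and the node names are hypothetical.

```python
from typing import List, Mapping


def blocked_nodes(
    node_profiles: Mapping[str, str],
    spare_nodes: Mapping[str, int],
) -> List[str]:
    """Return the nodes whose drain would block under the assumption above.

    node_profiles maps node name -> GPU profile.
    spare_nodes maps GPU profile -> number of spare (buffer) nodes.
    A node is considered blocked when its profile has no spare node.
    """
    return [
        node
        for node, profile in node_profiles.items()
        if spare_nodes.get(profile, 0) < 1
    ]


# Hypothetical cluster: one spare node exists for profile "a100", none for "h100".
profiles = {"gpu-node-0": "a100", "gpu-node-1": "h100"}
spare = {"a100": 1}
print(blocked_nodes(profiles, spare))
```

      If the upgrade picks a blocked node next and we have no advance notice, there is no opportunity to add buffer capacity for that profile before the drain starts.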

      3. Why does the customer need this? (List the business requirements here)
      Visibility of the next worker node to be picked for the OCP upgrade will help us prepare buffer capacity in advance for the GPU profile of that node.
      A manually controlled delay between two consecutively picked GPU nodes will give us enough time to arrange a node with the required GPU profile. This ensures that the OCP upgrade does not get stuck for lack of buffer capacity, avoids outages for clients, and lets us maintain our SLA.

      4. List any affected packages or components.
      not sure

              rhn-support-mrussell Mark Russell
              rhn-support-hmongia Harshit Mongia