Red Hat OpenShift Data Science / RHODS-5906

NotTriggerScaleUp error with GPU autoscaler defined


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: Integrations
    • Release Note Type: Known Issue
    • Release Note Text:
      == The NVIDIA GPU Operator is incompatible with OpenShift 4.11.12
      Provisioning a GPU node on an OpenShift 4.11.12 cluster results in the `nvidia-driver-daemonset` pod getting stuck in a CrashLoopBackOff state.
      The NVIDIA GPU Operator is compatible with OpenShift 4.11.9 and 4.11.13.

    Description

      Description of problem:

      It's possible that during a scale-up request from RHODS, the GPU node gets scaled down automatically by the autoscaler because it takes too long to start and get labelled by the NVIDIA GPU add-on.
      I've seen this happen when a bug prevented the NVIDIA add-on from working on OCP 4.11 – the gpu-feature-discovery pod was stuck in the "Init" state on the node (so the nvidia.com labels were still missing), the spawner was still waiting to schedule the notebook pod, and then the machine/node got destroyed and the machineSet was scaled back down from 1 replica to 0.
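      The state described above can be confirmed from outside the spawner. The sketch below is not part of the original report; it assumes the NVIDIA GPU add-on runs in a "nvidia-gpu-operator" namespace and labels its pods with app=gpu-feature-discovery, and it uses the Python Kubernetes client to check whether the new node ever received its nvidia.com/* labels and whether gpu-feature-discovery is still initializing.

      ```python
      # Diagnostic sketch (assumptions: namespace and label selector may differ).
      from kubernetes import client, config

      config.load_kube_config()
      v1 = client.CoreV1Api()

      # 1. Does any node carry the nvidia.com/* labels the spawner is waiting for?
      for node in v1.list_node().items:
          gpu_labels = {k: v for k, v in (node.metadata.labels or {}).items()
                        if k.startswith("nvidia.com/")}
          print(node.metadata.name, gpu_labels or "no nvidia.com labels yet")

      # 2. Is gpu-feature-discovery stuck before reaching Running (e.g. in Init)?
      pods = v1.list_namespaced_pod("nvidia-gpu-operator",
                                    label_selector="app=gpu-feature-discovery")
      for pod in pods.items:
          print(pod.metadata.name, pod.status.phase)
      ```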

      Once this happens, any subsequent attempt to spawn a notebook with a GPU results in an error.

      The scale-up is no longer triggered, and the only way to unblock the cluster and try again is to delete the autoscaler/machine pool entirely from the cluster and create a new one.

      Note that "nvidia.com/gpu" is a label applied by the nvidia addon, and in my case the addon was unable to run on the node and label it during the first autoscale request; Further requests seem to "remember" that this autoscaler did not result in a node with the nvidia.com/gpu label.

      Prerequisites (if any, like setup, operators/versions):

      RHODS 1.19.0-14 on OCP 4.11

      Steps to Reproduce

      1. Install RHODS
      2. Define GPU autoscaler
      3. Install the NVIDIA GPU add-on
      4. Request a server with GPU(s) to trigger an autoscale (a stand-in pod sketch follows this list)
      5. (unclear) somehow prevent the GPU node from becoming ready until it is automatically scaled down
        1. In my case this happened on its own because of an incompatibility between the NVIDIA add-on and OCP 4.11.12
      6. Request another server with GPU(s) to trigger autoscale
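      For steps 4 and 6, any pending pod that requests nvidia.com/gpu should make the cluster autoscaler consider a scale-up, so a bare pod can stand in for the notebook server. A minimal sketch, with a placeholder image, pod name, and namespace (none of these come from the report):

      ```python
      # Stand-in for a GPU notebook request: a pod asking for one nvidia.com/gpu.
      from kubernetes import client, config

      config.load_kube_config()
      v1 = client.CoreV1Api()

      pod = client.V1Pod(
          metadata=client.V1ObjectMeta(name="gpu-scaleup-probe"),
          spec=client.V1PodSpec(
              restart_policy="Never",
              containers=[client.V1Container(
                  name="cuda",
                  image="nvcr.io/nvidia/cuda:11.8.0-base-ubi8",  # placeholder image
                  command=["nvidia-smi"],
                  resources=client.V1ResourceRequirements(
                      limits={"nvidia.com/gpu": "1"}),
              )],
          ),
      )
      v1.create_namespaced_pod(namespace="default", body=pod)
      ```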

      Actual results:

      Any further autoscale request fails with the error "NotTriggerScaleUp" because of "insufficient nvidia.com/gpu"

      Expected results:

      Autoscale is triggered again – in theory completing successfully and scheduling the server

      Reproducibility (Always/Intermittent/Only Once):

      Always on the affected cluster (only one such cluster observed)

      Build Details:

      RHODS 1.19.0-14, OCP 4.11.12

      Workaround:

      Delete the autoscaling machine pool, create a new one, and request the server again.

      In my case the underlying issue was not solved, because of the incompatibility between the NVIDIA add-on and OCP 4.11.12, but I was at least able to trigger the autoscaler again.
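      On a self-managed cluster the same workaround can be expressed against the OpenShift machine APIs. The sketch below is an assumption rather than the documented procedure: the resource names are placeholders, and on managed (OSD/ROSA) clusters the machine pool is deleted and recreated through OCM instead.

      ```python
      # Delete the GPU MachineAutoscaler and MachineSet so the autoscaler forgets
      # the failed "no nvidia.com/gpu" result, then recreate them from saved YAML.
      from kubernetes import client, config, dynamic

      config.load_kube_config()
      dyn = dynamic.DynamicClient(client.ApiClient())

      machine_autoscalers = dyn.resources.get(
          api_version="autoscaling.openshift.io/v1beta1", kind="MachineAutoscaler")
      machine_sets = dyn.resources.get(
          api_version="machine.openshift.io/v1beta1", kind="MachineSet")

      # Placeholder names for the GPU machine pool being recreated.
      machine_autoscalers.delete(name="gpu-autoscaler", namespace="openshift-machine-api")
      machine_sets.delete(name="gpu-machineset", namespace="openshift-machine-api")
      # ...then apply fresh MachineSet/MachineAutoscaler manifests and request the
      # notebook server again.
      ```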

      Additional info:


          People

            Assignee: Unassigned
            Reporter: rhn-support-lgiorgi (Luca Giorgi)
            Votes: 0
            Watchers: 5