Red Hat OpenShift Data Science / RHODS-5543

When using the Nvidia GPU Operator, more nodes than needed are created by the Node Autoscaler


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: Integrations
    • Release note text:
      == When using the Nvidia GPU add-on, more nodes than needed are created by the Node Autoscaler
      When a pod cannot be scheduled due to insufficient available resources, the Node Autoscaler creates a new node. There is a delay until the newly created node receives the relevant GPU workload. Consequently, the pod cannot be scheduled and the Node Autoscaler continuously creates additional new nodes until one of the nodes is ready to receive the GPU workload. For more information about this issue, see link:https://access.redhat.com/solutions/6055181[When using the Nvidia GPU Operator, more nodes than needed are created by the Node Autoscaler].

      *Workaround*: Apply the `cluster-api/accelerator` label in `machineset.spec.template.spec.metadata` (see the MachineSet sketch after this list). This causes the autoscaler to consider those nodes as unready until the GPU driver has been deployed.
    • Known Issue
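      For illustration, a minimal MachineSet excerpt showing where the workaround label goes. Only the `cluster-api/accelerator` key and its location under `spec.template.spec.metadata` come from the workaround above; the MachineSet name and label value are placeholders, and `openshift-machine-api` is the standard MachineSet namespace.

      # Hypothetical MachineSet excerpt; the name and label value are placeholders.
      apiVersion: machine.openshift.io/v1beta1
      kind: MachineSet
      metadata:
        name: gpu-worker-us-east-1a          # placeholder name
        namespace: openshift-machine-api
      spec:
        template:
          spec:
            metadata:
              labels:
                # With this key present, the autoscaler considers the new nodes
                # unready until the GPU driver has been deployed, so it stops
                # scaling up additional nodes for the same pending pod.
                cluster-api/accelerator: nvidia-gpu   # example value; the key is what matters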

    Description

      We have an issue with autoscaling nodes with GPUs.

      If a user requests a notebook with at least one GPU and no currently running node can accept it, a GPU node is correctly scaled up. However, until the NVIDIA DCGM exporter is up on that node, the spawned notebook pod still reports a lack of GPUs, so the cluster autoscaler creates several more nodes until at least one node has the exporter running and accepts the notebook pod.

      This is a known issue: https://access.redhat.com/solutions/6055181

      Related RHODS issue: https://issues.redhat.com/browse/RHODS-4617
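
      For reference, a minimal sketch of the kind of GPU request that triggers this loop, assuming a hypothetical notebook pod (the name and image are placeholders). The pod stays Pending because the `nvidia.com/gpu` extended resource is only advertised on a node once the NVIDIA driver and device plugin are running there.

      # Hypothetical notebook pod; the name and image are placeholders.
      apiVersion: v1
      kind: Pod
      metadata:
        name: gpu-notebook-example
      spec:
        containers:
        - name: notebook
          image: example.com/notebook:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "1"   # requests at least one GPU; the pod stays Pending
                                    # until a node actually advertises this resource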


    People

      Assignee: Unassigned
      Reporter: rh-ee-mroman Maros Roman (Inactive)
      Votes: 0
      Watchers: 7
