OCPBUGS-65857

RHBoK ClusterQueue not regenerated when new GPU node added to cluster


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version: 4.20.z
    • Component: Node / Kueue
    • Architecture: x86_64
    • Dev

      Description of problem:

      Single Node OpenShift 4.20.3, AWS, g6.4xlarge, L4 GPU, time-sliced

      ---
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: time-slicing-config
        namespace: nvidia-gpu-operator
      data:
          NVIDIA-L4: |-  # this must match node labels nvidia.com/gpu.product=NVIDIA-L4 and nvidia.com/device-plugin.config=NVIDIA-L4
            version: v1
            flags:
              migStrategy: none
            sharing:
              timeSlicing:
                resources:
                - name: nvidia.com/gpu
                  replicas: 8 

      RHOAI 3.0 with OCP Kueue Operator (v1.1.0) deployed.

      DSC configured so that RHBoK is in control:

          kueue:
            defaultClusterQueueName: default
            defaultLocalQueueName: default
            managementState: Unmanaged
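
      The kueue fragment above sits under spec.components of the RHOAI DataScienceCluster; a fuller sketch for context (the apiVersion and surrounding field names are assumptions based on the RHOAI API, not taken from this report):

```yaml
# Sketch of a DataScienceCluster with Kueue set to Unmanaged,
# so the OCP Kueue Operator (RHBoK) owns the Kueue deployment.
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc          # name is illustrative
spec:
  components:
    kueue:
      defaultClusterQueueName: default
      defaultLocalQueueName: default
      managementState: Unmanaged
```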

      I added one new GPU worker node to my SNO cluster using a MachineSet.

      It seems the Kueue ClusterQueue was not regenerated; it did not pick up the new node.

      I had to delete the default ClusterQueue to get the new resource quota to show.

      oc get clusterqueue on the previous setup with one node (time-sliced) gives:

      coveredResources:
      - nvidia.com/gpu
      flavors:
      - name: nvidia-gpu-flavor
        resources:
        - name: nvidia.com/gpu
          nominalQuota: "8"

      and new (two nodes with time slicing) after running oc delete clusterqueue:

      coveredResources:
      - nvidia.com/gpu
      flavors:
      - name: nvidia-gpu-flavor
        resources:
        - name: nvidia.com/gpu
          nominalQuota: "16"
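
      The two quotas above follow directly from node count times the time-slicing replica factor; a minimal sketch of that arithmetic (the function name is illustrative):

```python
# Each physical GPU is advertised as `replicas` schedulable nvidia.com/gpu
# units when time slicing is enabled, so the expected ClusterQueue quota is:
def expected_nominal_quota(gpu_nodes: int, gpus_per_node: int, replicas: int) -> int:
    return gpu_nodes * gpus_per_node * replicas

# One g6.4xlarge node (1 L4 GPU, 8 replicas) -> nominalQuota "8"
print(expected_nominal_quota(1, 1, 8))  # 8
# After adding a second single-GPU node -> quota should become "16"
print(expected_nominal_quota(2, 1, 8))  # 16
```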

      Version-Release number of selected component (if applicable):

      kueue-operator.v1.1.0

      How reproducible:

      Always

      Steps to Reproduce:

      1. SNO node with 4.20.3 ocp
      2. Add new MachineSet g6.2xlarge - scale to 1
      3. Wait for nfd + nvidia gpu to deploy
      4. Check oc get clusterqueue default -o yaml    
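
      Step 4 can also be checked programmatically; a sketch that sums the GPU quota from a parsed ClusterQueue spec (the dict below simulates the stale single-node output from this report, and the helper name is illustrative):

```python
# `spec` simulates the relevant part of `oc get clusterqueue default -o yaml`
# parsed into Python; it reflects the stale single-node state from this report.
spec = {
    "coveredResources": ["nvidia.com/gpu"],
    "flavors": [
        {"name": "nvidia-gpu-flavor",
         "resources": [{"name": "nvidia.com/gpu", "nominalQuota": "8"}]},
    ],
}

def gpu_quota(spec: dict) -> int:
    """Sum the nvidia.com/gpu nominalQuota across all flavors."""
    return sum(int(r["nominalQuota"])
               for f in spec["flavors"]
               for r in f["resources"]
               if r["name"] == "nvidia.com/gpu")

# With two single-GPU nodes at 8 time-slicing replicas each, 16 is expected;
# the stale ClusterQueue still reports 8, which is the bug.
print(gpu_quota(spec))  # 8
```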

      Actual results:

      New GPU machine is not recognized in the ClusterQueue.

      Expected results:

      New GPU machine is recognized in the ClusterQueue; the ClusterQueue reconciles to the cluster topology.

      Additional info:

      This could well be on the RHOAI side; I am not sure which component generates the ClusterQueue and LocalQueue configuration.

              aos-node@redhat.com Node Team Bot Account
              rhn-sa-mhepburn Mike Hepburn
              Alice Nahas