Type: Bug
Resolution: Unresolved
Priority: Normal
Affects Version: 4.20.z
Architecture: x86_64
Description of problem:
Single Node OpenShift (SNO) 4.20.3 on AWS, g6.4xlarge, NVIDIA L4 GPU, time-sliced:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-gpu-operator
data:
  NVIDIA-L4: |-
    # this must match node labels nvidia.com/gpu.product=NVIDIA-L4 and nvidia.com/device-plugin.config=NVIDIA-L4
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 8
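For context, this ConfigMap only takes effect once the GPU operator's ClusterPolicy points at it and the node carries the matching device-plugin label. A minimal sketch of that wiring, assuming the ClusterPolicy is named gpu-cluster-policy (the OperatorHub default) and <gpu-node> stands for the worker in question:

# Point the device plugin at the time-slicing ConfigMap; NVIDIA-L4 is the default profile key.
oc patch clusterpolicy gpu-cluster-policy --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "NVIDIA-L4"}}}}'

# Label the GPU node so it selects the NVIDIA-L4 profile from the ConfigMap.
oc label node <gpu-node> nvidia.com/device-plugin.config=NVIDIA-L4 --overwrite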
RHOAI 3.0 with OCP Kueue Operator (v1.1.0) deployed.
DSC configured so RHBoK (Red Hat build of Kueue) is in control:
kueue:
  defaultClusterQueueName: default
  defaultLocalQueueName: default
  managementState: Unmanaged
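To confirm what the operators actually see, the effective kueue settings can be read back from the DataScienceCluster; a quick check, assuming a single DSC instance and that the snippet above lives under spec.components.kueue:

# Print the kueue component configuration of the (single) DataScienceCluster.
oc get datasciencecluster -o jsonpath='{.items[0].spec.components.kueue}{"\n"}'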
I added one new GPU worker node to my SNO cluster using a MachineSet.
It seems the Kueue ClusterQueue was not regenerated and did not pick up the new node.
I had to delete the default ClusterQueue to get the new resource quota to show up (see the workaround sketch below).
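A minimal sketch of that workaround, assuming the generated ClusterQueue is named default and is recreated automatically by whichever component owns it:

# Delete the generated ClusterQueue so it is recreated with the current cluster capacity.
oc delete clusterqueue default

# After it comes back, the nvidia.com/gpu nominalQuota should reflect both nodes.
oc get clusterqueue default -o yaml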
Previously, with one time-sliced node, oc get clusterqueue default -o yaml showed:
coveredResources:
- nvidia.com/gpu
flavors:
- name: nvidia-gpu-flavor
  resources:
  - name: nvidia.com/gpu
    nominalQuota: "8"
And the new output, with two time-sliced nodes, after running oc delete clusterqueue:
coveredResources:
- nvidia.com/gpu
flavors:
- name: nvidia-gpu-flavor
  resources:
  - name: nvidia.com/gpu
    nominalQuota: "16"
Version-Release number of selected component (if applicable):
kueue-operator.v1.1.0
How reproducible:
Always
Steps to Reproduce:
1. SNO node with OCP 4.20.3.
2. Add a new g6.2xlarge MachineSet and scale it to 1 (sketched below).
3. Wait for NFD and the NVIDIA GPU operator to deploy on the new node.
4. Check oc get clusterqueue default -o yaml.
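A hedged sketch of step 2 as commands; the MachineSet name is a placeholder for whatever the copied manifest is called:

# Scale the new GPU MachineSet to one replica (name is a placeholder).
oc scale machineset <gpu-machineset> -n openshift-machine-api --replicas=1

# Watch for the node to join and pick up the NFD / GPU operator labels.
oc get nodes -l nvidia.com/gpu.product=NVIDIA-L4 -w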
Actual results:
The new GPU machine is not recognized in the ClusterQueue; the nominal quota stays at the old value until the ClusterQueue is deleted and regenerated.
Expected results:
The new GPU machine is recognized in the ClusterQueue; the ClusterQueue reconciles to the current cluster topology.
Additional info:
This could well be on the RHOAI side; I am not sure which component generates the ClusterQueue and LocalQueue configuration (one way to check ownership is sketched below).
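A hedged way to narrow down the owning component is to look at who created and who manages the default ClusterQueue; the labels, ownerReferences and managedFields entries usually name the controller. This is a generic inspection sketch, not RHOAI-specific:

# Show labels and ownerReferences on the generated ClusterQueue.
oc get clusterqueue default -o yaml | grep -A5 -E 'labels:|ownerReferences:'

# Field managers record which controller last wrote each part of the spec.
oc get clusterqueue default --show-managed-fields -o yaml | grep 'manager:'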