OCPBUGS-65857

RHBoK ClusterQueue not regenerated when new GPU node added to cluster


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version: 4.20.z
    • Component: Node / Kueue
    • Architecture: x86_64
    • Dev

      Description of problem:

      Single Node OpenShift 4.20.3, AWS, g6.4xlarge, L4 GPU, time-sliced

      ---
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: time-slicing-config
        namespace: nvidia-gpu-operator
      data:
          NVIDIA-L4: |-  # this must match node labels nvidia.com/gpu.product=NVIDIA-L4 and nvidia.com/device-plugin.config=NVIDIA-L4
            version: v1
            flags:
              migStrategy: none
            sharing:
              timeSlicing:
                resources:
                - name: nvidia.com/gpu
                  replicas: 8 

      RHOAI 3.0 with OCP Kueue Operator (v1.1.0) deployed.

      DSC configured so that RHBoK is in control:

          kueue:
            defaultClusterQueueName: default
            defaultLocalQueueName: default
            managementState: Unmanaged
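
      The kueue fragment above sits under spec.components of the RHOAI DataScienceCluster; a fuller sketch for context (the apiVersion and surrounding field names are assumptions based on the RHOAI API, not taken from this report):

```yaml
# Sketch of a DataScienceCluster with Kueue set to Unmanaged,
# so the OCP Kueue Operator (RHBoK) owns the Kueue deployment.
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc          # name is illustrative
spec:
  components:
    kueue:
      defaultClusterQueueName: default
      defaultLocalQueueName: default
      managementState: Unmanaged
```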

      I added one new GPU worker node to my SNO cluster using a MachineSet.

      It seems the Kueue ClusterQueue was not regenerated; it did not pick up the new node.

      I had to delete the default ClusterQueue to get the new resource quota to show.

      oc get clusterqueue on the previous setup with one node (time-sliced) gives:

      coveredResources:
      - nvidia.com/gpu
      flavors:
      - name: nvidia-gpu-flavor
        resources:
        - name: nvidia.com/gpu
          nominalQuota: "8"

      and new (two nodes with time slicing) after running oc delete clusterqueue:

      coveredResources:
      - nvidia.com/gpu
      flavors:
      - name: nvidia-gpu-flavor
        resources:
        - name: nvidia.com/gpu
          nominalQuota: "16"
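
      The two quotas above follow directly from node count times the time-slicing replica factor; a minimal sketch of that arithmetic (the function name is illustrative):

```python
# Each physical GPU is advertised as `replicas` schedulable nvidia.com/gpu
# units when time slicing is enabled, so the expected ClusterQueue quota is:
def expected_nominal_quota(gpu_nodes: int, gpus_per_node: int, replicas: int) -> int:
    return gpu_nodes * gpus_per_node * replicas

# One g6.4xlarge node (1 L4 GPU, 8 replicas) -> nominalQuota "8"
print(expected_nominal_quota(1, 1, 8))  # 8
# After adding a second single-GPU node -> quota should become "16"
print(expected_nominal_quota(2, 1, 8))  # 16
```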

      Version-Release number of selected component (if applicable):

      kueue-operator.v1.1.0

      How reproducible:

      Always

      Steps to Reproduce:

      1. SNO node with 4.20.3 ocp
      2. Add new MachineSet g6.2xlarge - scale to 1
      3. Wait for nfd + nvidia gpu to deploy
      4. Check oc get clusterqueue default -o yaml    
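
      Step 4 can also be checked programmatically; a sketch that sums the GPU quota from a parsed ClusterQueue spec (the dict below simulates the stale single-node output from this report, and the helper name is illustrative):

```python
# `spec` simulates the relevant part of `oc get clusterqueue default -o yaml`
# parsed into Python; it reflects the stale single-node state from this report.
spec = {
    "coveredResources": ["nvidia.com/gpu"],
    "flavors": [
        {"name": "nvidia-gpu-flavor",
         "resources": [{"name": "nvidia.com/gpu", "nominalQuota": "8"}]},
    ],
}

def gpu_quota(spec: dict) -> int:
    """Sum the nvidia.com/gpu nominalQuota across all flavors."""
    return sum(int(r["nominalQuota"])
               for f in spec["flavors"]
               for r in f["resources"]
               if r["name"] == "nvidia.com/gpu")

# With two single-GPU nodes at 8 time-slicing replicas each, 16 is expected;
# the stale ClusterQueue still reports 8, which is the bug.
print(gpu_quota(spec))  # 8
```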

      Actual results:

      New GPU machine is not recognized in the ClusterQueue.

      Expected results:

      New GPU machine is recognized in the ClusterQueue; the ClusterQueue reconciles to the cluster topology.

      Additional info:

      This could well be on the RHOAI side; I am not sure which component generates the ClusterQueue and LocalQueue configuration.

              aos-node@redhat.com Node Team Bot Account
              rhn-sa-mhepburn Mike Hepburn
              Alice Nahas