Red Hat OpenShift Data Science / RHODS-5543

When using the Nvidia GPU Operator, more nodes than needed are created by the Node Autoscaler


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: Integrations
    • Release note text:
      == When using the Nvidia GPU add-on, more nodes than needed are created by the Node Autoscaler
      When a pod cannot be scheduled due to insufficient available resources, the Node Autoscaler creates a new node. There is a delay until the newly created node receives the relevant GPU workload. Consequently, the pod cannot be scheduled and the Node Autoscaler continuously creates additional new nodes until one of the nodes is ready to receive the GPU workload. For more information about this issue, see link:https://access.redhat.com/solutions/6055181[When using the Nvidia GPU Operator, more nodes than needed are created by the Node Autoscaler].

      *Workaround*: Apply the `cluster-api/accelerator` label in `machineset.spec.template.spec.metadata` (see the MachineSet sketch after this list). This causes the autoscaler to consider those nodes as unready until the GPU driver has been deployed.
    • Known Issue
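      For illustration, a minimal MachineSet excerpt showing where the workaround label goes. Only the `cluster-api/accelerator` key and its location under `spec.template.spec.metadata` come from the workaround above; the MachineSet name and label value are placeholders, and `openshift-machine-api` is the standard MachineSet namespace.

      # Hypothetical MachineSet excerpt; the name and label value are placeholders.
      apiVersion: machine.openshift.io/v1beta1
      kind: MachineSet
      metadata:
        name: gpu-worker-us-east-1a          # placeholder name
        namespace: openshift-machine-api
      spec:
        template:
          spec:
            metadata:
              labels:
                # With this key present, the autoscaler considers the new nodes
                # unready until the GPU driver has been deployed, so it stops
                # scaling up additional nodes for the same pending pod.
                cluster-api/accelerator: nvidia-gpu   # example value; the key is what matters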

    Description

      We have an issue with autoscaling nodes with GPUs.

      If a user requests a notebook with at least one GPU and no currently running node can accept it, a GPU node is correctly scaled up. However, until the NVIDIA DCGM exporter is up on that node, the spawned notebook pod still reports a lack of GPUs, so the cluster autoscaler creates several more nodes until at least one node has the exporter running and accepts the notebook pod.

      This is a known issue: https://access.redhat.com/solutions/6055181

      Related RHODS issue: https://issues.redhat.com/browse/RHODS-4617
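
      For reference, a minimal sketch of the kind of GPU request that triggers this loop, assuming a hypothetical notebook pod (the name and image are placeholders). The pod stays Pending because the `nvidia.com/gpu` extended resource is only advertised on a node once the NVIDIA driver and device plugin are running there.

      # Hypothetical notebook pod; the name and image are placeholders.
      apiVersion: v1
      kind: Pod
      metadata:
        name: gpu-notebook-example
      spec:
        containers:
        - name: notebook
          image: example.com/notebook:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "1"   # requests at least one GPU; the pod stays Pending
                                    # until a node actually advertises this resource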


    People

      Assignee: Unassigned
      Reporter: rh-ee-mroman Maros Roman (Inactive)
      Votes: 0
      Watchers: 7
