Type: Bug
Resolution: Unresolved
Priority: Major
Description of problem:
It's possible that during a scale-up request from RHODS, the GPU node gets scaled down automatically by the autoscaler because it takes too long to start and to get labelled by the NVIDIA GPU add-on.
I've seen this happen when a bug prevented the NVIDIA add-on from working on OCP 4.11: the gpu-feature-discovery pod was stuck in "Init" state on the node (so the nvidia.com labels on the node were still missing), the spawner was still waiting to schedule the pod, and then the machine/node got destroyed while the MachineSet was scaled down from 1 replica to 0.
Once this happens, any subsequent attempt to spawn a notebook with a GPU fails with the error shown under "Actual results" below.
The scale-up is not triggered anymore, and the only way to unblock the cluster and try again is to delete the autoscaler/machine pool entirely from the cluster and create a new one.
Note that "nvidia.com/gpu" is a label applied by the NVIDIA add-on, and in my case the add-on was unable to run on the node and label it during the first autoscale request; further requests seem to "remember" that this autoscaler did not result in a node with the nvidia.com/gpu label.
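A few commands that should help confirm this state; the namespaces and names below are assumptions based on a default install, not values taken from the affected cluster:

# Check whether the new node ever got the nvidia.com labels applied by gpu-feature-discovery
oc get nodes -L nvidia.com/gpu.product
# Check the state of the gpu-feature-discovery pod (stuck in "Init" in my case)
oc get pods -A -o wide | grep gpu-feature-discovery
# Watch the GPU MachineSet getting scaled back down to 0
oc get machinesets -n openshift-machine-api -w
# Cluster autoscaler logs (deployment name assumes a ClusterAutoscaler named "default")
oc logs deploy/cluster-autoscaler-default -n openshift-machine-api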
Prerequisites (if any, like setup, operators/versions):
RHODS 1.19.0-14 on OCP 4.11
Steps to Reproduce:
- Install RHODS
- Define a GPU autoscaler (see the sketch after this list)
- Install the NVIDIA GPU add-on
- Request a server with GPU(s) to trigger a scale-up
- (unclear) Somehow prevent the GPU node from becoming ready until it gets scaled down automatically
  - In my case this happened on its own because of an incompatibility between the NVIDIA add-on and OCP 4.11.12
- Request another server with GPU(s) to trigger a scale-up again
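For reference, a minimal sketch of the "Define GPU autoscaler" step on a self-managed cluster; on a managed (OSD/ROSA) cluster the equivalent is an autoscaling machine pool created through OCM. The MachineSet name, replica limits, and autoscaler name are placeholders, and a ClusterAutoscaler resource (e.g. named "default") is assumed to exist already:

# MachineAutoscaler targeting the GPU MachineSet, allowing scale from 0 to 1
cat <<'EOF' | oc apply -f -
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: gpu-machineset-autoscaler
  namespace: openshift-machine-api
spec:
  minReplicas: 0
  maxReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: <gpu-machineset-name>
EOF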
Actual results:
Any further autoscale request fails with the error "NotTriggerScaleUp" due to "insufficient nvidia.com/gpu"
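The failed scale-up shows up as an event on the pending notebook pod, e.g. (the namespace below assumes the default RHODS notebooks namespace and may differ):

oc get events -n rhods-notebooks --field-selector reason=NotTriggerScaleUp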
Expected results:
The scale-up is triggered again, in theory completing successfully and scheduling the server
Reproducibility (Always/Intermittent/Only Once):
Always on the affected cluster (only one cluster observed)
Build Details:
RHODS 1.19.0-14, OCP 4.11.12
Workaround:
Delete the autoscaling machine pool, create a new one, and request the server again
In my case this did not fully solve the issue because of the incompatibility between the NVIDIA add-on and OCP 4.11.12, but I was at least able to trigger the autoscaler again
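On a managed cluster the machine pool is deleted and recreated through OCM; for a self-managed cluster, the rough equivalent via the in-cluster objects would be the following (resource names are placeholders):

# Remove the autoscaler that "remembers" the failed scale-up, together with its MachineSet...
oc -n openshift-machine-api delete machineautoscaler gpu-machineset-autoscaler
oc -n openshift-machine-api delete machineset <gpu-machineset-name>
# ...then recreate both (see the MachineAutoscaler sketch above) and request the server with GPU(s) again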