Type: Bug
Resolution: Unresolved
Priority: Major
Description of problem:
It's possible that during a scale-up request from RHODS, the GPU node gets scaled down automatically by the autoscaler because it takes too long to start and to get labelled by the NVIDIA GPU add-on.
I've seen this happen when a bug prevented the NVIDIA add-on from working on OCP 4.11: the gpu-feature-discovery pod was stuck in "Init" state on the node (so the nvidia.com labels on the node were still missing), the spawner was still waiting to schedule the pod, and then the machine/node got destroyed while the MachineSet was scaled down from 1 replica to 0.
Once this happens, any subsequent attempt to spawn a notebook with a GPU fails with the error shown under "Actual results" below.
The scale-up is not triggered anymore, and the only way to unblock the cluster and try again is to delete the autoscaler/machine pool entirely from the cluster and create a new one.
Note that "nvidia.com/gpu" is a label applied by the NVIDIA add-on, and in my case the add-on was unable to run on the node and label it during the first autoscale request; further requests seem to "remember" that this autoscaler did not result in a node with the nvidia.com/gpu label.
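A few commands that should help confirm this state; the namespaces and names below are assumptions based on a default install, not values taken from the affected cluster:

# Check whether the new node ever got the nvidia.com labels applied by gpu-feature-discovery
oc get nodes -L nvidia.com/gpu.product
# Check the state of the gpu-feature-discovery pod (stuck in "Init" in my case)
oc get pods -A -o wide | grep gpu-feature-discovery
# Watch the GPU MachineSet getting scaled back down to 0
oc get machinesets -n openshift-machine-api -w
# Cluster autoscaler logs (deployment name assumes a ClusterAutoscaler named "default")
oc logs deploy/cluster-autoscaler-default -n openshift-machine-api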
Prerequisites (if any, like setup, operators/versions):
RHODS 1.19.0-14 on OCP 4.11
Steps to Reproduce:
- Install RHODS
- Define a GPU autoscaler (see the sketch after this list)
- Install the NVIDIA GPU add-on
- Request a server with GPU(s) to trigger a scale-up
- (unclear) Somehow prevent the GPU node from becoming ready until it gets scaled down automatically
  - In my case this happened on its own because of an incompatibility between the NVIDIA add-on and OCP 4.11.12
- Request another server with GPU(s) to trigger a scale-up again
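For reference, a minimal sketch of the "Define GPU autoscaler" step on a self-managed cluster; on a managed (OSD/ROSA) cluster the equivalent is an autoscaling machine pool created through OCM. The MachineSet name, replica limits, and autoscaler name are placeholders, and a ClusterAutoscaler resource (e.g. named "default") is assumed to exist already:

# MachineAutoscaler targeting the GPU MachineSet, allowing scale from 0 to 1
cat <<'EOF' | oc apply -f -
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: gpu-machineset-autoscaler
  namespace: openshift-machine-api
spec:
  minReplicas: 0
  maxReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: <gpu-machineset-name>
EOF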
Actual results:
Any further autoscale request fails with the error "NotTriggerScaleUp" due to "insufficient nvidia.com/gpu"
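The failed scale-up shows up as an event on the pending notebook pod, e.g. (the namespace below assumes the default RHODS notebooks namespace and may differ):

oc get events -n rhods-notebooks --field-selector reason=NotTriggerScaleUp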
Expected results:
The scale-up is triggered again, in theory completing successfully and scheduling the server
Reproducibility (Always/Intermittent/Only Once):
Always on the affected cluster (only one cluster observed)
Build Details:
RHODS 1.19.0-14, OCP 4.11.12
Workaround:
Delete the autoscaling machine pool, create a new one, and request the server again
In my case this did not fully solve the issue because of the incompatibility between the NVIDIA add-on and OCP 4.11.12, but I was at least able to trigger the autoscaler again
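On a managed cluster the machine pool is deleted and recreated through OCM; for a self-managed cluster, the rough equivalent via the in-cluster objects would be the following (resource names are placeholders):

# Remove the autoscaler that "remembers" the failed scale-up, together with its MachineSet...
oc -n openshift-machine-api delete machineautoscaler gpu-machineset-autoscaler
oc -n openshift-machine-api delete machineset <gpu-machineset-name>
# ...then recreate both (see the MachineAutoscaler sketch above) and request the server with GPU(s) again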