- Bug
- Resolution: Done
- Major
- quay.io/modh/rhods-operator-live-catalog:1.27.0-rhods-8152
- RHODS 1.27, RHODS 1.28
Description of problem:
I was contacted by a customer whose RHODS environment was no longer spawning notebooks successfully.
Upon investigation, it was found that many pods in the namespace redhat-ods-applications had been in a Pending state for a few hours, including at least some of the Notebook Controller pods.
This situation was due to a few factors:
- RHODS 1.25 (cloud version) had been released earlier in the day
- the pods were trying to roll over (during which resource requirements temporarily increase)
- the only machine pool where these pods could run was at capacity and not configured for auto-scaling.
I told the customer to add a couple of machines to the default node pool; the pods then reached a Running state, the update to RHODS 1.25 completed, and users were once again able to spawn notebooks.
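The diagnosis and remediation described above can be sketched with standard oc commands; this is a sketch only, the placeholder names in angle brackets are assumptions, and exact resource names vary per cluster:

```shell
# Sketch only; assumes cluster-admin access via oc. Names in <> are placeholders.

# List pods stuck in Pending in the RHODS applications namespace:
oc get pods -n redhat-ods-applications --field-selector=status.phase=Pending

# Inspect scheduling events for one of them
# (look for "Insufficient cpu" / "Insufficient memory" in Events):
oc describe pod <pending-pod-name> -n redhat-ods-applications

# Check current worker capacity, then add a couple of machines:
oc get machinesets -n openshift-machine-api
oc scale machineset <machineset-name> -n openshift-machine-api --replicas=<current+2>
```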
Prerequisites (if any, like setup, operators/versions):
Steps to Reproduce:
- Deploy RHODS
- fill up the cluster so that no new pods can be added
- trigger an upgrade for RHODS
Actual results:
RHODS gets stuck during the upgrade and stops working.
Expected results:
RHODS gets stuck during the upgrade and stops working, but SRE notices, connects to the cluster, adds a couple of nodes so that RHODS can finish its update, then removes the extra nodes.
Reproducibility (Always/Intermittent/Only Once):
Intermittent; I believe this is the second time I have seen this.
Build Details:
RHODS 1.24->1.25
Workaround:
The customer should maintain some "headroom" in their cluster, but we would need to tell them how much headroom is required. At a minimum, the customer should configure some autoscaling headroom.
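One way to provide that autoscaling headroom is OpenShift's cluster autoscaler. A minimal sketch, assuming a ClusterAutoscaler resource is already deployed; the MachineSet name below is a placeholder:

```yaml
# Sketch only: allows the worker machine set to grow during upgrades.
# "worker-machineset" is a placeholder; use the real MachineSet name.
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-autoscaler
  namespace: openshift-machine-api
spec:
  minReplicas: 2
  maxReplicas: 4
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: worker-machineset
```

With this in place, the extra machines needed during an operator rollout would be added and removed automatically instead of requiring manual SRE intervention.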
Additional info: