Bug
Resolution: Done
Critical
RHODS_1.1_GA
1.0.16
IDH Sprint 3
Description of problem:
In PSI clusters, CUDA builds consistently fail or get evicted (to my knowledge, always because of ephemeral storage).
This results in the CUDA, PyTorch and TensorFlow images never being available in the Spawner UI.
Prerequisites (if any, like setup, operators/versions):
Any RHODS version that ships the CUDA build, installed on a PSI OCP cluster.
Currently observed in RHODS 1.0.14 on OpenShift 4.7.4.
Steps to Reproduce:
- Install RHODS
- Wait for the CUDA build to complete/fail/get evicted
Actual results:
CUDA build pods always get evicted or fail outright. This results in the CUDA, PyTorch and TensorFlow images never becoming available in the spawner.
Expected results:
The CUDA build succeeds, and the CUDA, PyTorch and TensorFlow images are available to be spawned.
Reproducibility (Always/Intermittent/Only Once):
Always
Build Details:
RHODS 1.0.14
OCP cluster in PSI, running OpenShift 4.7.4
The CUDA build pods are named 11.0.3-cuda-s2i-xxx-ubi8-1-build.
Additional info:
In my last installation attempt, the following build pods successfully completed:
11.0.3-cuda-s2i-core-ubi8-1-build
11.0.3-cuda-s2i-base-ubi8-1-build
11.0.3-cuda-s2i-py38-ubi8-1-build
The following build pod was evicted:
11.0.3-cuda-s2i-thoth-ubi8-py38-1-build
Usually, it is the first build pod (11.0.3-cuda-s2i-core-ubi8-1-build) that gets evicted.
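
The pod states listed above could be collected with a small script rather than by hand. The following is a minimal sketch, not part of RHODS: the function name classify_cuda_build_pods and the name pattern are assumptions derived from the pod names quoted in this report. It expects the parsed JSON of `oc get pods -o json` and relies on Kubernetes marking evicted pods with phase "Failed" and reason "Evicted".

```python
import re

# Assumed naming pattern, inferred from the pod names quoted in this report.
CUDA_BUILD_POD = re.compile(r"^11\.0\.3-cuda-s2i-.+-build$")


def classify_cuda_build_pods(pod_list):
    """Group CUDA build pods by outcome.

    pod_list is the parsed JSON of `oc get pods -o json` (a Kubernetes List).
    Returns a dict with the keys "completed", "evicted", "failed" and "other".
    """
    result = {"completed": [], "evicted": [], "failed": [], "other": []}
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        if not CUDA_BUILD_POD.match(name):
            continue  # not a CUDA build pod
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            result["completed"].append(name)
        elif phase == "Failed" and status.get("reason") == "Evicted":
            # Keep the status message: it should name the exhausted resource.
            result["evicted"].append((name, status.get("message", "")))
        elif phase == "Failed":
            result["failed"].append(name)
        else:
            result["other"].append(name)
    return result
```

To use it, pipe the output of `oc get pods -n <namespace> -o json` into a script that calls `classify_cuda_build_pods(json.load(sys.stdin))`. The namespace is left as a placeholder because this report does not state where the build pods run. For evicted pods, the stored message is where an ephemeral-storage shortage would be expected to show up.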