-
Bug
-
Resolution: Done
-
Blocker
-
RHODS_1.1_GA
-
2
-
False
-
False
-
None
-
No
-
1.1.1
-
No
-
Undefined
-
No
-
Yes
-
None
-
-
MODH Sprint 27, MODH Sprint 28
Description of problem:
When installing RHODS 1.0.17, one or both of the PyTorch and TensorFlow build pods time out and fail.
As a result, the affected images are not available in the JupyterHub (JH) spawner.
As a workaround, manually restarting the failed build seems to resolve the issue consistently, but it is not a long-term solution.
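The manual workaround above could be scripted along these lines. This is only a minimal sketch, not part of the original report. It assumes that the builds live in the `redhat-ods-applications` namespace, that a timed-out build ends up in the `Failed` phase, and that each Build carries the `openshift.io/build-config.name` label. The helper names `failed_build_configs` and `restart_commands` are hypothetical.

```python
import json
import subprocess

# Assumption: namespace where the RHODS notebook image builds run.
NAMESPACE = "redhat-ods-applications"
# Label that OpenShift sets on a Build to point at its BuildConfig.
BUILD_CONFIG_LABEL = "openshift.io/build-config.name"


def failed_build_configs(builds_json):
    """Return BuildConfig names that have a Failed build and no Complete build."""
    items = json.loads(builds_json).get("items", [])
    phases = {}
    for build in items:
        name = build.get("metadata", {}).get("labels", {}).get(BUILD_CONFIG_LABEL)
        if name:
            phase = build.get("status", {}).get("phase")
            phases.setdefault(name, set()).add(phase)
    return sorted(
        name
        for name, seen in phases.items()
        if "Failed" in seen and "Complete" not in seen
    )


def restart_commands(build_configs, namespace):
    """Build one `oc start-build` argument list per BuildConfig."""
    return [["oc", "start-build", name, "-n", namespace] for name in build_configs]


if __name__ == "__main__":
    # Requires a logged-in `oc` client with access to the namespace.
    listing = subprocess.run(
        ["oc", "get", "builds", "-n", NAMESPACE, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    for command in restart_commands(failed_build_configs(listing), NAMESPACE):
        print("Restarting:", " ".join(command))
        subprocess.run(command, check=True)
```

Skipping any BuildConfig that already has a `Complete` build keeps the script from re-triggering a build that was restarted successfully by hand.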
Prerequisites (if any, like setup, operators/versions):
RHODS 1.0.17 running on OCP 4.7.19 on PSI
Steps to Reproduce:
- Install RHODS
- Wait for the preliminary CUDA builds to complete
- Wait for the minimal-gpu build to complete
- Either PyTorch or TensorFlow is built next. Usually one of the two fails due to a timeout; on one occasion both failed for the same reason.
Actual results:
One or both of the two builds fail due to a timeout.
Expected results:
Both builds complete and the images are available in the JH spawner.
Reproducibility (Always/Intermittent/Only Once):
Always in my testing, although it might be intermittent and I've simply been unlucky.
Build Details:
RHODS 1.0.17 running on OCP 4.7.19 on PSI
Additional info:
Attaching pod events and logs:
s2i-tensorflow-gpu-cuda-11.0.3-notebook-1-build-sti-build.log
- blocks: RHODS-1497 As a QE, I want to have a minimal test case for the PyTorch image (Closed)
- is related to: RHODS-1548 Cuda build chain is broken in v1.0.17 (Closed)
- relates to: RHODS-1536 Unable to finish CUDA Builds in Open Shift Dedicated due to unschedulable pod (Closed)