Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: RHODS_1.1_GA
Affects Version/s: RHODS_1.1_GA
Component/s: Build and Release
Labels:
- Groomed
- IDH-Team

Story Points:
2
Epic Link:
Support notebook images
Blocked:
False
Ready:
False
Acceptance Criteria:

CUDA build is consistently successful in PSI clusters

Automated:
No
CDW devel_ack:
CDW docs_ack:
CDW pm_ack:
CDW qa_ack:
CDW release:
Fixed in Build:
1.0.16
Regression:
No
Release Note Text:
Undefined
Target Release:

RHODS_1.1_GA
Test Blocker:
Yes
Test Coverage:

Yes
Watchlist Impact:
None
Git Pull Request:
https://github.com/red-hat-data-services/odh-manifests/pull/118
Market:

Sprint:
IDH Sprint 3

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

In PSI clusters, CUDA builds consistently fail or get evicted (to my knowledge, always because of ephemeral storage).

This results in the CUDA, PyTorch and TensorFlow images never being available in the Spawner UI.

Prerequisites (if any, like setup, operators/versions):

Any RHODS version that ships the CUDA build, installed on a PSI OCP cluster.
Currently observed in RHODS 1.0.14 on OpenShift 4.7.4

Steps to Reproduce

1. Install RHODS
2. Wait for the CUDA build to complete, fail, or get evicted

Actual results:

CUDA build pods always get evicted or fail outright. As a result, the CUDA, PyTorch and TensorFlow images never become available in the spawner.

Expected results:

The CUDA build succeeds, and the CUDA, PyTorch and TensorFlow images are available to be spawned.

Reproducibility (Always/Intermittent/Only Once):

Always

Build Details:

RHODS 1.0.14
OCP cluster in PSI, running OpenShift 4.7.4
The CUDA build pods are called 11.0.3-cuda-s2i-xxx-ubi8-1-build

Additional info:

In my last installation attempt, the following build pods successfully completed:
11.0.3-cuda-s2i-core-ubi8-1-build
11.0.3-cuda-s2i-base-ubi8-1-build
11.0.3-cuda-s2i-py38-ubi8-1-build

The following build pod was evicted:

11.0.3-cuda-s2i-thoth-ubi8-py38-1-build

Usually, the first build pod (11.0.3-cuda-s2i-core-ubi8-1-build) is the one that gets evicted.
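
To check which build pods were evicted without reading the console by hand, something like the following sketch could be used. It relies on Kubernetes reporting evicted pods with `status.reason` set to `Evicted`. The helper name `evicted_build_pods` is hypothetical, and the namespace shown is an assumption that may differ between installations.

```python
import json
import subprocess


def evicted_build_pods(pod_list):
    """Return (name, message) for every evicted pod whose name ends in '-build'.

    `pod_list` is the parsed JSON of a pod list, e.g. from `oc get pods -o json`.
    """
    evicted = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "")
        status = pod.get("status", {})
        # Evicted pods are reported with reason "Evicted"; the message
        # normally names the resource that triggered the eviction.
        if name.endswith("-build") and status.get("reason") == "Evicted":
            evicted.append((name, status.get("message", "")))
    return evicted


if __name__ == "__main__":
    # Assumption: the CUDA builds run in this namespace; adjust as needed.
    namespace = "redhat-ods-applications"
    raw = subprocess.run(
        ["oc", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    for name, message in evicted_build_pods(json.loads(raw)):
        print(f"{name}: {message}")
```

If the printed message mentions ephemeral storage, that would be consistent with the cause suspected above.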


Screenshot 2021-07-28 at 12-45-36 Pods · Red Hat OpenShift Container Platform.png
2021/07/28 10:46 AM
137 kB
Jorge Garcia Oncins

Assignee: Alex Corvin

Reporter: Luca Giorgi

QA Contact: Pablo Felix (Inactive)

Votes: 0

Watchers: 9

Created: 2021/06/09 4:05 PM

Updated: 2023/02/17 9:37 PM

Resolved: 2021/06/29 6:09 PM
