  1. Red Hat OpenShift Data Science
  2. RHODS-948

CUDA build fails/gets evicted on PSI clusters


    • 2
    • False
    • False
    • CUDA build is consistently successful in PSI clusters
    • No
    • 1.0.16
    • No
    • Undefined
    • Yes
    • Yes
    • None
    • IDH Sprint 3

      Description of problem:

      In PSI clusters, CUDA builds consistently fail or get evicted (to my knowledge, always because of ephemeral storage).

      This results in the CUDA, PyTorch and TensorFlow images never being available in the Spawner UI.

      Prerequisites (if any, like setup, operators/versions):

      Any RHODS version shipping with the CUDA build, on a PSI OCP cluster.
      Currently observed in RHODS 1.0.14 on OpenShift 4.7.4

      Steps to Reproduce

      1. Install RHODS
      2. Wait for the CUDA build to complete/fail/get evicted
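
To check how step 2 ended, one option is to dump the pod list as JSON (for example with `oc get pods -o json` in the namespace where the builds run) and filter it for evicted CUDA build pods. The following is a minimal sketch, not part of the original report: the function name `evicted_build_pods` is hypothetical, the name prefix is taken from the pod names listed under Build Details, and it assumes evicted pods carry `status.reason` set to `Evicted`, which is how Kubernetes normally marks them.

```python
import json


def evicted_build_pods(pod_list_json, name_prefix="11.0.3-cuda-s2i-"):
    """Return (name, message) pairs for evicted pods matching name_prefix.

    pod_list_json is the text of a Kubernetes pod list in JSON form.
    The default prefix is an assumption based on the build pod names
    mentioned in this report.
    """
    pods = json.loads(pod_list_json).get("items", [])
    evicted = []
    for pod in pods:
        name = pod.get("metadata", {}).get("name", "")
        status = pod.get("status", {})
        # Evicted pods are reported with reason "Evicted"; the message
        # usually names the resource that ran low.
        if name.startswith(name_prefix) and status.get("reason") == "Evicted":
            evicted.append((name, status.get("message", "")))
    return evicted
```

If the eviction is caused by ephemeral storage, the message returned for each pod would be expected to mention it.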

      Actual results:

      CUDA build pods always get evicted or fail outright. As a result, the CUDA, PyTorch and TensorFlow images never become available in the spawner.

      Expected results:

      The CUDA build succeeds, and the CUDA, PyTorch and TensorFlow images are available to be spawned.

      Reproducibility (Always/Intermittent/Only Once):

      Always

      Build Details:

      RHODS 1.0.14
      OCP cluster in PSI, running OpenShift 4.7.4
      The CUDA build pods are called 11.0.3-cuda-s2i-xxx-ubi8-1-build

       

      Additional info:

      In my last installation attempt, the following build pods successfully completed:
      11.0.3-cuda-s2i-core-ubi8-1-build
      11.0.3-cuda-s2i-base-ubi8-1-build
      11.0.3-cuda-s2i-py38-ubi8-1-build 

      The following build pod was evicted:

      11.0.3-cuda-s2i-thoth-ubi8-py38-1-build

      Usually, the first build pod (11.0.3-cuda-s2i-core-ubi8-1-build) would get evicted.

              acorvin@redhat.com Alex Corvin
              rhn-support-lgiorgi Luca Giorgi
              Pablo Felix Pablo Felix (Inactive)