Red Hat OpenShift Data Science / RHODS-1526

PyTorch and TensorFlow build pods timing out

    • MODH Sprint 27, MODH Sprint 28

      Description of problem:

      When installing RHODS 1.0.17, the PyTorch build pod, the TensorFlow build pod, or both time out and fail.
      As a result, the affected images are not available in the JupyterHub (JH) spawner.

      As a workaround, manually restarting the failed build has consistently resolved the issue so far, but this is not a long-term solution.
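
      The manual restart described above could be scripted roughly as follows. This is only a sketch under assumptions, not a tested fix: it assumes the `oc` CLI is logged in to the cluster, that the notebook builds live in the `redhat-ods-applications` namespace, and that a timed-out build ends up in the `Failed` phase. The helper names (`failed_build_configs`, `start_build_command`, `restart_failed_builds`) are hypothetical and exist only for illustration.

```python
import json
import subprocess

# Assumption: namespace where the RHODS notebook image builds run.
NAMESPACE = "redhat-ods-applications"

# Label that OpenShift sets on each Build to point at its BuildConfig.
BUILD_CONFIG_LABEL = "openshift.io/build-config.name"

# Phases meaning a build already succeeded or is still in progress,
# in which case a further restart is not needed.
DONE_OR_ACTIVE = {"Complete", "Running", "Pending", "New"}


def failed_build_configs(builds_json):
    """Return sorted BuildConfig names that have a Failed build and no
    completed or in-progress build.

    builds_json is the text printed by `oc get builds -o json`.
    """
    phases = {}
    for build in json.loads(builds_json).get("items", []):
        name = build.get("metadata", {}).get("labels", {}).get(BUILD_CONFIG_LABEL)
        if name is None:
            continue  # build not created from a BuildConfig
        phase = build.get("status", {}).get("phase")
        phases.setdefault(name, set()).add(phase)
    return sorted(
        name
        for name, seen in phases.items()
        if "Failed" in seen and not (seen & DONE_OR_ACTIVE)
    )


def start_build_command(build_config, namespace=NAMESPACE):
    """Build the `oc start-build` argument list for one BuildConfig."""
    return ["oc", "start-build", build_config, "-n", namespace]


def restart_failed_builds(namespace=NAMESPACE):
    """List builds with `oc`, then start a new build for each failed one."""
    result = subprocess.run(
        ["oc", "get", "builds", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    to_restart = failed_build_configs(result.stdout)
    for name in to_restart:
        subprocess.run(start_build_command(name, namespace), check=True)
    return to_restart
```

      Calling `restart_failed_builds()` once the minimal-gpu build has finished would start a new build for each BuildConfig that has only failed builds. Like the manual restart, this only automates the workaround and does not address why the first build times out.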

      Prerequisites (if any, like setup, operators/versions):

      RHODS 1.0.17 running on OCP 4.7.19 on PSI

      Steps to Reproduce

      1. Install RHODS
      2. Wait for the preliminary CUDA builds to complete
      3. Wait for the minimal-gpu build to complete
      4. Either PyTorch or TensorFlow is built next. Usually one of the two builds fails due to a timeout; on one occasion both failed for the same reason.

      Actual results:

      One or both of the two builds fail due to a timeout.

      Expected results:

      Builds are completed and images are available

      Reproducibility (Always/Intermittent/Only Once):

      Always in my testing, though it may be intermittent and I have simply been unlucky.

      Build Details:

      RHODS 1.0.17 running on OCP 4.7.19 on PSI

      Additional info:

      Attaching pod events and logs:
      s2i-tensorflow-gpu-cuda-11.0.3-notebook-1-build-sti-build.log

              tmckay@redhat.com Trevor Mckay (Inactive)
              rhn-support-lgiorgi Luca Giorgi
              Luca Giorgi Luca Giorgi