  1. Red Hat OpenShift Data Science
  2. RHODS-4223

Random restarts for Traefik and JupyterHub pods


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Workbenches

      Description of problem:

      On an OSD cluster in stage with RHODS installed as an add-on (1.11.0-5), together with the GPU add-on, I am seeing random but fairly frequent restarts of the JupyterHub and Traefik pods.
      I was able to watch a restart live only once: a Traefik container failing in one of the three Traefik pods seems to have caused a restart of the current JH leader pod.

      There is also one pod in the GPU add-on namespace (controller-manager) with roughly the same number of restarts. I am attaching its logs as well, since it might be what is causing the issue: we have not seen these restarts in clusters where the GPU add-on was not installed.

      Prerequisites (if any, like setup, operators/versions):

      RHODS 1.11.0-5 installed as add-on on OSD

      Steps to Reproduce

      1. Install RHODS
      2. (Maybe install GPU add-on?)
      3. Keep using the cluster as normal; the restarts seem to have started a few hours (4?) after RHODS was first installed

      Actual results:

      Multiple restarts (5+) on JH and Traefik pods, at seemingly random times
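
To put numbers on restarts like these, the following is a minimal sketch, assuming the pod list has been exported as JSON (for example with `oc get pods -n <namespace> -o json`). It reads only the standard Kubernetes pod status fields `restartCount` and `lastState.terminated.reason`. The function name `restart_report`, the pod and container names, and the restart counts in `SAMPLE` are hypothetical placeholders for illustration, not data from the affected cluster.

```python
import json
from typing import Dict, List, Tuple


def restart_report(pod_list: Dict, min_restarts: int = 1) -> List[Tuple[str, str, int, str]]:
    """Return (pod, container, restarts, last termination reason) rows,
    sorted with the highest restart count first."""
    rows = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            restarts = status.get("restartCount", 0)
            if restarts < min_restarts:
                continue
            # lastState is empty for containers that have never been restarted.
            terminated = status.get("lastState", {}).get("terminated") or {}
            rows.append((pod_name, status["name"], restarts,
                         terminated.get("reason", "unknown")))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Hypothetical sample in the shape of `oc get pods -o json` output.
# All names and counts below are made up for illustration.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "jupyterhub-example"},
     "status": {"containerStatuses": [
       {"name": "jupyterhub", "restartCount": 6,
        "lastState": {"terminated": {"reason": "Error", "exitCode": 1}}}]}},
    {"metadata": {"name": "traefik-proxy-example"},
     "status": {"containerStatuses": [
       {"name": "traefik-proxy", "restartCount": 7,
        "lastState": {"terminated": {"reason": "Error", "exitCode": 1}}},
       {"name": "configmap-puller", "restartCount": 0, "lastState": {}}]}}
  ]
}
""")

if __name__ == "__main__":
    for pod, container, restarts, reason in restart_report(SAMPLE, min_restarts=5):
        print(f"{pod}/{container}: {restarts} restarts, last reason: {reason}")
```

Running the same function against the RHODS namespace and the GPU add-on namespace would make it easier to compare the restart counts of the JH, Traefik, and controller-manager pods side by side.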

      Expected results:

      No restarts for JH; a couple of restarts for Traefik during install

      Reproducibility (Always/Intermittent/Only Once):

      Observed on one cluster only

      Build Details:

      OSD running OCP 4.10 latest, RHODS 1.11.0-5, GPU add-on v.1.10.1

      Workaround:

      Additional info:

              Assignee: Unassigned
              Reporter: Luca Giorgi (rhn-support-lgiorgi)