  Red Hat OpenShift Data Science / RHODS-1977

Cannot access JupyterLab server if jupyterhub leader pod dies during spawn


Details

    • If the JupyterHub leader pod failed while the notebook server was being launched, the user could not access their notebook server until the pod restarted, which took approximately ten minutes. This process has been improved so that the user is redirected to their server when a new leader pod is elected. If this process times out, users see a 504 Gateway Timeout error and can refresh to access their server.
    • Bug Fix
    • Done
    • MODH Sprint 31, MODH Sprint 32, MODH Sprint 33, MODH Sprint 34

    Description

      Description of problem:

      If the jupyterhub leader pod dies while a jupyterlab server is spawning, the user will never be able to access the server (although the server pod is running in the rhods-notebooks namespace).

      Refreshing the page empties the progress bar for the spawn and removes all events; the same happens when clicking the JupyterHub logo in the upper left.
      Clicking the `Home` button next to the logo brings the user to the "Hub Control Panel" with the "Stop My Server"/"My Server" buttons. Clicking "My Server" brings the user back to the progress bar page, while clicking "Stop My Server" opens an error pop-up saying:

      API request failed (400): {username} is pending check, please wait

      which can only be closed. The user is stuck waiting until the server pod is killed by OpenShift, at which point they are able to reach the spawner page again.
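      For reference, the hub's own view of this stuck state can be inspected over the standard JupyterHub REST API. The sketch below is a minimal example only; the route hostname, API token, and username are placeholders rather than values from this report, and an admin-scoped token is assumed.

```python
# Minimal sketch: ask JupyterHub whether it still considers the spawn pending,
# then try to stop the server. HUB_URL, API_TOKEN and USERNAME are
# placeholders/assumptions, not values from this report.
import requests

HUB_URL = "https://<rhods-jupyterhub-route>/hub/api"  # assumed route hostname
API_TOKEN = "<admin-api-token>"                       # assumed admin token
USERNAME = "<username>"

headers = {"Authorization": f"token {API_TOKEN}"}

# The user model reports a "pending" state while a spawn or stop is in flight.
user = requests.get(f"{HUB_URL}/users/{USERNAME}", headers=headers).json()
print("pending:", user.get("pending"))

# While the hub still considers the spawn pending, stopping the server is
# expected to fail with a 400 similar to the pop-up quoted above.
resp = requests.delete(f"{HUB_URL}/users/{USERNAME}/server", headers=headers)
print(resp.status_code, resp.text)
```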

       

      If the leader pod dies at around the same time the server becomes ready (i.e. the message "Server ready at /user/{username}/" is shown below the progress bar and the page is trying to redirect to the JupyterLab server), the user is served a 504 Gateway Timeout error page. Refreshing the page brings the user into the scenario described above.
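      From outside the browser, the same symptom can be checked by requesting the user's server URL directly. The sketch below is an illustration only: the route hostname is a placeholder, authentication is not handled, and depending on where in the chain the timeout occurs an unauthenticated request may be redirected to login instead of returning the 504.

```python
# Sketch: probe the user's server URL and report the HTTP status.
# ROUTE and USERNAME are placeholders; authentication is not handled here.
import requests

ROUTE = "https://<rhods-jupyterhub-route>"  # assumed route hostname
USERNAME = "<username>"

resp = requests.get(f"{ROUTE}/user/{USERNAME}/", allow_redirects=False, timeout=120)
print(resp.status_code)  # 504 Gateway Timeout in the scenario described above
```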

      Prerequisites (if any, like setup, operators/versions):

      RHODS 1.1.1-41 on OSD

      Steps to Reproduce

      1. Install RHODS
      2. Wait for JupyterHub to be online and for a leader pod to be elected
      3. Start spawning a JupyterLab server with any image
      4. Kill/restart the leader pod while the spawn is in progress (or almost finished, to reproduce the 504 scenario described above); a scripted variant is sketched after these steps
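      For step 4, a scripted variant using the kubernetes Python client is sketched below. The namespace ("redhat-ods-applications") and label selector ("deploymentconfig=jupyterhub") are assumptions about how RHODS deploys the hub, not values from this report; verify them with `oc get pods` and target the current leader pod if it can be identified.

```python
# Sketch: delete the JupyterHub pod(s) while a notebook spawn is in progress.
# NAMESPACE and SELECTOR are assumptions; adjust them to the actual install.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "redhat-ods-applications"     # assumed RHODS applications namespace
SELECTOR = "deploymentconfig=jupyterhub"  # assumed label on the hub pods

for pod in v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items:
    # Deleting the pod currently acting as leader triggers the bug; if the
    # leader cannot be identified, deleting all hub pods has the same effect.
    print("deleting", pod.metadata.name)
    v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
```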

      Actual results:

      The user is never able to reach the spawned server (which is running in its own pod). The user cannot get back to the spawner page until the timeout ends (the standard timeout for failed spawns? Not sure; either way, the server pod is destroyed after a few minutes of waiting).

      Expected results:

      The user should be able to reach their server even if the leader pod dies during the spawn process. Alternatively, they should be able to kill the server and get back to the spawner without waiting for the timeout.

      Reproducibility (Always/Intermittent/Only Once):

      Always

      Build Details:

      RHODS 1.1.1-41 on OSD

      Additional info:

      Attaching the logs for a couple of spawned server pods, which contain some error messages; I am almost certain the second pod's error is tied to the leader JupyterHub pod not being available, but I am not sure about the other.
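      For completeness, here is a minimal sketch for collecting those notebook pod logs with the kubernetes Python client; the "jupyterhub-nb" name prefix is an assumed naming convention for the single-user pods, while the rhods-notebooks namespace comes from the description above.

```python
# Sketch: dump logs of spawned notebook server pods in rhods-notebooks.
# The "jupyterhub-nb" prefix is an assumed naming convention for these pods.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "rhods-notebooks"

for pod in v1.list_namespaced_pod(NAMESPACE).items:
    if pod.metadata.name.startswith("jupyterhub-nb"):
        log = v1.read_namespaced_pod_log(pod.metadata.name, NAMESPACE)
        with open(f"{pod.metadata.name}.log", "w", encoding="utf-8") as f:
            f.write(log)
```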

      Attachments


          People

            vpavlin@redhat.com Vaclav Pavlin (Inactive)
            rhn-support-lgiorgi Luca Giorgi
            Votes: 0
            Watchers: 10
