  Red Hat OpenShift Data Science / RHODS-1977

Cannot access JupyterLab server if jupyterhub leader pod dies during spawn


Details

    • If the JupyterHub leader pod failed while the notebook server was being launched, the user could not access their notebook server until the pod restarted, which took approximately ten minutes. This process has been improved so that the user is redirected to their server when a new leader pod is elected. If this process times out, users see a 504 Gateway Timeout error and can refresh to access their server.
    • Bug Fix
    • Done
    • MODH Sprint 31, MODH Sprint 32, MODH Sprint 33, MODH Sprint 34

    Description

      Description of problem:

      If the jupyterhub leader pod dies while a jupyterlab server is spawning, the user will never be able to access the server (although the server pod is running in the rhods-notebooks namespace).

      Refreshing the page empties the progress bar for the spawn and removes all events; the same happens when clicking the JupyterHub logo in the upper left.
      Clicking the `Home` button next to the logo brings the user to the "Hub Control Panel" with the "Stop My Server"/"My Server" buttons. Clicking "My Server" brings the user back to the progress bar page, while clicking "Stop My Server" opens an error pop-up saying:

      API request failed (400): {username} is pending check, please wait

      which can only be closed. The user is stuck waiting until the server pod is killed by OpenShift, at which point they are able to reach the spawner page again.
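      For reference, the hub's own view of this stuck state can be inspected over the standard JupyterHub REST API. The sketch below is a minimal example only; the route hostname, API token, and username are placeholders rather than values from this report, and an admin-scoped token is assumed.

```python
# Minimal sketch: ask JupyterHub whether it still considers the spawn pending,
# then try to stop the server. HUB_URL, API_TOKEN and USERNAME are
# placeholders/assumptions, not values from this report.
import requests

HUB_URL = "https://<rhods-jupyterhub-route>/hub/api"  # assumed route hostname
API_TOKEN = "<admin-api-token>"                       # assumed admin token
USERNAME = "<username>"

headers = {"Authorization": f"token {API_TOKEN}"}

# The user model reports a "pending" state while a spawn or stop is in flight.
user = requests.get(f"{HUB_URL}/users/{USERNAME}", headers=headers).json()
print("pending:", user.get("pending"))

# While the hub still considers the spawn pending, stopping the server is
# expected to fail with a 400 similar to the pop-up quoted above.
resp = requests.delete(f"{HUB_URL}/users/{USERNAME}/server", headers=headers)
print(resp.status_code, resp.text)
```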

       

      If the leader pod dies at around the same time the server becomes ready (i.e. the message "Server ready at /user/{username}/" is shown below the progress bar and the page is trying to redirect to the JupyterLab server), the user is served a 504 Gateway Timeout error page. Refreshing the page brings the user into the scenario described above.
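      From outside the browser, the same symptom can be checked by requesting the user's server URL directly. The sketch below is an illustration only: the route hostname is a placeholder, authentication is not handled, and depending on where in the chain the timeout occurs an unauthenticated request may be redirected to login instead of returning the 504.

```python
# Sketch: probe the user's server URL and report the HTTP status.
# ROUTE and USERNAME are placeholders; authentication is not handled here.
import requests

ROUTE = "https://<rhods-jupyterhub-route>"  # assumed route hostname
USERNAME = "<username>"

resp = requests.get(f"{ROUTE}/user/{USERNAME}/", allow_redirects=False, timeout=120)
print(resp.status_code)  # 504 Gateway Timeout in the scenario described above
```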

      Prerequisites (if any, like setup, operators/versions):

      RHODS 1.1.1-41 on OSD

      Steps to Reproduce

      1. Install RHODS
      2. Wait for JupyterHub to be online and for a leader pod to be elected
      3. Start spawning a JupyterLab server with any image
      4. Kill/restart the leader pod while the spawn is in progress (or almost finished, to reproduce the 504 scenario described above); a scripted variant is sketched after these steps
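      For step 4, a scripted variant using the kubernetes Python client is sketched below. The namespace ("redhat-ods-applications") and label selector ("deploymentconfig=jupyterhub") are assumptions about how RHODS deploys the hub, not values from this report; verify them with `oc get pods` and target the current leader pod if it can be identified.

```python
# Sketch: delete the JupyterHub pod(s) while a notebook spawn is in progress.
# NAMESPACE and SELECTOR are assumptions; adjust them to the actual install.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "redhat-ods-applications"     # assumed RHODS applications namespace
SELECTOR = "deploymentconfig=jupyterhub"  # assumed label on the hub pods

for pod in v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items:
    # Deleting the pod currently acting as leader triggers the bug; if the
    # leader cannot be identified, deleting all hub pods has the same effect.
    print("deleting", pod.metadata.name)
    v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
```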

      Actual results:

      The user is never able to reach the spawned server (which is running in its own pod). The user cannot get back to the spawner page until the timeout ends (the standard timeout for failed spawns? Not sure; either way, the server pod is destroyed after a few minutes of waiting).

      Expected results:

      The user should be able to reach their server even if the leader pod dies during the spawn process. Alternatively, they should be able to kill the server and get back to the spawner without waiting for the timeout.

      Reproducibility (Always/Intermittent/Only Once):

      Always

      Build Details:

      RHODS 1.1.1-41 on OSD

      Additional info:

      Attaching the logs for a couple of spawned server pods, which contain some error messages; I am almost certain the second pod's error is tied to the leader JupyterHub pod not being available, but I am not sure about the other.
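      For completeness, here is a minimal sketch for collecting those notebook pod logs with the kubernetes Python client; the "jupyterhub-nb" name prefix is an assumed naming convention for the single-user pods, while the rhods-notebooks namespace comes from the description above.

```python
# Sketch: dump logs of spawned notebook server pods in rhods-notebooks.
# The "jupyterhub-nb" prefix is an assumed naming convention for these pods.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "rhods-notebooks"

for pod in v1.list_namespaced_pod(NAMESPACE).items:
    if pod.metadata.name.startswith("jupyterhub-nb"):
        log = v1.read_namespaced_pod_log(pod.metadata.name, NAMESPACE)
        with open(f"{pod.metadata.name}.log", "w", encoding="utf-8") as f:
            f.write(log)
```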

      Attachments


          People

            vpavlin@redhat.com Vaclav Pavlin (Inactive)
            rhn-support-lgiorgi Luca Giorgi
            Votes: 0
            Watchers: 10
