Red Hat OpenShift Data Science / RHODS-1619

Fix potential issue in JupyterHub HA leader election


There is a small chance that, due to network issues, the JupyterHub pods won't be able to see each other, which would cause a non-leader pod to assume there is no leader and become the leader itself. In that case, multiple pods would be running the JupyterHub server, which can potentially cause issues during spawn operations. This error shouldn't affect existing user notebook pods. A simple fix is to delete the JupyterHub server pods and let them perform the leader election again.
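As an interim measure, that workaround can be scripted. Below is a minimal sketch using the official kubernetes Python client; the namespace and the app=jupyterhub label selector are assumptions for illustration, not values taken from this issue.

```python
# Sketch of the manual workaround: delete the JupyterHub server pods so that
# the Deployment recreates them and a fresh leader election takes place.
# The namespace and label selector below are assumptions for illustration.
from kubernetes import client, config

NAMESPACE = "redhat-ods-applications"   # assumed JupyterHub namespace
LABEL_SELECTOR = "app=jupyterhub"       # assumed pod label

try:
    config.load_incluster_config()      # when run inside the cluster
except config.ConfigException:
    config.load_kube_config()           # when run from a workstation

core = client.CoreV1Api()
pods = core.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
for pod in pods.items:
    print(f"Deleting {pod.metadata.name} to force a new leader election")
    core.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
```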
Sprint: IDH Sprint 7, IDH Sprint 8, IDH Sprint 9

As part of implementing High Availability for JupyterHub, we moved from 1 to 3 replicas using a leader election strategy.

With this implementation we can run into inconsistencies: one pod may think it is still the elected leader while the others have already replaced it because of network problems.

Implement a check to detect whether a running pod is still the elected leader and, if it is not, delete it.

The following diagram details the error:

If the pods run into network problems, a new election might be triggered while the old container still thinks it is the leader. Once the network issues are resolved there will be two leaders, because there is no mechanism to probe the leader election and restart the stale pod.
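One possible shape for that check is sketched below with the official kubernetes Python client. It assumes the election result is recorded in a coordination.k8s.io Lease (the Lease name jupyterhub-leader and the namespace are hypothetical) and deletes the pod when it believes it is the leader but the recorded holder is a different pod.

```python
# Hypothetical leadership self-check: a pod that believes it is the leader
# verifies the cluster-recorded holder and deletes itself if it has been
# replaced. Lease name and namespace are assumptions for illustration.
import os
from kubernetes import client, config

NAMESPACE = os.environ.get("POD_NAMESPACE", "redhat-ods-applications")  # assumed
LEASE_NAME = "jupyterhub-leader"                                        # hypothetical
POD_NAME = os.environ.get("HOSTNAME", "")  # pod name, set by Kubernetes

def recorded_leader() -> str:
    """Return the holder identity stored in the leader-election Lease."""
    lease = client.CoordinationV1Api().read_namespaced_lease(LEASE_NAME, NAMESPACE)
    return lease.spec.holder_identity or ""

def self_check(thinks_it_is_leader: bool) -> None:
    """Delete this pod if it thinks it is the leader but the Lease disagrees."""
    if thinks_it_is_leader and recorded_leader() != POD_NAME:
        client.CoreV1Api().delete_namespaced_pod(POD_NAME, NAMESPACE)

if __name__ == "__main__":
    config.load_incluster_config()
    self_check(thinks_it_is_leader=True)
```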

       

[SPIKE] We have already considered using a liveness probe script: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

The main issue with this exploration is that we could isolate the pod with a readinessProbe (stop routing traffic to it) and restart the container with a livenessProbe, but both probes work at the container level, not the pod level, so they would not address the problem in the pod's sidecar container.
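For reference, the probe idea could look roughly like the script below: an exec livenessProbe command that exits non-zero when this container claims leadership but the Lease names a different holder, so the kubelet restarts it. As noted above, this only restarts the probed container, not the sidecar in the same pod. The Lease name, namespace, and the I_AM_LEADER flag are assumptions.

```python
#!/usr/bin/env python3
# Sketch of a liveness-probe script: exit 1 when this container believes it is
# the leader but the Lease records a different holder, so the kubelet restarts
# the container. Lease name, namespace, and the I_AM_LEADER flag are assumed.
import os
import sys
from kubernetes import client, config

def main() -> int:
    config.load_incluster_config()
    lease = client.CoordinationV1Api().read_namespaced_lease(
        "jupyterhub-leader",                                         # hypothetical
        os.environ.get("POD_NAMESPACE", "redhat-ods-applications"),  # assumed
    )
    holder = lease.spec.holder_identity or ""
    if os.environ.get("I_AM_LEADER") == "true" and holder != os.environ.get("HOSTNAME"):
        return 1  # stale leader: fail the probe so the container is restarted
    return 0      # healthy

if __name__ == "__main__":
    sys.exit(main())
```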

       

       

