- Bug
- Resolution: Done
- Major
- quay.io/modh/rhods-operator-live-catalog:1.27.0-rhods-8152
- RHODS 1.27, RHODS 1.28
Description of problem:
I was contacted by a customer whose RHODS environment was no longer spawning notebooks successfully.
Upon investigation, it was found that many pods in the namespace redhat-ods-applications had been in a Pending state for a few hours, including at least some of the Notebook Controller pods.
This situation was due to a few factors:
- RHODS 1.25 (cloud version) had been released earlier in the day
- the pods were trying to roll over (during which resource requirements temporarily increase)
- the only machine pool where these pods could run was at capacity and not configured for auto-scaling.
I told the customer to add a couple of machines to the default node pool; the pods then reached a Running state, the update to RHODS 1.25 completed, and users were once again able to spawn notebooks.
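The diagnosis and remediation described above can be sketched with standard oc commands; this is a sketch only, the placeholder names in angle brackets are assumptions, and exact resource names vary per cluster:

```shell
# Sketch only; assumes cluster-admin access via oc. Names in <> are placeholders.

# List pods stuck in Pending in the RHODS applications namespace:
oc get pods -n redhat-ods-applications --field-selector=status.phase=Pending

# Inspect scheduling events for one of them
# (look for "Insufficient cpu" / "Insufficient memory" in Events):
oc describe pod <pending-pod-name> -n redhat-ods-applications

# Check current worker capacity, then add a couple of machines:
oc get machinesets -n openshift-machine-api
oc scale machineset <machineset-name> -n openshift-machine-api --replicas=<current+2>
```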
Prerequisites (if any, like setup, operators/versions):
Steps to Reproduce:
- Deploy RHODS
- fill up the cluster so that no new pods can be added
- trigger an upgrade for RHODS
Actual results:
RHODS gets stuck during the upgrade and stops working.
Expected results:
RHODS gets stuck during the upgrade and stops working, but SRE notices, connects to the cluster, adds a couple of nodes so that RHODS can finish its update, then removes the extra nodes.
Reproducibility (Always/Intermittent/Only Once):
Intermittent; I believe this is the second time I have seen this.
Build Details:
RHODS 1.24->1.25
Workaround:
The customer should maintain some "headroom" in their cluster, but we would need to tell them how much headroom is required. At a minimum, the customer should configure some autoscaling headroom.
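One way to provide that autoscaling headroom is OpenShift's cluster autoscaler. A minimal sketch, assuming a ClusterAutoscaler resource is already deployed; the MachineSet name below is a placeholder:

```yaml
# Sketch only: allows the worker machine set to grow during upgrades.
# "worker-machineset" is a placeholder; use the real MachineSet name.
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-autoscaler
  namespace: openshift-machine-api
spec:
  minReplicas: 2
  maxReplicas: 4
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: worker-machineset
```

With this in place, the extra machines needed during an operator rollout would be added and removed automatically instead of requiring manual SRE intervention.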
Additional info: