Story
Resolution: Can't Do
Critical
openshift-4.11, openshift-4.12, openshift-4.13, openshift-4.14
This idea came from sdodson_jira in combination with this enhancement.
Problem
When resuming from hibernation, we declare the cluster Running only when all of the spoke's ClusterOperators are healthy. However, we have no way of knowing whether CVO has started properly and checked these statuses, because the lastTransitionTime for a condition is only updated when its status changes.
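For illustration, here is a minimal Go sketch of the naive check, using the github.com/openshift/api/config/v1 types; the helper name coHealthy is made up for this example and is not Hive's actual code:

```go
package hibernation

import (
	configv1 "github.com/openshift/api/config/v1"
)

// coHealthy reports whether a ClusterOperator looks healthy from its
// conditions alone. Nothing here tells us whether CVO has re-evaluated those
// conditions since the cluster resumed, so a stale pre-hibernation "healthy"
// reading looks identical to a current one.
func coHealthy(co *configv1.ClusterOperator) bool {
	var available, degraded bool
	for _, cond := range co.Status.Conditions {
		switch cond.Type {
		case configv1.OperatorAvailable:
			available = cond.Status == configv1.ConditionTrue
		case configv1.OperatorDegraded:
			degraded = cond.Status == configv1.ConditionTrue
		}
	}
	return available && !degraded
}
```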
Current Workaround
Today, once we're satisfied Nodes are healthy, we have a hardcoded 2m sleep to give CVO a chance to grind into action. After that, we assume any healthy COs are actually healthy and proceed with the logic of verifying the rest.
This is terrible. For one thing, we may be waiting up to 2m unnecessarily before declaring the cluster resumed. For another, 2m may or may not actually be enough depending on how quickly CVO comes up, so we may be getting false positives anyway.
Suggested Workaround
Once nodes are healthy, the hibernation (or powerstate) controller iterates over all the ClusterOperators and sets one or more of their status conditions to Unknown. (As a courtesy to human viewers, we can also update the reason and/or message to indicate that we're resuming from hibernation; if a human sees such a message, it means CVO isn't up yet and the status of those COs is stale.)
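A rough sketch of what that could look like, assuming the controller has a clientset for the spoke built from github.com/openshift/client-go; the function name markConditionsUnknown, the choice of the Available condition, and the reason/message strings are illustrative, not prescribed here:

```go
package hibernation

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markConditionsUnknown flips every ClusterOperator's Available condition to
// Unknown, so that a later True can only have been written by a running CVO.
func markConditionsUnknown(ctx context.Context, client configclient.Interface) error {
	cos, err := client.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range cos.Items {
		co := &cos.Items[i]
		for j := range co.Status.Conditions {
			if co.Status.Conditions[j].Type != configv1.OperatorAvailable {
				continue
			}
			co.Status.Conditions[j].Status = configv1.ConditionUnknown
			co.Status.Conditions[j].Reason = "ResumingFromHibernation"
			co.Status.Conditions[j].Message = "Status is stale; waiting for CVO to re-evaluate after resume."
			co.Status.Conditions[j].LastTransitionTime = metav1.Now()
		}
		if _, err := client.ConfigV1().ClusterOperators().UpdateStatus(ctx, co, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```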
Now we can compare the lastTransitionTime of the condition(s) to when the resume was kicked off (the lastTransitionTime of the Hibernating condition) to assure ourselves that their current status is not stale when we check it.
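A sketch of the corresponding freshness check; resumeStarted stands in for the lastTransitionTime of the Hibernating condition when the resume was kicked off, and conditionFresh is an illustrative name:

```go
package hibernation

import (
	"time"

	configv1 "github.com/openshift/api/config/v1"
)

// conditionFresh reports whether the Available condition transitioned after
// the resume began, i.e. CVO has re-evaluated it since we set it to Unknown.
func conditionFresh(co *configv1.ClusterOperator, resumeStarted time.Time) bool {
	for _, cond := range co.Status.Conditions {
		if cond.Type != configv1.OperatorAvailable {
			continue
		}
		return cond.Status == configv1.ConditionTrue &&
			cond.LastTransitionTime.Time.After(resumeStarted)
	}
	return false
}
```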
TODO
Validate this solution with the CVO and/or OTA teams to make sure what we're doing is allowable and not likely to explode anything.
- relates to OCPSTRAT-543 Shutdown/Resume of managed OSD/ROSA clusters (Closed)