Story
Resolution: Can't Do
Critical
openshift-4.11, openshift-4.12, openshift-4.13, openshift-4.14
This idea came from sdodson_jira in combination with this enhancement.
Problem
When resuming from hibernation, we declare the cluster Running only when all of the spoke's ClusterOperators are healthy. However, we have no way of knowing whether CVO has started properly and checked these statuses, because the lastTransitionTime for a condition is only updated when its status changes.
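For illustration, here is a minimal Go sketch of the naive check, using the github.com/openshift/api/config/v1 types; the helper name coHealthy is made up for this example and is not Hive's actual code:

```go
package hibernation

import (
	configv1 "github.com/openshift/api/config/v1"
)

// coHealthy reports whether a ClusterOperator looks healthy from its
// conditions alone. Nothing here tells us whether CVO has re-evaluated those
// conditions since the cluster resumed, so a stale pre-hibernation "healthy"
// reading looks identical to a current one.
func coHealthy(co *configv1.ClusterOperator) bool {
	var available, degraded bool
	for _, cond := range co.Status.Conditions {
		switch cond.Type {
		case configv1.OperatorAvailable:
			available = cond.Status == configv1.ConditionTrue
		case configv1.OperatorDegraded:
			degraded = cond.Status == configv1.ConditionTrue
		}
	}
	return available && !degraded
}
```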
Current Workaround
Today, once we're satisfied Nodes are healthy, we have a hardcoded 2m sleep to give CVO a chance to grind into action. After that, we assume any healthy COs are actually healthy and proceed with the logic of verifying the rest.
This is terrible. For one thing, we may be waiting up to 2m unnecessarily before declaring the cluster resumed. For another, 2m may or may not actually be enough depending on how quickly CVO comes up, so we may be getting false positives anyway.
Suggested Workaround
Once nodes are healthy, the hibernation (or powerstate) controller iterates over all the ClusterOperators and sets one or more of their status conditions to Unknown. (As a courtesy to human viewers, we can also update the reason and/or message to indicate that we're resuming from hibernation; if a human sees such a message, it means CVO isn't up yet and the status of those COs is stale.)
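A rough sketch of what that could look like, assuming the controller has a clientset for the spoke built from github.com/openshift/client-go; the function name markConditionsUnknown, the choice of the Available condition, and the reason/message strings are illustrative, not prescribed here:

```go
package hibernation

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markConditionsUnknown flips every ClusterOperator's Available condition to
// Unknown, so that a later True can only have been written by a running CVO.
func markConditionsUnknown(ctx context.Context, client configclient.Interface) error {
	cos, err := client.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range cos.Items {
		co := &cos.Items[i]
		for j := range co.Status.Conditions {
			if co.Status.Conditions[j].Type != configv1.OperatorAvailable {
				continue
			}
			co.Status.Conditions[j].Status = configv1.ConditionUnknown
			co.Status.Conditions[j].Reason = "ResumingFromHibernation"
			co.Status.Conditions[j].Message = "Status is stale; waiting for CVO to re-evaluate after resume."
			co.Status.Conditions[j].LastTransitionTime = metav1.Now()
		}
		if _, err := client.ConfigV1().ClusterOperators().UpdateStatus(ctx, co, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```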
Now we can compare the lastTransitionTime of the condition(s) to when the resume was kicked off (the lastTransitionTime of the Hibernating condition) to assure ourselves that their current status is not stale when we check it.
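A sketch of the corresponding freshness check; resumeStarted stands in for the lastTransitionTime of the Hibernating condition when the resume was kicked off, and conditionFresh is an illustrative name:

```go
package hibernation

import (
	"time"

	configv1 "github.com/openshift/api/config/v1"
)

// conditionFresh reports whether the Available condition transitioned after
// the resume began, i.e. CVO has re-evaluated it since we set it to Unknown.
func conditionFresh(co *configv1.ClusterOperator, resumeStarted time.Time) bool {
	for _, cond := range co.Status.Conditions {
		if cond.Type != configv1.OperatorAvailable {
			continue
		}
		return cond.Status == configv1.ConditionTrue &&
			cond.LastTransitionTime.Time.After(resumeStarted)
	}
	return false
}
```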
TODO
Validate this solution with the CVO and/or OTA teams to make sure what we're doing is allowable and not likely to explode anything.
- relates to OCPSTRAT-543 Shutdown/Resume of managed OSD/ROSA clusters (Closed)