OpenShift Hive / HIVE-2226

[Spike] Hibernation: Hack ClusterOperator statuses to detect staleness

Details

    • Story
    • Resolution: Can't Do
    • Critical
    • None
    • openshift-4.11, openshift-4.12, openshift-4.13, openshift-4.14
    • False
    • None
    • False

    Description

      This idea came from sdodson_jira in combination with this enhancement.

      Problem

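      When resuming from hibernation, we declare the cluster Running only when all of the spoke's ClusterOperators are healthy. However, we have no way of knowing whether the Cluster Version Operator (CVO) has started properly and refreshed these statuses, because the lastTransitionTime for a condition is only updated when its status changes. A status that was healthy before hibernation and one that has been freshly confirmed healthy after resume look identical.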

      Current Workaround

      Today, once we're satisfied Nodes are healthy, we have a hardcoded 2-minute sleep to give CVO a chance to grind into action. After that, we assume any ClusterOperators reporting healthy are actually healthy and proceed with the logic of verifying the rest.

      This is terrible. For one thing, it means we may be waiting up to 2 minutes unnecessarily before declaring the cluster resumed. For another, 2 minutes is not guaranteed to be enough, so we may be getting false positives anyway.

      Suggested Workaround

      Once nodes are healthy, the hibernation (or powerstate) controller iterates over all the ClusterOperators and sets one or more of their status conditions to Unknown. (As a courtesy to human viewers, we can also update the reason and/or message to indicate that we're resuming from hibernation; if a human sees such a message, it means CVO ain't up yet and the status of those COs is stale.)

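      Because the status was forced to Unknown, whatever real status gets written next is a status change and therefore bumps lastTransitionTime. Now we can compare the lastTransitionTime of the condition(s) to when the resume was kicked off (the lastTransitionTime of the Hibernating condition) to assure ourselves that their current status is not stale when we check it.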

      TODO

      Validate this solution with the CVO and/or OTA teams to make sure what we're doing is allowable and not likely to explode anything.

          People

            efried.openshift Eric Fried

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0m
                Time Spent: 6h