Continuously Tested and Guaranteed Graceful Shutdown/Restart of OpenShift 4.X Clusters
- Epic
- Resolution: Unresolved
- To Do
- Impediment
Epic Goal
- Guarantee, through continuous testing and validation, the officially documented OpenShift 4.X "Graceful Restart" process, which OpenShift Hive and Red Hat Advanced Cluster Management use for cluster "Hibernation". This process should take far less time than a full provision; historically it has completed in under 10 minutes, and that remains the preferred target.
Why is this important?
- First, this functionality is documented as part of the Disaster Recovery scenarios and needs to be continuously validated to prevent regressions.
- Second, this functionality ships, as documented, in OpenShift Hive and Red Hat Advanced Cluster Management for Kubernetes as "Hibernation". It lets users power off their provisioned clusters to reduce cost, and create pools of hibernating clusters that can be resumed in far less time than a fresh provision (typically under 10 minutes).
- Third, OpenShift Dedicated and ROSA (both of which also use Hive) want to enable a "cluster hibernation" feature for development-type clusters (for example, powering down over the weekend) that relies on graceful restart. Today the restart is too error-prone to enable the feature we've built.
Scenarios
- The customer has a cluster that needs to be powered off and back on (for example, because of a migration from one AWS physical host to another), with full recovery after the restart.
- A Hive user or RHACM customer uses the UI or API to hibernate a cluster, or to create hibernated clusters in a clusterpool, and wishes to resume them rapidly, either to continue using the cluster or to take a cluster from the pool for CI or other purposes.
Acceptance Criteria
- CI - MUST be running successfully with tests automated
Previous Work (Optional):
Open questions:
- We have had frequent problems resuming clusters that have been hibernating for more than 24h. How can we run CI on this condition?
From efried.openshift:
In terms of developing a periodic test that runs in the CI env, as I suggested above, I think it looks like this:
- Set up a clusterpool in CI that's dedicated to this test only (so we have controlled/predictable hibernation times). size=1 ought to work, if we're starting off with a single periodic. Include "hive.openshift.io/resume-skips-cluster-operators": "true" in the clusterpool's .spec.labels. Make sure runningCount=0.
- Set up a periodic to run every X hours, where X is greater than the amount of hibernation time you want to vet by a comfortable margin that takes into account the time it takes for the pool to provision a replacement. E.g. if you want to test against a 24h hibernation, you would want the periodic to run every ~25h (or more). The extra hour is to account for 40m provision time plus a bit of a buffer. (Realistically I think you would need to go to the next 24h boundary in this case, because I think it's hard to tell cron "every 25h".)
- The content of the periodic should poll ClusterOperators (example) and pass if they become healthy within a predefined threshold of time; otherwise fail.
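The polling step in the last bullet could be sketched roughly as follows. This is a minimal illustration, not the periodic's actual implementation: the definition of "healthy" used here (Available=True, Progressing=False, Degraded=False) is an assumption, since the comment only says "healthy", and the function names, timeout, and polling interval are hypothetical. `fetch_with_oc` assumes an `oc` client already logged in to the resumed cluster.

```python
import json
import subprocess
import time
from typing import Callable, List

# Assumed definition of "healthy": standard ClusterOperator condition types
# with these statuses. The source text only says "healthy".
HEALTHY = {"Available": "True", "Progressing": "False", "Degraded": "False"}


def unhealthy_operators(operators: List[dict]) -> List[str]:
    """Return the names of ClusterOperators whose conditions differ from HEALTHY."""
    bad = []
    for op in operators:
        conditions = {c["type"]: c["status"]
                      for c in op.get("status", {}).get("conditions", [])}
        if any(conditions.get(t) != s for t, s in HEALTHY.items()):
            bad.append(op["metadata"]["name"])
    return bad


def fetch_with_oc() -> List[dict]:
    """Fetch ClusterOperators through the oc CLI (assumes oc is logged in)."""
    out = subprocess.run(
        ["oc", "get", "clusteroperators", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)["items"]


def wait_for_operators(fetch: Callable[[], List[dict]],
                       timeout_s: float,
                       interval_s: float = 30.0,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Poll until every operator is healthy or the timeout expires.

    Returns [] on success, otherwise the operators still unhealthy at the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            operators = fetch()
            # An empty list is treated as "not ready" rather than as a pass.
            bad = unhealthy_operators(operators) if operators else ["<none returned>"]
        except Exception:
            # The API may be unreachable right after resume; keep polling.
            bad = ["<api unreachable>"]
        if not bad:
            return []
        if clock() >= deadline:
            return bad
        sleep(interval_s)
```

A periodic could call `wait_for_operators(fetch_with_oc, timeout_s=...)` and fail the job if the returned list is non-empty; the threshold itself is left open, as it is in the comment above.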
Note that this only vets the part of the resume flow around Cluster Operators.
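The pool settings and the scheduling arithmetic from the first two steps can be sketched as below. This is an illustration under stated assumptions: the manifest fragment contains only the fields named in the comment (a real ClusterPool needs more, such as platform and image set details, which are omitted), the pool name is hypothetical, and the 20-minute buffer is inferred from "40m provision time plus a bit of a buffer" adding up to the extra hour.

```python
import math


def clusterpool_fragment(name: str) -> dict:
    """Only the ClusterPool fields mentioned in the text; not a complete manifest."""
    return {
        "apiVersion": "hive.openshift.io/v1",
        "kind": "ClusterPool",
        "metadata": {"name": name},  # hypothetical pool name
        "spec": {
            "size": 1,           # one cluster is enough for a single periodic
            "runningCount": 0,   # keep the pooled cluster hibernating
            "labels": {
                "hive.openshift.io/resume-skips-cluster-operators": "true",
            },
        },
    }


def periodic_interval_hours(hibernation_h: float,
                            provision_min: float = 40,
                            buffer_min: float = 20,
                            day_aligned: bool = True) -> int:
    """Hours between periodic runs for a target hibernation time.

    The interval must cover the hibernation time plus the time the pool needs
    to provision a replacement, plus a buffer. With day_aligned=True the result
    is rounded up to the next 24h boundary, which is easier to express in cron.
    """
    needed_h = hibernation_h + (provision_min + buffer_min) / 60
    if day_aligned:
        return math.ceil(needed_h / 24) * 24
    return math.ceil(needed_h)


if __name__ == "__main__":
    print(periodic_interval_hours(24, day_aligned=False))  # 25
    print(periodic_interval_hours(24))                     # 48
```

For a 24h hibernation target this gives 25h unaligned, matching the example in the comment, and 48h when rounded to the next 24h boundary.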
If Hive exposes an endpoint for provision failure after some period of time, we could also start capturing that metric in our (ACM) CI environment and share the data. Once we have those data points, we can start capturing must-gather output and cluster logs on the clusters that enter this state. OCP teams (probably starting with API) will then have to investigate from there to find the root cause.
From efried.openshift: We're working on these metrics via https://issues.redhat.com/browse/HIVE-1630. Adding the above use case to that card could help move it along.
Record of Issues with Resume
We've documented two previous resume issues and hit a third that was not filed as a BZ:
- Early OpenShift 4.7 (FCs)
- Early OpenShift 4.8
- Early OpenShift 4.9 (not documented in a BZ, we opened this epic as a result of total resume failures)
These issues typically fit into one of the following behaviors:
- Ingress hangs or gets stuck in a bad state, requiring manual intervention
- OAuth gets stuck in a loop after resume requiring manual intervention with the kube:admin user
- Cluster becomes totally unreachable due to a variety of issues - these are difficult to capture/track because CI usually "ignores" or abandons these clusters and retries with another cluster
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>
is related to
- API-1603 Fallback (Protocol) for Emergency Certificate Rotation (Release Pending)
- OCPSTRAT-539 Enhance recovery procedure for full control plane failure (Release Pending)
- OCPSTRAT-102 Ability to Hibernate (suspend and resume) ROSA (Closed)
- OCPSTRAT-403 Automated backups of etcd (local destination) (Closed)
- OCPSTRAT-543 Shutdown/Resume of managed OSD/ROSA clusters (Closed)
- OCPSTRAT-529 Improve disaster recovery test coverage for etcd (Closed)
relates to
- OCPBUGS-30741 kube-scheduled certificates not correctly rotated after restart of cluster powered of for 2 months (New)
links to
1. Docs Tracker | Closed | Stefan Schimanski (Inactive)
2. QE Tracker | Closed | Ke Wang
3. TE Tracker | Closed | Eric Rich