Project: OpenShift API Server
Issue: API-1376

OpenShift 4.X supports an official process to shut down, restart, and resume an OpenShift cluster from a powered-off state. This function should be continuously validated, supported, and guaranteed for consumers for disaster-recovery and lifecycle use cases.


    • Continuously Tested and Guaranteed Graceful Shutdown/Restart of OpenShift 4.X Clusters

      Epic Goal

      • Guarantee, through continuous testing and validation, the officially documented OpenShift 4.X "Graceful Restart" process, which OpenShift Hive and Red Hat Advanced Cluster Management leverage as cluster "Hibernation". Historically, this process takes far less time than a full provision, preferably under 10 minutes.

      Why is this important?

      • First, this functionality is documented as part of Disaster Recovery scenarios and needs to be continuously validated to prevent regression.
      • Second, this functionality ships, as documented, as part of OpenShift Hive and Red Hat Advanced Cluster Management for Kubernetes under the name "Hibernation". It allows users to power off their provisioned clusters to reduce cost, and to create pools of hibernating clusters that can be resumed in far less time than a fresh provision (typically under 10 minutes).
      • OpenShift Dedicated and ROSA (both of which also use Hive) want to enable a "cluster hibernation" feature for development-type clusters (power down on the weekend) that relies on graceful restart. Today the restart is too error-prone to enable the feature we've built.


      Scenarios

      1. The customer has a cluster that needs to be powered off and back on (typically due to a migration from one AWS physical host to another, for example), with full recovery on the other end of the restart.
      2. A Hive user or RHACM customer uses the UI or API to hibernate a cluster, or creates clusters in a hibernated clusterpool, and wishes to resume them rapidly, either to continue using the cluster or to draw a cluster from a pool for CI or other purposes.
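
      For context, Hive drives hibernation through the ClusterDeployment API's spec.powerState field. A rough sketch of the relevant fragment (the surrounding ClusterDeployment, its name, and namespace are placeholders; this assumes a Hive version that supports powerState):

```yaml
# Fragment of a hive.openshift.io/v1 ClusterDeployment spec (illustrative only)
spec:
  powerState: Hibernating   # set to "Running" to resume the cluster
```

      Applying such a change (e.g. via oc patch with a merge patch) is what the Hive/RHACM UI and API do under the covers when a user hibernates or resumes a cluster.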

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated


      Previous Work (Optional):

      1. https://issues.redhat.com/browse/API-931

      Open questions:

      1. We have had frequent problems resuming clusters that have been hibernating for more than 24h. How can we run CI on this condition?

      From efried.openshift:
      In terms of developing a periodic test that runs in the CI env, as I suggested above I think that looks like:

      • Set up a clusterpool in CI that's dedicated to this test only (so we have controlled/predictable hibernation times). size=1 ought to work if we're starting off with a single periodic. Include "hive.openshift.io/resume-skips-cluster-operators": "true" in the clusterpool's .spec.labels. Make sure runningCount=0.
      • Set up a periodic to run every X hours, where X exceeds the hibernation time you want to vet by a comfortable margin that accounts for the time the pool takes to provision a replacement. E.g., to test against a 24h hibernation, you would want the periodic to run every ~25h (or more); the extra hour covers the ~40m provision time plus a bit of buffer. (Realistically you would need to go to the next 24h boundary in this case, because it's hard to tell cron "every 25h".)
      • The content of the periodic should poll ClusterOperators (example) and pass if they become healthy within a predefined time threshold; otherwise fail.
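
      The pool described in the first bullet might look roughly like the following; only the size, runningCount, and labels fields come from the suggestion above, while the names, baseDomain, imageSetRef, and platform values are placeholders:

```yaml
apiVersion: hive.openshift.io/v1
kind: ClusterPool
metadata:
  name: hibernation-ci-pool        # hypothetical name
  namespace: hibernation-ci        # hypothetical namespace
spec:
  size: 1                          # one cluster is enough for a single periodic
  runningCount: 0                  # keep pool clusters hibernated between runs
  labels:
    "hive.openshift.io/resume-skips-cluster-operators": "true"
  baseDomain: ci.example.com       # placeholder
  imageSetRef:
    name: openshift-v4-imageset    # placeholder ClusterImageSet
  platform:
    aws:
      credentialsSecretRef:
        name: aws-creds            # placeholder credentials secret
      region: us-east-1
```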

      Note that this only vets the part of the resume flow around Cluster Operators.
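
      The health check in the last bullet above can be sketched as a small poll loop. Assuming the periodic runs with a kubeconfig for the resumed cluster, it could read ClusterOperator conditions via "oc get clusteroperators -o json"; the helper names and the threshold value below are hypothetical:

```python
import json
import subprocess
import time


def operators_healthy(clusteroperators: dict) -> bool:
    """Return True when every ClusterOperator is Available and neither
    Degraded nor Progressing (i.e. the cluster has settled after resume)."""
    for co in clusteroperators.get("items", []):
        conditions = {c["type"]: c["status"]
                      for c in co.get("status", {}).get("conditions", [])}
        if conditions.get("Available") != "True":
            return False
        if conditions.get("Degraded") == "True":
            return False
        if conditions.get("Progressing") == "True":
            return False
    return True


def wait_for_healthy(timeout_seconds: int = 1800, interval: int = 30) -> bool:
    """Poll ClusterOperators until they settle or the threshold expires."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        result = subprocess.run(
            ["oc", "get", "clusteroperators", "-o", "json"],
            capture_output=True, text=True)
        if result.returncode == 0 and operators_healthy(json.loads(result.stdout)):
            return True   # periodic passes
        time.sleep(interval)
    return False          # periodic fails: operators never settled in time
```

      The periodic would exit nonzero when wait_for_healthy returns False, which is what surfaces the regression in CI.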

      From Gurney.Buchanan@ibm.com:

      If Hive exposes an endpoint for provision failure after some period of time, we could also start capturing that metric in our (ACM) CI environment and share that data. Once we have those data points, we can start capturing must-gather and cluster logs on the clusters that enter this state; OCP teams (probably starting with API) will then have to investigate and RCA from there to find the root cause!

       From efried.openshift: We're working on these metrics via https://issues.redhat.com/browse/HIVE-1630. Adding the above use case to that card could help move it along.

      Slack Thread Link

      Record of Issues with Resume

      We've documented two previous resume issues from:

      These issues typically fit into one of the following behaviors:

      • Ingress hangs/gets stuck in a bad state, requiring manual intervention
      • OAuth gets stuck in a loop after resume, requiring manual intervention with the kube:admin user
      • Cluster becomes totally unreachable due to a variety of issues - these are difficult to capture/track because CI usually "ignores" or abandons these clusters and retries with another cluster

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>
