- Feature
- Resolution: Obsolete
- Critical
- None
- openshift-4.12, openshift-4.11, openshift-4.13, openshift-4.14
- Strategic Portfolio Work
- False
- False
- OCPSTRAT-12 (OUTCOME STUB) Cloud platform activation/retention for Managed OpenShift (ROSA/ARO/OSD non-Hypershift enhancements)
- 0% To Do, 0% In Progress, 100% Done
- 0
Feature Overview (aka. Goal Summary)
Support XCMBU-125 (Tech Preview: Enable Shutdown/Resume of OSD/ROSA clusters), whereby the tenant administrator or owner of an OSD/ROSA cluster can shut down and resume the cluster and its underlying cloud instances, either for cost savings or when the cluster does not need to run.
Goals (aka. expected user outcomes)
As the tenant administrator or owner of an OSD/ROSA cluster, I want to be able to shut down and resume the cluster and its underlying cloud VM instances so that I can save on cloud costs when the cluster does not need to be running.
Requirements (aka. Acceptance Criteria):
1. An org admin or cluster owner can shut down a running OSD/ROSA cluster using OCM.
2. An org admin or cluster owner can shut down a running OSD cluster using api.openshift.com.
3. An org admin or cluster owner can restart a shut-down OSD/ROSA cluster using OCM.
4. An org admin or cluster owner can restart a shut-down OSD cluster using api.openshift.com.
5. An org admin or cluster owner can hibernate/resume a ROSA cluster using ROSA CLI.
6. A shutdown cluster shows as such in the OCM cluster list and details.
7. Shutdown validates that the cluster is not on OCP 4.4.
8. The age of the cluster being shut down must be more than 24 hours and less than 9 years.
9. Hibernation must not be allowed if the previous resume of the cluster happened less than 2 hours ago.
10. The maximum period a cluster can stay shut down is 60 days. When the cluster is shut down, an info response and a log entry must be added to highlight this limit. Three warnings (10 days, 5 days and 3 days) must be issued before the 60 days since shutdown elapse.
11. Shutdown must not be allowed if the current OCP version reaches EOL within 60 days, the maximum period allowed for hibernation.
12. Cluster shutdown must be blocked if the MachineConfigPools are in an updating state.
13. If possible, a cluster in the process of shutting down or restarting shows as such in the OCM cluster list and details.
14. OCM/CLI must show the warning 'the shutdown can leave the cluster irreparable and must not be done on production clusters or when no backup of data/workloads is available' and obtain user confirmation.
15. The cluster's ClusterSync resource must be deleted on resume to force reconciliation of day-2 configuration and reduce the Hive reconciliation time (from up to 2 hours down to 45 minutes).
16. A cluster history log entry is added when a cluster shutdown or restart is initiated.
17. A cluster history log entry is added when a cluster shutdown or restart is completed.
18. All open alerts in PagerDuty for the cluster should be resolved.
19. DMS alerts (cluster gone missing) should be paused for the duration of the hibernation.
20. Mark the cluster in Limited Support until:
    1. 45 minutes pass since start
    2. The cluster runs a supported OCP version
    3. No ClusterOperator is in a degraded state
    4. No managed Day-2 configurations fail to apply to the cluster
21. A KCS article exists that clearly outlines the terms, limitations, and support repercussions of shutdown/resume, and it is linked from the CLI, OCM UI, logs, etc.
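The eligibility rules in requirements 7, 8, 9, 11 and 12, and the warning schedule in requirement 10, can be sketched as a pre-flight check. This is a minimal illustration only, not the OCM or Hive implementation: `ClusterState`, `hibernation_blockers` and `warning_times` are hypothetical names, "9 years" is approximated as 9 x 365 days, and the three warnings are assumed to fire 10, 5 and 3 days before the 60-day limit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

MIN_AGE = timedelta(hours=24)
MAX_AGE = timedelta(days=9 * 365)        # assumption: "9 years" as 365-day years
MIN_SINCE_RESUME = timedelta(hours=2)
MAX_HIBERNATION = timedelta(days=60)
WARNING_DAYS_BEFORE_LIMIT = (10, 5, 3)   # assumption: days before the 60-day limit


@dataclass
class ClusterState:
    """Hypothetical snapshot of the fields the checks need."""
    created_at: datetime
    ocp_version: str                     # e.g. "4.12.3"
    version_eol: datetime
    last_resume_at: Optional[datetime]   # None if the cluster was never resumed
    mcp_updating: bool                   # True if any MachineConfigPool is updating


def hibernation_blockers(c: ClusterState, now: datetime) -> List[str]:
    """Return the reasons shutdown must be refused (empty list means allowed)."""
    blockers = []
    if c.ocp_version.split(".")[:2] == ["4", "4"]:
        blockers.append("cluster is on OCP 4.4")
    age = now - c.created_at
    if not (MIN_AGE < age < MAX_AGE):
        blockers.append("cluster age must be more than 24 hours and less than 9 years")
    if c.last_resume_at is not None and now - c.last_resume_at < MIN_SINCE_RESUME:
        blockers.append("previous resume happened less than 2 hours ago")
    if c.version_eol <= now + MAX_HIBERNATION:
        blockers.append("current OCP version reaches EOL within 60 days")
    if c.mcp_updating:
        blockers.append("MachineConfigPools are updating")
    return blockers


def warning_times(shutdown_at: datetime) -> List[datetime]:
    """Times at which the three pre-expiry warnings would be issued."""
    deadline = shutdown_at + MAX_HIBERNATION
    return [deadline - timedelta(days=d) for d in WARNING_DAYS_BEFORE_LIMIT]
```

Returning every blocker rather than stopping at the first lets OCM or the CLI report all failing conditions in a single response.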
Use Cases (Optional):
In Scope Use Scenarios
- Test/Dev clusters that can be stopped overnight or over the weekend (time-bound) - For Tech Preview
Out of Scope Use Scenarios
- Passive DR Prod clusters that can be stopped until a planned failover is needed - Post GA
- Migration of clusters requiring pre-creation (e.g., OCP 3 to OCP 4, cluster network changes such as adding IPv6 that are not supported in-place)
- Sandbox cluster(s) where multiple internal teams can try/test operators, deployments, etc. only when needed
Questions to Answer (Optional):
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Out of Scope
High-level list of items that are out of scope. Initial completion during Refinement status.
Background
See XCMBU-125 Tech Preview: Enable Shutdown/Resume of OSD/ROSA clusters
TP: Hibernate Product Reqs v2.0
Slack: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1684184829681999
Customer Considerations
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Documentation Considerations
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Interoperability Considerations
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
- is related to
- HIVE-2226 [Spike] Hibernation: Hack ClusterOperator statuses to detect staleness
- Closed
- OCPSTRAT-102 Ability to Hibernate (suspend and resume) ROSA
- Closed
- relates to
- API-1376 OpenShift 4.X supports an official process to shut down, restart, and resume an OpenShift cluster from a powered off state, this function should be continuously validated, supported, and guaranteed for consumers for DR and lifecycle use-cases
- New