OpenShift Container Platform (OCP) Strategy / OCPSTRAT-543

Shutdown/Resume of managed OSD/ROSA clusters

Details

    • OCPSTRAT-12(OUTCOME STUB) Cloud platform activation/retention for Managed OpenShift (ROSA/ARO/OSD non-Hypershift enhancements)

    Description

      Feature Overview (aka. Goal Summary)  

      Support enabling XCMBU-125 Tech Preview: Enable Shutdown/Resume of OSD/ROSA clusters. The tenant administrator or owner of an OSD/ROSA cluster wants to shut down and resume the cluster and its underlying cloud instances, either to save costs or because the cluster does not need to run.

      Goals (aka. expected user outcomes)

      As the OSD/ROSA tenant administrator or owner of an OSD/ROSA cluster, I want to be able to shut down and resume it and the underlying cloud VM instances so that I can save money on cloud costs when the cluster does not need to be running.

      Requirements (aka. Acceptance Criteria):

      1. An org admin or cluster owner can shut down a running OSD/ROSA cluster using OCM.
      2. An org admin or cluster owner can shut down a running OSD cluster using api.openshift.com.
      3. An org admin or cluster owner can restart a shutdown OSD/ROSA cluster using OCM.
      4. An org admin or cluster owner can restart a shutdown OSD cluster using api.openshift.com.
      5. An org admin or cluster owner can hibernate/resume a ROSA cluster using ROSA CLI. 
      6. A shutdown cluster shows as such in the OCM cluster list and details.
      7. Shutdown validates that the cluster is not on OCP 4.4.
      8. The cluster being shut down must be more than 24 hours and less than 9 years old.
      9. Hibernation must not be allowed if the previous resume of the cluster happened less than 2 hours ago.
      10. The maximum period the cluster can be shut down is 60 days. At the time of shutting down the cluster, an info response and a log entry must be added to highlight this. Three warnings (10 days, 5 days and 3 days) must be issued before 60 days elapse since shutting down.
      11. Shutdown must not be allowed if the current OCP version will reach EOL within 60 days, the maximum period allowed for hibernation.
      12. Cluster shutdown must be blocked if the MachineConfigPools are in updating state.
      13. If possible, a cluster in the process of shutting down or restarting shows as such in the OCM cluster list and details.
      14. OCM/CLI must display a warning that 'the shutdown can leave the cluster irreparable and must not be done on production clusters or when no backup of data/workloads is available' and must receive user confirmation.
      15. The cluster's ClusterSync resource must be deleted on resume to force reconciliation of day-2 configuration and reduce the Hive reconciliation time (from up to 2 hours to 45 minutes).
      16. A cluster history log entry is added when a cluster shutdown or restart is initiated.
      17. A cluster history log entry is added when a cluster shutdown or restart is completed.
      18. All open alerts in PagerDuty for the cluster should be resolved.
      19. DMS alerts (cluster gone missing) should be paused for the duration of the hibernation.
      20. Mark the cluster as in Limited Support until all of the following hold:
          1. 45 minutes have passed since start.
          2. The cluster runs a supported OCP version.
          3. No ClusterOperator is in a degraded state.
          4. No managed Day-2 configurations fail to apply to the cluster.
      21. A KCS article exists that clearly outlines the terms, limitations, and support repercussions of shutdown/resume, and it is linked in the CLI, OCM UI, logs, etc.
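
      The pre-shutdown checks (requirements 7 to 9, 11 and 12), the 60-day warning schedule (requirement 10) and the Limited Support exit conditions (requirement 20) could be sketched as below. This is only an illustration under stated assumptions, not the OCM implementation: ClusterState, shutdown_blockers, warning_times and limited_support_can_end are hypothetical names, "9 years" is approximated as 9 * 365 days, the three warnings are assumed to fire 10, 5 and 3 days before the 60-day limit, and "since start" is assumed to mean since the resume started.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

MAX_HIBERNATION = timedelta(days=60)       # requirement 10
MIN_CLUSTER_AGE = timedelta(hours=24)      # requirement 8
MAX_CLUSTER_AGE = timedelta(days=9 * 365)  # requirement 8 (assumed approximation of 9 years)
MIN_SINCE_RESUME = timedelta(hours=2)      # requirement 9
WARNING_DAYS_BEFORE_LIMIT = (10, 5, 3)     # requirement 10 (assumed: days before the limit)
LIMITED_SUPPORT_MIN = timedelta(minutes=45)  # requirement 20.1


@dataclass
class ClusterState:
    """Hypothetical snapshot of the facts the checks need; not an OCM type."""
    ocp_minor: str                       # e.g. "4.12"
    created_at: datetime
    last_resumed_at: Optional[datetime]  # None if the cluster was never resumed
    version_eol_at: datetime
    mcp_updating: bool                   # True if any MachineConfigPool is updating


def shutdown_blockers(c: ClusterState, now: datetime) -> List[str]:
    """Return the reasons a shutdown must be refused; an empty list means allowed."""
    blockers = []
    if c.ocp_minor == "4.4":
        blockers.append("cluster is on OCP 4.4")
    age = now - c.created_at
    if not (MIN_CLUSTER_AGE < age < MAX_CLUSTER_AGE):
        blockers.append("cluster age must be between 24 hours and 9 years")
    if c.last_resumed_at is not None and now - c.last_resumed_at < MIN_SINCE_RESUME:
        blockers.append("previous resume happened less than 2 hours ago")
    if c.version_eol_at - now <= MAX_HIBERNATION:
        blockers.append("OCP version reaches EOL within 60 days")
    if c.mcp_updating:
        blockers.append("a MachineConfigPool is updating")
    return blockers


def warning_times(shutdown_at: datetime) -> List[datetime]:
    """Times at which the three warnings would be issued before the 60-day limit."""
    deadline = shutdown_at + MAX_HIBERNATION
    return [deadline - timedelta(days=d) for d in WARNING_DAYS_BEFORE_LIMIT]


def limited_support_can_end(resume_started_at: datetime, now: datetime,
                            version_supported: bool, degraded_operators: int,
                            failed_day2_configs: int) -> bool:
    """True once every exit condition of requirement 20 holds."""
    return (now - resume_started_at >= LIMITED_SUPPORT_MIN
            and version_supported
            and degraded_operators == 0
            and failed_day2_configs == 0)
```

      Returning a list of blockers rather than a single boolean lets OCM or the CLI report every failed check at once instead of one per attempt.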

      Use Cases (Optional):

      In Scope Use Scenarios

      • Test/Dev clusters that can be stopped overnight or weekend (time-bound) - For Tech Preview

      Out of Scope Use Scenarios

      • Passive DR Prod clusters that can be stopped until a planned failover is needed - Post GA
      • Migration of clusters requiring pre-creation (e.g., OCP 3 to OCP 4, or cluster network changes such as adding IPv6 that are not supported in place)
      • Sandbox cluster(s) where multiple internal teams can try/test operators, deployments, etc. only when needed

      Questions to Answer (Optional):

      Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

      Out of Scope

      High-level list of items that are out of scope.  Initial completion during Refinement status.

      Background

      See XCMBU-125 Tech Preview: Enable Shutdown/Resume of OSD/ROSA clusters

      TP: Hibernate Product Reqs v2.0

      Slack: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1684184829681999

      Customer Considerations

      Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

      Documentation Considerations

      Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

      Interoperability Considerations

      Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

            People

              julim Ju Lim
              julim Ju Lim
              Balachandran Chandrasekaran, Eric Fried, Mike Worthington, Scott Dodson
              Eric Fried Eric Fried
              Eric Fried Eric Fried
              Scott Dodson Scott Dodson
              Ju Lim Ju Lim
              Eric Rich Eric Rich