Uploaded image for project: 'OpenShift Container Platform (OCP) Strategy'
  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-975

Support assisted z-rollback for OCP EUS versions from 4.16+

XMLWordPrintable

    • BU Product Work
    • False
    • False
    • 0% To Do, 0% In Progress, 100% Done
    • Undefined
    • 0
    • Program Call
    • Understanding what this feature is limited to/capable of and what edge case can happen here are why this needs TE.

      This feature is now re-opened because we want to run z-rollback CI. This feature doesn't block the release of 4.17.This is not going to be exposed as a customer-facing feature and will not be documented within OpenShift documentation.  This is strictly going to be covered as a RH Support guided solution with KCS article providing guidance. A public facing KCS will basically point to contacting Support for help on Z-stream rollback, and Y-stream rollback is not supported.

      NOTE:
      Previously this was closed as "won't do" because didn't have a plan to support y-stream and z-stream rollbacks is standalone openshift.
      For Single node openshift please check TELCOSTRAT-160 . "won't do"  decisions was after further discussion with leadership.
      The e2e tests https://docs.google.com/spreadsheets/d/1mr633YgQItJ0XhbiFkeSRhdLlk6m9vzk1YSKQPHgSvw/edit?gid=0#gid=0 We have identified a few bugs that need to be resolved before the General Availability (GA) release. Ideally, these should be addressed in the final month before GA when all features are development complete. However, asking component teams to commit to fixing critical rollback bugs during this time could potentially delay the GA date.

      ------

       

      Feature Overview (aka. Goal Summary)  

      An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

      Red Hat Support assisted z-stream rollback from 4.16+

      Goals (aka. expected user outcomes)

      The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

      Red Hat Support may, at their discretion, assist customers with z-stream rollback once it’s determined to be the best option for restoring a cluster to the desired state whenever a z-stream rollback compromises cluster functionality.

      Engineering will take a “no regressions, no promises” approach, ensuring there are no major regressions between z-streams, but not testing specific combinations or addressing case-specific bugs.

      Requirements (aka. Acceptance Criteria):

      A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

      • Public Documentation (or KCS?) that explains why we do not advise unassisted z-stream rollback and what to do when a cluster experiences loss of functionality associated with a z-stream upgrade.
      • Internal KCS article that provides a comprehensive plan for troubleshooting and resolving issues introduced after applying a z-stream update, up to and including complete z-stream rollback.
      • Should include alternatives such as limited component rollback (single operator, RHCOS, etc) and workaround options
      • Should include incident response and escalation procedures for all issues incurred during application of a z-stream update so that even if rollback is performed we’re tracking resolution of defects with highest priority
      • Foolproof command to initiate z-stream rollback with Support’s approval, aka a hidden command that ensures we don’t typo the pull spec or initiate A->B->C version changes, only A->B->A
      • Test plan and jobs to ensure that we have high confidence in ability to rollback a z-stream along happy paths
      • Need not be tested on all platforms and configurations, likely metal or vSphere and one foolproof platform like AWS
      • Test should not monitor for disruption since it’s assumed disruption is tolerable during an emergency rollback provided we achieve availability at the end of the operation
      • Engineering agrees to fix bugs which inhibit rollback completion before the current master branch release ships, aka they’ll be filed as blockers for the current master branch release. This means bugs found after 4.N branches may not be fixed until the next release without discussion.

       

      Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

      Deployment considerations List applicable specific needs (N/A = not applicable)
      Self-managed all
      Multi node, Compact (three node) all
      Connected and Restricted Network all
      Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) all
      Release payload only all
      Starting with 4.16, including all future releases all
         
         

      While this feature applies to all deployments we will only run a single platform canary test on a high success rate platform, such as AWS. Any specific ecosystems which require more focused testing should bring their own testing.

      Use Cases (Optional):

      Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

      As an admin who has determined that a z-stream update has compromised cluster functionality I have clear documentation that explains that unassisted rollback is not supported and that I should consult with Red Hat Support on the best path forward.

      As a support engineer I have a clear plan for responding to problems which occur during or after a z-stream upgrade, including the process for rolling back specific components, applying workarounds, or rolling the entire cluster back to the previously running z-stream version.

      Questions to Answer (Optional):

      Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

      Should we allow rollbacks whenever an upgrade doesn’t complete? No, not without fully understanding the root cause. If it’s simply a situation where workers are in process of updating but stalled, that should never yield a rollback without credible evidence that rollback will fix that.

      Similar to our “foolproof command” to initiate rollback to previous z-stream should we also craft a foolproof command to override select operators to previous z-stream versions? Part of the goal of the foolproof command is to avoid potential for moving to an unintended version, the same risk may apply at single operator level though impact would be smaller it could still be catastrophic.

      Out of Scope

      High-level list of items that are out of scope.  Initial completion during Refinement status.

      Non-HA clusters, Hosted Control Planes – those may be handled via separately scoped features

      Background

      Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

      Occasionally clusters either upgrade successfully and encounter issues after the upgrade or may run into problems during the upgrade. Many customers assume that a rollback will fix their concerns but without understanding the root cause we cannot assume that’s the case. Therefore, we recommend anyone who has encountered a negative outcome associated with a z-stream upgrade contact support for guidance.

      Customer Considerations

      Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

      It’s expected that customers should have adequate testing and rollout procedures to protect against most regressions, i.e. roll out a z-stream update in pre-production environments where it can be adequately tested prior to updating production environments.

      Documentation Considerations

      Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

      This is largely a documentation effort, i.e. we should create either a KCS article or new documentation section which describes how customers should respond to loss of functionality during or after an upgrade.
      KCS Solution : https://access.redhat.com/solutions/7083335 

      Interoperability Considerations

      Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

      Given we test as many upgrade configurations as possible and for whatever reason the upgrade still encounters problems, we should not strive to comprehensively test all configurations for rollback success. We will only test a limited set of platforms and configurations necessary to ensure that we believe the platform is generally able to roll back a z-stream update.

              rh-ee-smodeel Subin M
              tkatarki@redhat.com Tushar Katarki
              Ali Mobrem, Lalatendu Mohanty, Le Yang, Scott Berens, Vikas Laad, W. Trevor King
              Lalatendu Mohanty Lalatendu Mohanty
              Yang Yang Yang Yang
              Olivia Brown Olivia Brown
              Scott Dodson Scott Dodson (Inactive)
              Lalatendu Mohanty Lalatendu Mohanty
              Subin M Subin M
              Eric Rich Eric Rich
              Votes:
              0 Vote for this issue
              Watchers:
              27 Start watching this issue

                Created:
                Updated:
                Resolved: