
      Goal

      Note: This is an internal improvement. There are no user-facing deliverables.

      There are a few areas to cover for Disaster Recovery (DR):

      • Finish rewriting the existing DR Bash scripts in Go
      • Add guardrails to the code so that the customer cannot cause additional damage to the cluster during disaster recovery.
      • Clean up technical debt in the MCO repo and the installer.
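
      As an illustration of the guardrail item above, here is a minimal Go sketch of a pre-flight check that refuses a destructive restore while the cluster still has quorum. It relies only on the etcd majority rule (a cluster of n voting members needs floor(n/2) + 1 healthy members). The names quorumSize, checkRestoreAllowed and ErrQuorumIntact are hypothetical and are not taken from the Cluster Etcd Operator; the member counts are assumed to be supplied by the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrQuorumIntact is a hypothetical sentinel error for this sketch.
var ErrQuorumIntact = errors.New("quorum is intact; refusing destructive restore")

// quorumSize returns the number of healthy voting members an etcd cluster
// of the given size needs in order to keep quorum: floor(n/2) + 1.
func quorumSize(members int) int {
	return members/2 + 1
}

// checkRestoreAllowed is a hypothetical guardrail. It rejects a restore when
// the cluster still has quorum, unless the admin explicitly forces it.
func checkRestoreAllowed(total, healthy int, force bool) error {
	if total <= 0 || healthy < 0 || healthy > total {
		return fmt.Errorf("invalid member counts: total=%d healthy=%d", total, healthy)
	}
	if healthy >= quorumSize(total) && !force {
		return ErrQuorumIntact
	}
	return nil
}

func main() {
	// Walk a three-member cluster through progressively worse states.
	for _, healthy := range []int{3, 2, 1} {
		err := checkRestoreAllowed(3, healthy, false)
		fmt.Printf("healthy=%d/3 quorum=%d restore allowed=%t\n",
			healthy, quorumSize(3), err == nil)
	}
}
```

      In this sketch the restore is only allowed once fewer than two of three members are healthy; the force flag stands in for whatever explicit override a real tool would offer.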

      Why is this important?

      An event that causes a cluster to lose quorum, for example, is a very stressful situation for the admin. Well-designed tools that offer a clean recovery path make that situation much easier to handle.

      This also helps us avoid customer situations like this one:
      https://docs.google.com/document/d/1ULGQARWdxjujWpSyncY0pKrUG9OcT0PlhEmYVwrPEAE/edit?ts=5eb18ea3

      Scenarios

      1. A customer has a cluster event that causes loss of quorum.

            [OCPSTRAT-529] Improve disaster recovery test coverage for etcd

            Matthew Werner added a comment - - edited

            wcabanba@redhat.com If this has no user-facing deliverable, I will remove the doc-req label to avoid confusion with our closed doc epic. 

            Eric Rich added a comment -

            rhn-support-mdineen  this feature isn't targeted (no FixVersion set) for a release; is your query to check what is / isn't blocked (over collecting items)? 

            Haseeb Tariq added a comment -

            As I've just moved ETCD-81 out of this feature and over to OCPBU-252, my recommendation would be to close this feature out in favor of the following features, which are more tightly scoped in their goals for improving disaster recovery. Alternatively, remove the fix version for this one, since it overlaps with the others.

            • OCPBU-252 Automated backups of etcd (local destination)
            • OCPBU-254 Automated backups of etcd (external targets)
            • OCPBU-255 Enhance recovery procedure for full control plane failure
            • OCPBU-256 Automated restore of etcd from external target (investigate)

            /cc dwest@redhat.com

            Tushar Katarki added a comment -

            Assigning to wcabanba@redhat.com as the PM for this.

            Nicole Wilker added a comment -

            Per email:

            The following changes will be made to the impacted issues:

            Feature: These issues will be moved over to the OCPPLAN Jira project as Features.

            Feature Request: By default, these issues will be moved to the OCPPLAN Jira project as Features.

            QE Task: These issues will be converted into Task issue types and will remain in your team project.

            Wallace Lewis added a comment -

            Removing DR from the 4.10 plan at this time.

            Anandnatraj Chandramohan (Inactive) added a comment - - edited

            Move current DR scripts from the MCO Repo into the Cluster Etcd Operator Repo – Completed

            Finish rewriting the current DR Bash scripts in Go – Pending

            Document or script on how to handle the backup process of encrypted Etcd stores in a way that maintains the encryption key separately and securely – Completed

            Automated backups tracked via https://issues.redhat.com/browse/ETCD-81

            Effort Estimate: S

            Timothy Rees added a comment -

            We should be able to perform this through the OpenShift primitives and not require anyone to ssh to nodes as they would for management of systems outside of a cluster context.

            Mike Barrett added a comment - - edited

            We are working towards being able to recover a member automatically. We need CI/CD tests that cover things breaking and recovering.

              wcabanba@redhat.com William Caban
              blomquisg Greg Blomquist
              Dean West
              Ge Liu
              Matthew Werner
              David Eads
              Eric Rich
              Votes: 12
              Watchers: 44