  1. OpenStack Strategy
  2. RHOSSTRAT-1013

RHOSO Control Plane Database Backup and Restore Parity with RHOSP 17.1


    • Type: Feature
    • Resolution: Unresolved
    • RHOSSTRAT-1012 RHOSO Backup & Restore Parity with RHOSP 17.1
    • 0% To Do, 100% In Progress, 0% Done

      Feature Overview (mandatory - Complete while in New status)

      This feature introduces a reliable backup & restore procedure for the RHOSO control plane Galera cluster. The solution will leverage OpenShift-native capabilities (ideally via Custom Resources) to provide a unified and consistent operational experience.

      The feature is necessary to maintain operational parity with RHOSP 17.1, fulfilling the fundamental need of operators and site reliability engineers (SREs) to safeguard the critical state of the OpenStack control plane. Without this, a catastrophic database failure could lead to unrecoverable data loss and extended downtime, which is unacceptable for a production-grade cloud platform. This matters to the user because it ensures business continuity and data integrity for their OpenStack cloud deployed on OpenShift.

      Goals (mandatory - Complete while in New status)
      To implement a robust, automated, and OpenShift-integrated mechanism for backing up and restoring the RHOSO control plane's Galera database, ensuring an equivalent level of data protection and procedural confidence as provided in RHOSP 17.1.

      Requirements (mandatory - Complete while in Refinement status):
      A list of specific needs, capabilities, or objectives that the Feature must deliver. Some requirements will be flagged as MVP. If an MVP requirement shifts, the Feature shifts. If a non-MVP requirement slips, it does not shift the Feature.

      Requirement | Notes | isMVP?
      R1: Automated Backup Trigger | The solution must provide the ability to automatically trigger backups of the Galera database instance(s) within the control plane. | Yes
      R2: OpenShift-Native Integration | Backup capability must be integrated with OpenShift/K8s primitives, ideally leveraging Custom Resources (CRs) for configuration and orchestration instead of external orchestration tools. | Yes
      R3: Non-Disruptive Backup | The backup process must be implemented in a manner that ensures minimal to no disruption to the availability and performance of the running Galera cluster. | Yes
      R4: Restore Procedure Documentation | A complete, step-by-step procedure for restoring the Galera cluster from a previously taken backup must be provided. | Yes
      R5: Testable Design | The backup and restore design must be structured to allow for thorough validation via both unit tests and end-to-end (e2e) deployment/integration tests. | Yes
      R6: Automated Restore Testing | The provided restore procedure must be exercised and validated automatically (i.e., as part of the CI/CD pipeline/e2e tests) to ensure its continued reliability. | Yes

      Done - Acceptance Criteria (mandatory - Complete while in Refinement status):

      The Feature is considered accepted when the following criteria are met:

      1. A GaleraBackup Custom Resource (CR) (or similar OpenShift primitive) is implemented and can successfully trigger a backup of the RHOSO control plane Galera database.
      2. The automated backup process completes without causing any observed downtime or disruption to the availability of the Galera service endpoints.
      3. The backup design has passed comprehensive unit tests and its functionality is validated via a dedicated end-to-end (e2e) deployment test.
      4. A documented procedure for restoring the Galera database from a successful backup is available.
      5. The documented restore procedure has been successfully exercised automatically via an e2e test, demonstrating that the restored database is functional and consistent.
      6. The feature delivers the equivalent operational capability for Galera backup and restore that was present in RHOSP 17.1.
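
One way an e2e test could demonstrate that a restored database is "consistent" (criterion 5) is to compare per-table digests captured before the backup with digests computed after the restore. The sketch below is only an illustration under assumptions: the table names, the in-memory row snapshots, and both function names are hypothetical. In a real test the rows would come from SQL queries against the Galera service.

```python
import hashlib


def table_digest(rows):
    """Return an order-independent SHA-256 digest of a table's rows."""
    digest = hashlib.sha256()
    # Sort the textual form of each row so row order does not change the result.
    for line in sorted(repr(tuple(row)) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def compare_snapshots(before, after):
    """Return the sorted names of tables that are missing or differ."""
    mismatched = []
    for name in sorted(set(before) | set(after)):
        if name not in before or name not in after:
            # A table present on only one side counts as a mismatch.
            mismatched.append(name)
        elif table_digest(before[name]) != table_digest(after[name]):
            mismatched.append(name)
    return mismatched
```

An empty result means every table matched. A non-empty result names the tables the test should report as inconsistent.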

      Use Cases - i.e. User Experience & Workflow: (Initial completion while in Refinement status):

      Main Success Scenario: Automated Scheduled Backup

      1. Operator configures a GaleraBackup CR (or similar), specifying a schedule (e.g., daily at 02:00 UTC).
      2. The RHOSO Operator's controller sees the schedule and, at the specified time, orchestrates a non-disruptive snapshot/backup of the Galera cluster.
      3. The backup data is successfully stored in the configured location (e.g., PVC or S3 bucket).
      4. A GaleraBackup object (or equivalent status field) is updated with a Success status and the location/timestamp of the backup.
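
The scheduling and status steps above can be sketched as follows. This is a minimal illustration, not the CRD schema: the field names (`dailyAtUTC`, `destination`, `phase`, `completionTime`, `location`), the destination path, and the archive file naming are all assumptions made for the example.

```python
from datetime import datetime, time, timedelta, timezone

# Hypothetical spec; the real GaleraBackup CRD fields are not defined yet.
EXAMPLE_SPEC = {
    "kind": "GaleraBackup",
    "spec": {"dailyAtUTC": "02:00", "destination": "pvc/galera-backups"},
}


def next_run(spec, now):
    """Return the next scheduled backup time (UTC) strictly after `now`.

    `now` must be a timezone-aware datetime.
    """
    hour, minute = (int(part) for part in spec["spec"]["dailyAtUTC"].split(":"))
    now = now.astimezone(timezone.utc)
    candidate = datetime.combine(now.date(), time(hour, minute), tzinfo=timezone.utc)
    if candidate <= now:
        # Today's slot has already passed, so schedule for tomorrow.
        candidate += timedelta(days=1)
    return candidate


def success_status(spec, finished_at):
    """Build the status block recorded after a successful backup."""
    stamp = finished_at.strftime("%Y%m%dT%H%M%SZ")
    return {
        "phase": "Success",
        "completionTime": finished_at.isoformat(),
        "location": f"{spec['spec']['destination']}/{stamp}.sql.gz",
    }
```

A real controller would more likely accept a cron expression and rely on an OpenShift primitive such as a CronJob. The fixed daily time here only keeps the sketch dependency-free.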

      Disaster Recovery Scenario: Restore from Backup

      1. A catastrophic failure occurs, rendering the current Galera cluster unusable.
      2. Operator follows the provided Restore Procedure.
      3. The procedure involves provisioning a new Galera cluster (or re-initializing the existing one) and pointing it to the latest successful backup data.
      4. The new/re-initialized cluster starts, reads the backup data, and reaches a Ready state.
      5. OpenStack control plane services reconnect to the restored database and resume operation.
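
Step 3 depends on identifying the "latest successful backup". Assuming each backup is tracked as a record with a phase and a completion time (hypothetical field names, mirroring a possible status block), that selection could look like this:

```python
from datetime import datetime, timezone

# Hypothetical backup records; names and fields are illustrative only.
EXAMPLE_BACKUPS = [
    {"name": "b1", "phase": "Success",
     "completionTime": datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc)},
    {"name": "b2", "phase": "Success",
     "completionTime": datetime(2025, 1, 2, 2, 0, tzinfo=timezone.utc)},
    {"name": "b3", "phase": "Failed",
     "completionTime": datetime(2025, 1, 3, 2, 0, tzinfo=timezone.utc)},
]


def latest_successful(backups):
    """Return the most recent record whose phase is 'Success', or None."""
    successful = [b for b in backups if b.get("phase") == "Success"]
    # max() with default=None handles the case where no backup succeeded.
    return max(successful, key=lambda b: b["completionTime"], default=None)
```

With the example records, the failed `b3` is skipped even though it is the newest, and `b2` is selected.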

      Out of Scope (Initial completion while in Refinement status):

      • Backup and restore of any data outside of the RHOSO control plane's core Galera database (e.g., Ceph data, image data, application configuration).
      • Fine-grained, per-tenant database backup/restore.
      • Automated, self-healing restoration; the feature only requires a tested procedure for the operator to follow.
      • Support for database technologies other than the default Galera/MySQL used by RHOSO.

      Documentation Considerations (Initial completion while in Refinement status):

      • Operator Guide: Dedicated, clear documentation for the Galera Restore Procedure with required prerequisite checks (e.g., required free space, cluster status).
      • CRD Reference: Full reference documentation for the new GaleraBackup CRD, including all fields, schedules, and status indicators.
      • Troubleshooting: Common failure scenarios for both backup (e.g., storage full) and restore (e.g., data inconsistency) and mitigation steps.
      • Integration: Documentation linking this feature to the overall disaster recovery story for RHOSO.
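
The prerequisite checks mentioned for the Operator Guide (free space, cluster status) could be automated along these lines. This is a sketch under assumptions: the function name, its parameters, and the 2x free-space headroom are hypothetical. The `wsrep_*` names are standard Galera status variables, as returned by `SHOW STATUS LIKE 'wsrep_%'`, passed in here as a plain dict of strings.

```python
import shutil


def preflight(wsrep_status, backup_size_bytes, data_dir, expected_nodes, headroom=2.0):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if wsrep_status.get("wsrep_cluster_status") != "Primary":
        problems.append("cluster is not in the Primary component")
    if wsrep_status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node is not Synced")
    if int(wsrep_status.get("wsrep_cluster_size", 0)) < expected_nodes:
        problems.append("cluster has fewer nodes than expected")
    # Require free space of at least `headroom` times the backup size (assumed policy).
    free = shutil.disk_usage(data_dir).free
    if free < backup_size_bytes * headroom:
        problems.append("not enough free space in " + data_dir)
    return problems
```

Which checks apply depends on the scenario. After a catastrophic failure the cluster-status checks would fail by definition, so a restore procedure would likely rely on the free-space check alone.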

      Questions to Answer (Initial completion while in Refinement status):
      Include a list of refinement / architectural questions that may need to be answered before coding can begin.
      <your text here>

      Background and Strategic Fit (Initial completion while in Refinement status):
      Explained in the attached Outcome

      Customer Considerations (Initial completion while in Refinement status):
      Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.
      <your text here>

      Team Sign Off (Completion while in Planning status)

      • All required Epics (known at the time) are linked to this Feature
      • All required Stories, Tasks (known at the time) for the most immediate Epics have been created and estimated
      • Add: reviewer's name, team name
      • Acceptance means the Feature is “Ready”: it is well understood, its scope is clear, and the Acceptance Criteria (scope) are elaborated, well defined, and understood
      • Note: Only set FixVersion/s: on a Feature if the delivery team agrees they have the capacity and have committed that capability for that milestone
      Reviewed By | Team Name | Accepted | Notes

       

              Assignee: Unassigned
              Reporter: Pedro Navarro Perez (pnavarro@redhat.com)