Feature
Resolution: Unresolved
0% To Do, 100% In Progress, 0% Done
Feature Overview (mandatory - Complete while in New status)
This feature introduces a reliable backup & restore procedure for the RHOSO control plane Galera cluster. The solution will leverage OpenShift-native capabilities (ideally via Custom Resources) to provide a unified and consistent operational experience.
The feature is necessary to maintain operational parity with RHOSP 17.1, fulfilling the fundamental need of operators and site reliability engineers (SREs) to safeguard the critical state of the OpenStack control plane. Without this, a catastrophic database failure could lead to unrecoverable data loss and extended downtime, which is unacceptable for a production-grade cloud platform. This matters to the user because it ensures business continuity and data integrity for their OpenStack cloud deployed on OpenShift.
Goals (mandatory - Complete while in New status)
To implement a robust, automated, OpenShift-integrated mechanism for backing up and restoring the RHOSO control plane's Galera database, providing a level of data protection and procedural confidence equivalent to that of RHOSP 17.1.
Requirements (mandatory - Complete while in Refinement status):
A list of specific needs, capabilities, or objectives that the Feature must deliver to be considered complete. Some requirements are flagged as MVP. If an MVP requirement shifts, the Feature shifts. If a non-MVP requirement slips, the Feature does not shift.
Requirement | Notes | isMVP? |
R1: Automated Backup Trigger | The solution must provide the ability to automatically trigger backups of the Galera database instance(s) within the control plane. | Yes |
R2: OpenShift-Native Integration | Backup capability must be integrated with OpenShift/K8s primitives, ideally leveraging Custom Resources (CRs) for configuration and orchestration instead of external orchestration tools. | Yes |
R3: Non-Disruptive Backup | The backup process must be implemented in a manner that ensures minimal to no disruption to the availability and performance of the running Galera cluster. | Yes |
R4: Restore Procedure Documentation | A complete, step-by-step procedure for restoring the Galera cluster from a previously taken backup must be provided. | Yes |
R5: Testable Design | The backup and restore design must be structured to allow for thorough validation via both unit tests and end-to-end (e2e) deployment/integration tests. | Yes |
R6: Automated Restore Testing | The provided restore procedure must be exercised and validated automatically (i.e., as part of the CI/CD pipeline/e2e tests) to ensure its continued reliability. | Yes |
Done - Acceptance Criteria (mandatory - Complete while in Refinement status):
The Feature is considered accepted when the following criteria are met:
- A GaleraBackup Custom Resource (CR) (or similar OpenShift primitive) is implemented and can successfully trigger a backup of the RHOSO control plane Galera database.
- The automated backup process completes without causing any observed downtime or disruption to the availability of the Galera service endpoints.
- The backup design has passed comprehensive unit tests and its functionality is validated via a dedicated end-to-end (e2e) deployment test.
- A documented procedure for restoring the Galera database from a successful backup is available.
- The documented restore procedure has been successfully exercised automatically via an e2e test, demonstrating that the restored database is functional and consistent.
- The feature delivers the equivalent operational capability for Galera backup and restore that was present in RHOSP 17.1.
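One way an e2e test could check that a restored database is consistent is to compare per-table checksums captured before the backup with those read after the restore. The sketch below is only an illustration under that assumption: the function name and the input shape (a mapping of table name to checksum string) are hypothetical, and how the checksums are collected (for example with the `CHECKSUM TABLE` SQL statement) is left outside the sketch.

```python
from typing import Dict, List


def diff_checksums(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Compare table checksums taken before backup and after restore.

    Hypothetical helper: returns a sorted list of human-readable problems,
    or an empty list when both snapshots match exactly.
    """
    problems: List[str] = []
    # Walk the union of table names so that dropped and extra tables are both reported.
    for table in sorted(before.keys() | after.keys()):
        if table not in after:
            problems.append(f"{table}: missing after restore")
        elif table not in before:
            problems.append(f"{table}: unexpected after restore")
        elif before[table] != after[table]:
            problems.append(f"{table}: checksum mismatch")
    return problems
```

An empty result would let the test pass; any entry would fail it and name the affected table.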
Use Cases - i.e. User Experience & Workflow: (Initial completion while in Refinement status):
Main Success Scenario: Automated Scheduled Backup
- Operator configures a GaleraBackup CR (or similar), specifying a schedule (e.g., daily at 02:00 UTC).
- The RHOSO Operator's controller sees the schedule and, at the specified time, orchestrates a non-disruptive snapshot/backup of the Galera cluster.
- The backup data is successfully stored in the configured location (e.g., PVC or S3 bucket).
- A GaleraBackup object (or equivalent status field) is updated with a Success status and the location/timestamp of the backup.
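The scheduling and status steps above can be sketched in a few lines. This is a minimal illustration, not the operator's design: `BackupStatus`, its field names, and both helper functions are hypothetical, and the schedule is simplified to a single daily UTC time rather than a full cron expression.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone
from typing import Optional


@dataclass
class BackupStatus:
    """Hypothetical status fields; not taken from any real CRD schema."""
    phase: str = "Pending"
    location: Optional[str] = None
    completed_at: Optional[datetime] = None


def next_daily_run(now: datetime, at: time = time(2, 0)) -> datetime:
    """Return the next daily run time in UTC strictly after `now`.

    `now` is expected to be a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc)
    candidate = datetime.combine(now.date(), at, tzinfo=timezone.utc)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now), so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


def record_success(status: BackupStatus, location: str, finished: datetime) -> BackupStatus:
    """Mark a backup as successful and record where and when it was stored."""
    status.phase = "Success"
    status.location = location
    status.completed_at = finished
    return status
```

A controller following this pattern would requeue itself until `next_daily_run`, run the backup, and then call something like `record_success` with the storage location it wrote to.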
Disaster Recovery Scenario: Restore from Backup
- A catastrophic failure occurs, rendering the current Galera cluster unusable.
- Operator follows the provided Restore Procedure.
- The procedure involves provisioning a new Galera cluster (or re-initializing the existing one) and pointing it to the latest successful backup data.
- The new/re-initialized cluster starts, reads the backup data, and reaches a Ready state.
- OpenStack control plane services reconnect to the restored database and resume operation.
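The restore scenario depends on identifying "the latest successful backup". A minimal sketch of that selection step follows; the `BackupRecord` type and its fields are assumptions made for illustration and do not describe an existing API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class BackupRecord:
    """Hypothetical view of one backup object's status."""
    name: str
    phase: str
    completed_at: Optional[datetime]
    location: Optional[str]


def latest_successful(records: Iterable[BackupRecord]) -> Optional[BackupRecord]:
    """Return the most recently completed successful backup, or None if there is none."""
    candidates = [
        r for r in records
        # Skip failed or in-progress backups and any record lacking a timestamp or location.
        if r.phase == "Success" and r.completed_at is not None and r.location
    ]
    return max(candidates, key=lambda r: r.completed_at, default=None)
```

Returning `None` rather than raising lets the restore procedure stop early with a clear message when no usable backup exists.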
Out of Scope (Initial completion while in Refinement status):
- Backup and restore of any data outside of the RHOSO control plane's core Galera database (e.g., Ceph data, image data, application configuration).
- Fine-grained, per-tenant database backup/restore.
- Automated, self-healing restoration; the feature only requires a tested procedure for the operator to follow.
- Support for database technologies other than the default Galera/MySQL used by RHOSO.
Documentation Considerations (Initial completion while in Refinement status):
- Operator Guide: Dedicated, clear documentation for the Galera Restore Procedure with required prerequisite checks (e.g., required free space, cluster status).
- CRD Reference: Full reference documentation for the new GaleraBackup CRD, including all fields, schedules, and status indicators.
- Troubleshooting: Common failure scenarios for both backup (e.g., storage full) and restore (e.g., data inconsistency) and mitigation steps.
- Integration: Documentation linking this feature to the overall disaster recovery story for RHOSO.
Questions to Answer (Initial completion while in Refinement status):
Include a list of refinement / architectural questions that may need to be answered before coding can begin.
<your text here>
Background and Strategic Fit (Initial completion while in Refinement status):
Explained in the attached Outcome
Customer Considerations (Initial completion while in Refinement status):
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.
<your text here>
Team Sign Off (Completion while in Planning status)
- All required Epics (known at the time) are linked to this Feature
- All required Stories, Tasks (known at the time) for the most immediate Epics have been created and estimated
- Add - Reviewers name, Team Name
- Acceptance == Feature is "Ready": it is well understood, its scope is clear, and its Acceptance Criteria (scope) are elaborated, well defined, and understood
- Note: Only set FixVersion/s: on a Feature if the delivery team agrees they have the capacity and have committed that capability for that milestone
Reviewed By | Team Name | Accepted | Notes |
- …