Feature
Resolution: Unresolved
Normal
Product / Portfolio Work
Feature Overview
Implement a high-performance backup mechanism for the HyperShift OADP plugin that ensures data consistency while maintaining cluster availability. By shifting from a long-duration reconciliation freeze to a targeted snapshot approach, we will enable near-continuous operation of hosted cluster controllers. This solution focuses on decoupling the backup process from the global reconciliation loop, so that critical day-two operations like autoscaling and node provisioning remain functional throughout the backup window.
Why is this important?
To maintain a high-quality user experience and meet Recovery Point Objective (RPO) targets, we must reduce or eliminate the reconciliation pause during backups. Currently, this pause in the hypershift-oadp-plugin implementation prevents critical day-two operations, such as autoscaling, for 6 to 8 minutes per snapshot (often 20-30 minutes total per backup cycle on Azure). For customers requiring hourly backups, this translates to over two hours per day during which cluster operations are blocked, which is unacceptable for production workloads.
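The daily impact can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes exactly 24 backups per day with the pause durations quoted above.

```python
def daily_pause_minutes(pause_minutes_per_backup: float, backups_per_day: int) -> float:
    """Total minutes per day during which reconciliation is paused."""
    return pause_minutes_per_backup * backups_per_day


if __name__ == "__main__":
    # Assumption: hourly backups, i.e. 24 backup cycles per day.
    for pause in (6, 8, 20, 30):
        total = daily_pause_minutes(pause, 24)
        print(f"{pause} min/backup x 24 = {total:.0f} min/day ({total / 60:.1f} h)")
```

With a single 6 to 8 minute pause per hourly backup, the total is 144 to 192 minutes per day. If the full 20-30 minute cycle applies to every backup, the total rises to 480 to 720 minutes per day.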
Proposed solution
- Implement Pre-Backup Hooks: Utilize OADP pre-backup job hooks to execute an etcdctl snapshot directly to disk.
- Minimize Pause Duration: Transition to using the disk-based etcd snapshot for the Persistent Volume (PV) backup, allowing the reconciliation pause to be either eliminated or restricted to the brief period required for the initial etcdctl command.
- Ensure Consistency: This approach provides a tangible, consistent state for etcd without long-term freezing of cluster controllers.
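As a rough sketch of the snapshot step a pre-backup hook might run, the code below builds and times an `etcdctl snapshot save` invocation. It is not the plugin's implementation: the function names, the snapshot path, the endpoint, and the certificate paths are all hypothetical placeholders, and how the hook job is wired into OADP is left out.

```python
import subprocess
import time
from typing import Callable, List, Optional


def build_snapshot_command(
    snapshot_path: str,
    endpoint: str,
    cacert: Optional[str] = None,
    cert: Optional[str] = None,
    key: Optional[str] = None,
) -> List[str]:
    """Build the argv for `etcdctl snapshot save`, adding TLS flags only if given."""
    cmd = ["etcdctl", f"--endpoints={endpoint}"]
    if cacert:
        cmd.append(f"--cacert={cacert}")
    if cert:
        cmd.append(f"--cert={cert}")
    if key:
        cmd.append(f"--key={key}")
    cmd += ["snapshot", "save", snapshot_path]
    return cmd


def take_snapshot(cmd: List[str], runner: Callable = subprocess.run) -> float:
    """Run the snapshot command and return the elapsed seconds.

    The elapsed time is the only window during which a reconciliation
    pause would be needed under the proposed approach.
    """
    start = time.monotonic()
    runner(cmd, check=True)  # raises CalledProcessError on a non-zero exit
    return time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical path and endpoint; printed only, not executed here.
    print(" ".join(build_snapshot_command(
        "/var/lib/data/snapshot.db", "https://localhost:2379"
    )))
```

Passing the runner as a parameter keeps the timing logic testable without an etcd instance. In a real hook, the measured duration would indicate how short the remaining pause can be.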