  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-2802

Optimize Hypershift Reconciliation During OADP Backups

      Feature Overview

      Implement a high-performance backup mechanism for the HyperShift OADP plugin that ensures data consistency while maintaining cluster availability. By shifting from a long-duration reconciliation freeze to a targeted snapshot approach, we will enable near-continuous operation of hosted cluster controllers. This solution focuses on decoupling the backup process from the global reconciliation loop, ensuring that critical day-two operations like autoscaling and node provisioning remain functional throughout the backup window.

      Why is this important?

      To maintain a high-quality user experience and meet Recovery Point Objective (RPO) targets, we must reduce or eliminate the reconciliation pause during backups. Currently, this pause in the hypershift-oadp-plugin implementation prevents critical day-two operations, such as autoscaling, for 6 to 8 minutes per snapshot (often 20-30 minutes total per backup cycle on Azure). For customers requiring hourly backups, this translates to over two hours of daily downtime for cluster operations, which is unacceptable for production workloads.
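
      The daily impact quoted above can be checked with a short calculation. This is a rough sketch that assumes one reconciliation pause per backup and 24 backups per day (hourly); the function name is illustrative only.

```python
def daily_pause_minutes(pause_minutes_per_backup, backups_per_day=24):
    """Total minutes per day that reconciliation is paused.

    Assumes every backup pauses reconciliation exactly once, for the given
    duration, and that backups run hourly by default.
    """
    return pause_minutes_per_backup * backups_per_day


if __name__ == "__main__":
    # Pause ranges taken from the figures quoted in this issue.
    scenarios = [
        ("per-snapshot pause (6-8 min)", 6, 8),
        ("full backup cycle on Azure (20-30 min)", 20, 30),
    ]
    for label, low, high in scenarios:
        low_hours = daily_pause_minutes(low) / 60
        high_hours = daily_pause_minutes(high) / 60
        print(f"{label}: {low_hours:.1f} to {high_hours:.1f} hours per day")
```

      Under these assumptions the per-snapshot figure alone gives 2.4 to 3.2 hours of paused reconciliation per day, and the full Azure backup cycle gives 8 to 12 hours.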

      Proposed solution

      • Implement Pre-Backup Hooks: Utilize OADP pre-backup job hooks to execute an etcdctl snapshot directly to disk.
      • Minimize Pause Duration: Transition to using the disk-based etcd snapshot for the Persistent Volume (PV) backup, allowing the reconciliation pause to be either eliminated or restricted only to the brief period required for the initial etcdctl command.
      • Ensure Consistency: This approach provides a tangible, consistent state for etcd without long-term freezing of cluster controllers.
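
      The pre-backup hook idea above could look roughly like the following. This is a minimal sketch, not the plugin's implementation: it builds a Velero Backup manifest whose pre-backup exec hook runs `etcdctl snapshot save` inside the etcd pod before the PV is backed up. The function name, the `openshift-adp` namespace, the `app: etcd` label, the `etcd` container name, the snapshot path, and the timeout are all assumptions made for illustration, and the endpoint and TLS flags that etcdctl would need in a real hosted control plane are omitted. Whether the final design uses an exec hook or a separate job is not stated in this issue.

```python
import json


def build_backup_with_etcd_snapshot_hook(
    hcp_namespace,
    snapshot_path="/var/lib/data/snapshot.db",  # hypothetical path on the etcd PV
    hook_timeout="2m",                          # hypothetical upper bound for the hook
):
    """Return a Velero Backup manifest (as a dict) with a pre-backup exec hook.

    The hook writes an etcd snapshot to disk so that the PV backup captures a
    consistent file instead of relying on a long reconciliation pause.
    """
    # Endpoint and TLS flags are deliberately left out of this sketch.
    snapshot_cmd = ["etcdctl", "snapshot", "save", snapshot_path]

    return {
        "apiVersion": "velero.io/v1",
        "kind": "Backup",
        "metadata": {
            "name": f"{hcp_namespace}-etcd-snapshot",
            "namespace": "openshift-adp",  # assumed OADP install namespace
        },
        "spec": {
            "includedNamespaces": [hcp_namespace],
            "hooks": {
                "resources": [
                    {
                        "name": "etcd-snapshot-to-disk",
                        "includedNamespaces": [hcp_namespace],
                        # Assumed label on the hosted control plane etcd pods.
                        "labelSelector": {"matchLabels": {"app": "etcd"}},
                        "pre": [
                            {
                                "exec": {
                                    "container": "etcd",  # assumed container name
                                    "command": snapshot_cmd,
                                    "onError": "Fail",    # abort rather than back up without a snapshot
                                    "timeout": hook_timeout,
                                }
                            }
                        ],
                    }
                ]
            },
        },
    }


if __name__ == "__main__":
    # "clusters-example" is a placeholder hosted control plane namespace.
    print(json.dumps(build_backup_with_etcd_snapshot_hook("clusters-example"), indent=2))
```

      With a hook like this, any reconciliation pause would only need to cover the duration of the `etcdctl snapshot save` command, bounded here by the hook timeout, rather than the whole PV snapshot.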

              rhn-support-yli2 Yu Li
              racedoro@redhat.com Ramon Acedo
              Liangquan Li, Martin Gencur, Salvatore Dario Minonne
              Juan Manuel Parrilla Madrid
              Ge Liu
              Matthew Werner