  1. OpenShift Request For Enhancement
  2. RFE-8779

Bypass PDB for Unhealthy HCPs

    • Feature Request
    • Resolution: Unresolved
    • ARO, Hosted Control Planes
    • Product / Portfolio Work

      1. Proposed Title:

      Hypershift: Automatic PDB bypass mechanism for unhealthy Hosted Control Planes during management cluster upgrades

      2. Nature and Description of the Request:

      ARO-HCP (and other managed Hypershift offerings) requires a mechanism to bypass Pod Disruption Budgets (PDBs) for Hosted Control Planes (HCPs) that are in a fundamentally unhealthy state, specifically when those PDBs block management cluster node drains during infrastructure upgrades.

      Proposed behavior:

      • Services (ARO, ROSA, etc.) should detect when a Hosted Control Plane is fundamentally unhealthy (e.g., complete kube-apiserver loss, persistent crash loops, unrecoverable state).
      • When a Service identifies an unhealthy HCP whose PDBs are blocking a pending management cluster upgrade or drain, the Service needs a way to tell Hypershift that those PDBs may be bypassed. This could be a signal set on that HCP instance (for example, "force-drain allowed") that removes the PDBs for that instance until the signal is cleared.
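
      To make the proposed signal concrete, here is a minimal sketch of how a Service might set and clear it, assuming the signal is an annotation on the HostedCluster. The annotation key `example.com/force-drain-allowed` and both helper functions are hypothetical; Hypershift does not define such a signal today, and the final design could just as well be a spec field.

```shell
# Hypothetical annotation key, used for illustration only.
# It is NOT an existing Hypershift API.
FORCE_DRAIN_ANNOTATION="example.com/force-drain-allowed"

# Print a JSON merge patch that sets the hypothetical signal.
force_drain_patch_set() {
  printf '{"metadata":{"annotations":{"%s":"true"}}}\n' "$FORCE_DRAIN_ANNOTATION"
}

# Print a JSON merge patch that clears the signal.
# In a JSON merge patch, a null value deletes the key.
force_drain_patch_clear() {
  printf '{"metadata":{"annotations":{"%s":null}}}\n' "$FORCE_DRAIN_ANNOTATION"
}

# Example usage (commented out so the sketch has no side effects):
# kubectl patch hostedcluster -n "${HCNS}" "${CLUSTER_NAME}" \
#   --type=merge -p "$(force_drain_patch_set)"
# ...drain the management cluster node, then:
# kubectl patch hostedcluster -n "${HCNS}" "${CLUSTER_NAME}" \
#   --type=merge -p "$(force_drain_patch_clear)"
```

      Setting and clearing the signal with the same patch mechanism keeps the bypass explicit and reversible, which matches the "until the signal is removed" behavior requested above.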

      3. Business Requirements:

      • Live-service availability: Management clusters must be upgradeable on a predictable schedule to address security vulnerabilities, apply RFEs, and maintain SLAs. A single broken customer control plane cannot be allowed to block upgrades for the entire management cluster.
      • Security posture: Delayed upgrades due to stuck PDBs extend exposure windows for CVEs affecting management cluster components.
      • Operational efficiency: SRE teams must currently intervene manually to identify and work around these situations, which increases toil and incident response time.

      4. Affected Packages/Components:

      • hypershift (core operator logic, PDB creation/management)
      • hypershift/control-plane-operator (health detection, PDB lifecycle)
      • HostedCluster API (potential new field for bypass authorization)

       

      NOTE:

      There is an existing workaround:

      `kubectl patch hostedcluster -n "${HCNS}" "${CLUSTER_NAME}" -p '{"spec":{"pausedUntil":"true"}}' --type="merge"`

      However, this is not ideal because it stops all reconciliation of anything related to the cluster. We would prefer a finer-grained control in production.
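
      Until a finer-grained mechanism exists, the workaround can at least be scoped in time and reverted explicitly. The sketch below assumes that `spec.pausedUntil` also accepts an RFC 3339 timestamp in addition to `"true"`; please verify this against the HostedCluster API of your Hypershift version. The helper names are hypothetical.

```shell
# Print a merge patch that pauses reconciliation until the given value.
# $1: "true" or (assumption) an RFC 3339 timestamp such as 2030-01-01T00:00:00Z
pause_patch() {
  printf '{"spec":{"pausedUntil":"%s"}}\n' "$1"
}

# Print a merge patch that removes pausedUntil, resuming reconciliation.
# In a JSON merge patch, a null value deletes the field.
unpause_patch() {
  printf '{"spec":{"pausedUntil":null}}\n'
}

# Example usage (commented out so the sketch has no side effects).
# The date invocation assumes GNU date:
# UNTIL="$(date -u -d '+2 hours' +%Y-%m-%dT%H:%M:%SZ)"
# kubectl patch hostedcluster -n "${HCNS}" "${CLUSTER_NAME}" \
#   --type=merge -p "$(pause_patch "${UNTIL}")"
# kubectl patch hostedcluster -n "${HCNS}" "${CLUSTER_NAME}" \
#   --type=merge -p "$(unpause_patch)"
```

      This still pauses all reconciliation for the cluster, so it narrows the duration of the workaround rather than its scope.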

              jboutaud@redhat.com Jerome Boutaud
              bbergen@redhat.com Brendan Bergen