Feature Request
Resolution: Unresolved
Product / Portfolio Work
1. Proposed Title:
Hypershift: Automatic PDB bypass mechanism for unhealthy Hosted Control Planes during management cluster upgrades
2. Nature and Description of the Request:
ARO-HCP (and other managed Hypershift offerings) requires a mechanism to bypass Pod Disruption Budgets (PDBs) for Hosted Control Planes (HCPs) that are in a fundamentally unhealthy state, specifically when those PDBs block management cluster node drain operations during infrastructure upgrades.
Proposed behavior:
- Services (ARO, ROSA, etc.) should detect when a Hosted Control Plane is fundamentally unhealthy (e.g., complete kube-apiserver loss, persistent crash loops, unrecoverable state).
- When a Service identifies an unhealthy HCP and a management cluster upgrade or drain is stuck pending as a result, the Service needs a way to tell Hypershift that the PDBs of that specific hosted cluster may be bypassed. This could be a signal the Service sets on that HCP instance (e.g., "force-drain allowed") that removes the PDBs for that HCP instance until the signal is removed.
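As a rough illustration of the signal described above, a Service could set and clear an annotation on the HostedCluster. This is only a sketch: the annotation key `example.com/force-drain-allowed` and the helper names `allow_force_drain` and `revoke_force_drain` are hypothetical and do not exist in Hypershift today. The actual signal could just as well be a new API field.

```shell
# Hypothetical annotation key, used only for illustration; Hypershift does not
# define it today.
FORCE_DRAIN_ANNOTATION="example.com/force-drain-allowed"

# Set the signal: the Service marks this HostedCluster as safe to force-drain.
allow_force_drain() {
  local ns="$1" name="$2"
  kubectl annotate hostedcluster -n "${ns}" "${name}" \
    "${FORCE_DRAIN_ANNOTATION}=true" --overwrite
}

# Remove the signal (a trailing "-" deletes an annotation). Under this
# proposal, Hypershift would then restore the PDBs for the cluster.
revoke_force_drain() {
  local ns="$1" name="$2"
  kubectl annotate hostedcluster -n "${ns}" "${name}" \
    "${FORCE_DRAIN_ANNOTATION}-"
}
```

A Service would call `allow_force_drain "${HCNS}" "${CLUSTER_NAME}"` before the drain and `revoke_force_drain "${HCNS}" "${CLUSTER_NAME}"` once the upgrade completes.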
3. Business Requirements:
- Live-service availability: Management clusters must be upgradeable on a predictable schedule to address security vulnerabilities, apply RFEs, and maintain SLAs. A single broken customer control plane cannot be allowed to block upgrades for the entire management cluster.
- Security posture: Delayed upgrades due to stuck PDBs extend exposure windows for CVEs affecting management cluster components.
- Operational efficiency: SRE teams currently require manual intervention to identify and work around these situations, increasing toil and incident response time.
4. Affected Packages/Components:
- hypershift (core operator logic, PDB creation/management)
- hypershift/control-plane-operator (health detection, PDB lifecycle)
- HostedCluster API (potential new field for bypass authorization)
NOTE:
There is an existing workaround:
`kubectl patch hostedcluster -n "${HCNS}" "${CLUSTER_NAME}" -p '{"spec":{"pausedUntil":"true"}}' --type="merge"`
However, this is not ideal because it stops all reconciliation of anything related to the cluster. We would rather have finer-grained control in production.
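Until a finer-grained mechanism exists, the workaround can at least be bounded in time and explicitly reverted. The sketch below assumes, based on our reading of the HostedCluster API, that `pausedUntil` also accepts an RFC 3339 timestamp. The helper names `pause_hc_until` and `unpause_hc` are hypothetical. This still pauses all reconciliation for the cluster, so it does not replace the requested feature.

```shell
# Pause reconciliation until an RFC 3339 timestamp instead of indefinitely.
# Assumption: pausedUntil accepts a timestamp as well as "true".
pause_hc_until() {
  local ns="$1" name="$2" until="$3"
  kubectl patch hostedcluster -n "${ns}" "${name}" --type="merge" \
    -p "{\"spec\":{\"pausedUntil\":\"${until}\"}}"
}

# Resume reconciliation: in a JSON merge patch, null removes the field.
unpause_hc() {
  local ns="$1" name="$2"
  kubectl patch hostedcluster -n "${ns}" "${name}" --type="merge" \
    -p '{"spec":{"pausedUntil":null}}'
}
```

For example, `pause_hc_until "${HCNS}" "${CLUSTER_NAME}" "2030-01-01T00:00:00Z"` would pause the cluster until that time, and `unpause_hc "${HCNS}" "${CLUSTER_NAME}"` would lift the pause early.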