Bug
Resolution: Unresolved
Major
None
4.19.z
None
Description of problem:
HostedControlPlane deletion is blocked by the Karpenter finalizer when kube-apiserver pods are in Pending state, creating a deadlock that prevents cluster cleanup. A HyperShift tenant cluster's HostedControlPlane object becomes stuck in deletion because the hypershift.openshift.io/karpenter-finalizer cannot be removed, due to a circular dependency:
1. The HostedControlPlane is marked for deletion (deletionTimestamp set).
2. kube-apiserver pods are in Pending state (not running).
3. Karpenter pods cannot complete initialization - the availability-prober init container waits for kube-apiserver readiness.
4. Since the Karpenter controller never starts, it cannot perform its cleanup operations.
5. The Karpenter finalizer is never removed, blocking HostedControlPlane deletion.
6. Control plane resources (hundreds of pods/objects) remain in the namespace indefinitely.
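The blocked init container can be confirmed directly in the stuck control plane namespace; a quick check (the app=karpenter label selector is an assumption about how the pods are labeled):

$ oc get deployment karpenter -n <namespace> \
    -o jsonpath='{.spec.template.spec.initContainers[*].name}'
$ oc get pod -n <namespace> -l app=karpenter \
    -o jsonpath='{.items[*].status.initContainerStatuses[*].state}'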
Version-Release number of selected component (if applicable):
Management Cluster: OpenShift 4.19.17
HyperShift Operator: quay.io/acm-d/rhtap-hypershift-operator@sha256:a929f8882b7568613a9dea60739e2c10d692945224c2ae0724e3b8c0db10cc0c
Tenant Cluster: 4.19.15
Platform: AWS (ROSA HCP)
How reproducible:
Observed in production. Likely reproducible when:
- HostedControlPlane deletion is triggered
- kube-apiserver pods cannot schedule (node constraints, resource issues, etc.)
- Karpenter is enabled on the cluster
Steps to Reproduce:
1. Create a HyperShift ROSA HCP cluster with Karpenter enabled.
2. Delete the HostedCluster/HostedControlPlane.
3. Ensure kube-apiserver pods go into Pending state during deletion - simulate with node taints, resource constraints, or node unavailability (one way to do this with taints is sketched after this list).
4. Observe Karpenter pods stuck in Init:0/2 state.
5. Check that the HostedControlPlane finalizers remain indefinitely.
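One way to simulate step 3 on a disposable management cluster; the repro=kas-pending taint key and the app=kube-apiserver label selector are illustrative assumptions, not fixed names:

$ # Taint all workers so rescheduled kube-apiserver pods have nowhere to land
$ for node in $(oc get nodes -l node-role.kubernetes.io/worker \
    -o jsonpath='{.items[*].metadata.name}'); do
    oc adm taint nodes "$node" repro=kas-pending:NoSchedule
  done
$ # Evict the running kube-apiserver pods so their replacements stay Pending
$ oc delete pod -n <namespace> -l app=kube-apiserver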
Actual results:
HostedControlPlane stuck with finalizer for 14+ days:
$ oc get hostedcontrolplane -n <namespace> <name> -o jsonpath='{.metadata.finalizers}'
["hypershift.openshift.io/karpenter-finalizer"]
$ oc get hostedcontrolplane -n <namespace> <name> -o jsonpath='{.metadata.deletionTimestamp}'
2025-12-19T20:20:10Z
Kube-apiserver pods in Pending state:
$ oc get pod -n <namespace> | grep kube-apiserver
kube-apiserver-6cbcf59d5d-2rlhd 0/5 Pending 0 10d
kube-apiserver-6cbcf59d5d-6mt7f 0/5 Pending 0 7h48m
Karpenter pods stuck in Init containers:
$ oc get deployment,pod -n <namespace> | grep karpenter
deployment.apps/karpenter 0/1 1 0 79d
deployment.apps/karpenter-operator 0/1 1 0 79d
pod/karpenter-6cff7dd686-k2cjw 0/2 Init:0/2 0 6h23m
pod/karpenter-788664774d-d77t4 0/2 Init:0/2 0 6h23m
pod/karpenter-operator-54d77db7bc-4w4pq 0/2 Init:0/1 0 6h23m
Karpenter availability-prober logs show continuous connection failures:
{"level":"error","msg":"Request failed, retrying...","error":"Get \"https://kube-apiserver:6443/readyz\": dial tcp 172.30.88.125:6443: connect: connection refused"}
HostedControlPlane status shows degraded:
status:
  conditions:
  - type: Degraded
    status: "True"
    reason: UnavailableReplicas
    message: karpenter deployment has 2 unavailable replicas
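The Degraded condition can also be queried directly, for example:

$ oc get hostedcontrolplane -n <namespace> <name> \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'
karpenter deployment has 2 unavailable replicas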
Expected results:
When a HostedControlPlane is marked for deletion and the control plane is non-functional, the HyperShift operator should:
1. Detect that the control plane is unavailable and cannot be recovered.
2. Remove finalizers that depend on running control plane components after a grace period.
3. Allow cleanup to complete even when the Karpenter controller cannot run.
The HostedControlPlane and all associated resources should be cleaned up within a reasonable timeframe (minutes to hours, not days/weeks). One possible shape for the grace-period fallback is sketched below.
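To make the grace-period idea concrete, a minimal sketch as an external watchdog script; this is not an existing HyperShift mechanism, and the 2-hour default, the Degraded check, and the finalizer index are all illustrative assumptions:

$ cat force-finalizer-cleanup.sh
#!/bin/bash
# Hypothetical fallback: if the HostedControlPlane has been deleting for
# longer than GRACE_SECONDS and its control plane is Degraded, drop the
# finalizer so cleanup can proceed.
NS=$1; NAME=$2; GRACE_SECONDS=${3:-7200}

deleted_at=$(oc get hostedcontrolplane "$NAME" -n "$NS" \
  -o jsonpath='{.metadata.deletionTimestamp}')
[ -z "$deleted_at" ] && exit 0  # not being deleted, nothing to do

# Age of the deletion request in seconds (GNU date assumed)
age=$(( $(date +%s) - $(date -d "$deleted_at" +%s) ))
degraded=$(oc get hostedcontrolplane "$NAME" -n "$NS" \
  -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}')

if [ "$age" -gt "$GRACE_SECONDS" ] && [ "$degraded" = "True" ]; then
  # WARNING: skips Karpenter's cloud-side cleanup; Karpenter-launched
  # instances may need to be terminated manually afterwards.
  # Index 0 assumes the Karpenter finalizer is the only entry, as in
  # the output shown under Actual results.
  oc patch hostedcontrolplane "$NAME" -n "$NS" --type=json \
    -p '[{"op":"remove","path":"/metadata/finalizers/0"}]'
fi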
Additional info:
Root Cause:
- Karpenter finalizer removal requires the Karpenter controller to be running
- Karpenter pods have an availability-prober init container that blocks on kube-apiserver readiness
- When kube-apiserver is unavailable, Karpenter pods never start
- The HyperShift operator waits indefinitely for finalizer removal
- No timeout or fallback cleanup mechanism exists

Impact:
- HostedControlPlane cannot be deleted
- Namespace remains with hundreds of resources
- Continued resource consumption on the management cluster
- Cluster stuck in "deleting" state indefinitely

Proposed Solutions:
1. Implement a grace period (e.g., 1-2 hours) after which finalizers are forcibly removed if the control plane is degraded
2. Add detection for a non-functional control plane and skip dependent finalizers
3. Implement an alternative cleanup path that doesn't require the Karpenter controller to be running

Similar patterns may affect other finalizers that depend on control plane availability during deletion. An immediate manual workaround is shown below.
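Until such a fallback exists, the workaround is to remove the finalizer by hand; note this skips whatever cloud-side cleanup the Karpenter controller would have performed, so Karpenter-launched instances may need manual termination. Assuming, as in the output above, that the Karpenter finalizer is the only one left:

$ oc patch hostedcontrolplane <name> -n <namespace> --type=merge \
    -p '{"metadata":{"finalizers":null}}'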