OpenShift Bugs / OCPBUGS-70301

Tenant cluster not deprovisioning due to karpenter finalizer

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: 4.19.z
    • Sprint: AUTOSCALE - Sprint 285

      Description of problem:

      HostedControlPlane deletion is blocked by Karpenter finalizer when kube-apiserver pods are in Pending state, creating a deadlock condition that prevents cluster cleanup.
      
      A HyperShift tenant cluster's HostedControlPlane object becomes stuck in deletion when the hypershift.openshift.io/karpenter-finalizer cannot be removed due to a circular dependency:
      
      1. HostedControlPlane is marked for deletion (deletionTimestamp set)
      2. kube-apiserver pods are in Pending state (not running)
      3. Karpenter pods cannot complete initialization: the availability-prober init container waits for kube-apiserver readiness
      4. Since Karpenter controller never starts, it cannot perform cleanup operations
      5. Karpenter finalizer cannot be removed, blocking HostedControlPlane deletion
      6. Control plane resources (hundreds of pods/objects) remain in namespace indefinitely
      

      Version-Release number of selected component (if applicable):

      Management Cluster: OpenShift 4.19.17
      HyperShift Operator: quay.io/acm-d/rhtap-hypershift-operator@sha256:a929f8882b7568613a9dea60739e2c10d692945224c2ae0724e3b8c0db10cc0c
      Tenant Cluster: 4.19.15
      Platform: AWS (ROSA HCP)
      

      How reproducible:

      Observed in production - likely reproducible when:
      - HostedControlPlane deletion is triggered
      - kube-apiserver pods cannot schedule (node constraints, resource issues, etc.)
      - Karpenter is enabled on the cluster
      

      Steps to Reproduce:

      1. Create a HyperShift ROSA HCP cluster with Karpenter enabled
      2. Delete the HostedCluster/HostedControlPlane
      3. Ensure kube-apiserver pods go into Pending state during deletion (simulate with node taints, resource constraints, or node unavailability)
      4. Observe Karpenter pods stuck in Init:0/2 state
      5. Check HostedControlPlane finalizers remain indefinitely
      

      Actual results:

      HostedControlPlane stuck with finalizer for 14+ days:
      
      $ oc get hostedcontrolplane -n <namespace> <name> -o jsonpath='{.metadata.finalizers}'
      ["hypershift.openshift.io/karpenter-finalizer"]
      
      $ oc get hostedcontrolplane -n <namespace> <name> -o jsonpath='{.metadata.deletionTimestamp}'
      2025-12-19T20:20:10Z
      
      Kube-apiserver pods in Pending state:
      $ oc get pod -n <namespace> | grep kube-apiserver
      kube-apiserver-6cbcf59d5d-2rlhd    0/5   Pending   0   10d
      kube-apiserver-6cbcf59d5d-6mt7f    0/5   Pending   0   7h48m
      
      Karpenter pods stuck in Init containers:
      $ oc get deployment,pod -n <namespace> | grep karpenter
      deployment.apps/karpenter                  0/1   1   0   79d
      deployment.apps/karpenter-operator         0/1   1   0   79d
      pod/karpenter-6cff7dd686-k2cjw             0/2   Init:0/2   0   6h23m
      pod/karpenter-788664774d-d77t4             0/2   Init:0/2   0   6h23m
      pod/karpenter-operator-54d77db7bc-4w4pq    0/2   Init:0/1   0   6h23m
      
      Karpenter availability-prober logs show continuous connection failures:
      {"level":"error","msg":"Request failed, retrying...","error":"Get \"https://kube-apiserver:6443/readyz\": dial tcp 172.30.88.125:6443: connect: connection refused"}
      
      HostedControlPlane status shows degraded:
      status:
        conditions:
        - type: Degraded
          status: "True"
          reason: UnavailableReplicas
          message: karpenter deployment has 2 unavailable replicas
      

      Expected results:

      When a HostedControlPlane is marked for deletion and the control plane is non-functional, the HyperShift operator should:
      
      1. Detect that control plane is unavailable and cannot be recovered
      2. Remove finalizers that depend on running control plane components after a grace period
      3. Allow cleanup to complete even when Karpenter controller cannot run
      
      The HostedControlPlane and all associated resources should be cleaned up within a reasonable timeframe (minutes to hours, not days/weeks).
      

      Additional info:

      Root Cause:
      - Karpenter finalizer removal requires Karpenter controller to be running
      - Karpenter pods have availability-prober init container that blocks on kube-apiserver readiness
      - When kube-apiserver is unavailable, Karpenter pods never start
      - HyperShift operator waits indefinitely for finalizer removal
      - No timeout or fallback cleanup mechanism exists
      
      Impact:
      - HostedControlPlane cannot be deleted
      - Namespace remains with hundreds of resources
      - Continued resource consumption on management cluster
      - Cluster stuck in "deleting" state indefinitely
      
      Proposed Solutions:
      1. Implement grace period (e.g., 1-2 hours) after which finalizers are forcibly removed if control plane is degraded
      2. Add detection for non-functional control plane and skip dependent finalizers
      3. Implement alternative cleanup path that doesn't require Karpenter controller to be running
      
      Similar patterns may affect other finalizers that depend on control plane availability during deletion.
      

              rh-ee-macao Max Cao
              tnierman.openshift Trevor Nierman
              Jie Zhao Jie Zhao