Bug · Resolution: Unresolved · Normal · 4.20.z · Quality / Stability / Reliability · Moderate
Description of problem:
Control plane deployments created by the control-plane-operator (including capi-provider, cluster-api, and cloud-controller-manager-*) do not have finalizers to protect against accidental deletion. These deployments run the controllers that manage CAPI resources (MachineDeployment, Machine, and platform-specific machines), and those resources DO have finalizers that require their controllers to be running in order to process cleanup.
If a deployment like capi-provider is accidentally deleted before HostedCluster deletion:
- The deployment is deleted immediately (no finalizer protection)
- The CAPI provider controller stops running
- During HostedCluster deletion, CAPI resources are marked for deletion
- CAPI resource finalizers cannot be processed (the controller is gone)
- Cloud resources (EC2 instances, VMs, disks, NICs, load balancers) are orphaned
- CAPI resources remain stuck in the Terminating state indefinitely
This affects all CAPI-based platforms: AWS, Azure, GCP, OpenStack, KubeVirt, PowerVS, Agent.
Code references:
- No finalizers on deployments: support/controlplane-component/builder.go (NewDeploymentComponent)
- CAPI resources have finalizers: vendor/sigs.k8s.io/cluster-api/api/v1beta1/machinedeployment_types.go:30
- Platform machine finalizers: vendor/sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta2/awsmachine_types.go:27
- Deletion flow: hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go:3185
- Compare with NodePool, which correctly uses a finalizer (see the sketch below): hypershift-operator/controllers/nodepool/nodepool_controller.go:54
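For illustration only, a minimal Go sketch of how finalizer protection could be attached to the generated deployments, mirroring the NodePool pattern. The finalizer name, the AddProtectionFinalizer helper, and the package name are hypothetical and are not existing HyperShift identifiers; controllerutil is the standard controller-runtime helper package.

package example

import (
	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Hypothetical finalizer name; HyperShift does not currently define this constant.
const controlPlaneProtectionFinalizer = "hypershift.openshift.io/control-plane-protection"

// AddProtectionFinalizer attaches a protective finalizer to a control plane
// deployment so that a stray "oc delete deployment" blocks until the owning
// controller confirms cleanup is safe and removes the finalizer.
func AddProtectionFinalizer(deployment *appsv1.Deployment) {
	controllerutil.AddFinalizer(deployment, controlPlaneProtectionFinalizer)
}

The owning controller would then be responsible for removing the finalizer once dependent CAPI resources are gone (see the sketch under Expected results).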
Version-Release number of selected component (if applicable):
All HyperShift versions
How reproducible:
Always - any accidental deletion of the capi-provider or cluster-api deployment triggers it
Steps to Reproduce:
- Create a HostedCluster with a NodePool on any CAPI platform (AWS, Azure, etc.)
- Wait for the MachineDeployment and cloud instances to be created
- Delete the capi-provider deployment: oc delete deployment capi-provider -n <control-plane-namespace>
- Delete the HostedCluster: oc delete hostedcluster <name>
Actual results:
- The deployment is deleted immediately, with no finalizer protection
- CAPI resources (MachineDeployment, Machine, AWSMachine) are stuck in the Terminating state
- Cloud resources (EC2 instances, VMs, disks) are orphaned and continue running
- Manual cleanup is required via the cloud provider console
- Potential cost implications from orphaned resources
Expected results:
- Critical deployments should have finalizers to prevent accidental deletion
- If a deployment is marked for deletion, deletion should wait for dependent resources to be cleaned up (see the sketch after this list)
- Cloud resources should be properly deleted when HostedCluster is deleted
- No orphaned cloud infrastructure
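For illustration only, a minimal Go sketch of the "wait for dependents" behavior, assuming the hypothetical protection finalizer from the sketch above: once the deployment has a deletion timestamp, the owning controller lists the CAPI Machines in the control plane namespace and removes the finalizer only after all of them are gone. The releaseWhenMachinesGone name is illustrative; client, clusterv1, and controllerutil are the standard controller-runtime and cluster-api packages.

package example

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Hypothetical finalizer name; see the sketch above.
const controlPlaneProtectionFinalizer = "hypershift.openshift.io/control-plane-protection"

// releaseWhenMachinesGone returns true once the deployment may actually be
// deleted, i.e. no CAPI Machines remain in its namespace and the protective
// finalizer has been removed.
func releaseWhenMachinesGone(ctx context.Context, c client.Client, deployment *appsv1.Deployment) (bool, error) {
	var machines clusterv1.MachineList
	if err := c.List(ctx, &machines, client.InNamespace(deployment.Namespace)); err != nil {
		return false, err
	}
	if len(machines.Items) > 0 {
		// Dependent CAPI resources still exist; keep the finalizer so the
		// provider deployment stays around to process their own finalizers.
		return false, nil
	}
	if controllerutil.RemoveFinalizer(deployment, controlPlaneProtectionFinalizer) {
		return true, c.Update(ctx, deployment)
	}
	return true, nil
}

In practice the check would likely also need to cover MachineDeployments, MachineSets, and the platform-specific machine types before releasing the deployment.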
Additional info:
Affected deployments (confirmed via code search):
Critical:
- capi-provider (manages platform machines: AWSMachine, AzureMachine, etc.)
- cluster-api (manages MachineDeployment, MachineSet, Machine)
Also potentially affected:
- cloud-controller-manager-* (AWS, Azure, OpenStack, KubeVirt, PowerVS)
- autoscaler
- karpenter/karpenter-operator