Bug
Resolution: Won't Do
Normal
None
4.12
Description of problem:
Even after a hostedcontrolplane is successfully deleted, some PDBs can remain behind:

    ❯ k get pdb -n ocm-production-23eq513k328ldsivorgobo1vbu9j670k-mhs-hyper
    NAME                                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
    aws-ebs-csi-driver-controller-pdb   N/A             1                 0                     4d1h
    csi-snapshot-controller-pdb         N/A             1                 0                     4d1h
    csi-snapshot-webhook-pdb            N/A             1                 0                     4d1h
    ovn-raft-quorum-guard               2               N/A               0                     4d1h

Sometimes, this is because the uninstall gets stuck on another component, for example an awsendpointservice as in OCPBUGS-13056. This causes upgrades and/or MachineConfig rollouts to fail on the management cluster because the machine-config-operator cannot drain nodes that are running these pods.

The state of the pods in the hostedcontrolplane namespace is:

    ❯ k get po -n ocm-production-23eq513k328ldsivorgobo1vbu9j670k-mhs-hyper
    NAME                                                     READY   STATUS              RESTARTS   AGE
    audit-webhook-585fbf86d6-mmjx4                           0/2     Init:0/1            0          5h40m
    audit-webhook-585fbf86d6-q2rxl                           0/2     Init:0/1            0          9h
    audit-webhook-7b56b4ddbc-46rdh                           0/2     Init:0/1            0          22h
    aws-ebs-csi-driver-controller-bcb6bf86-k6vd5             0/7     ContainerCreating   0          9h
    aws-ebs-csi-driver-operator-5665898f58-f4bg8             0/1     ContainerCreating   0          9h
    capi-provider-5d77d577c4-bdbxw                           0/2     Init:0/1            0          5h40m
    cloud-network-config-controller-76f4ccf965-hjhrj         0/3     Init:0/1            0          5h40m
    cluster-api-fc4f9579-zcws2                               1/1     Running             0          5h41m
    control-plane-operator-598ffd9696-kssb2                  2/2     Running             0          5h40m
    csi-snapshot-controller-7d89bf444-khmbp                  0/1     ContainerCreating   0          9h
    csi-snapshot-webhook-8c945d4f9-dvmpw                     0/1     ContainerCreating   0          9h
    metrics-forwarder-deployment-65db577ff6-76jgl            1/1     Running             0          3h46m
    metrics-forwarder-secret-ensurer-28055191-zdqfk          0/1     Completed           0          2m6s
    metrics-forwarder-secret-ensurer-28055192-gwqh2          0/1     Completed           0          66s
    metrics-forwarder-secret-ensurer-28055193-27n9k          0/1     Completed           0          6s
    multus-admission-controller-7b6db8b49d-8ls5t             0/2     Init:0/1            0          5h40m
    multus-admission-controller-7b6db8b49d-kcsgw             0/2     Init:0/1            0          5h40m
    ovnkube-master-0                                         0/7     Init:0/1            0          3h45m
    ovnkube-master-1                                         0/7     Init:0/1            0          9h
    ovnkube-master-2                                         5/7     Running             0          4d1h
    package-operator-remote-phase-manager-7f54bf6c68-2sfnh   0/1     ContainerCreating   0          3h46m
    validating-webhook-patching-job-28055130-db729           0/1     Completed           0          63m
    validating-webhook-patching-job-28055160-88drq           0/1     Completed           0          33m
    validating-webhook-patching-job-28055190-hwcw6           0/1     Completed           0          3m6s
    validation-webhook-6bb4c7ddc5-qk8sf                      1/1     Running             0          3d2h
    validation-webhook-6bb4c7ddc5-rrvh5                      1/1     Running             0          5h40m
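Every PDB in the listing above reports 0 allowed disruptions, which is the condition under which the eviction API refuses to evict the covered pods during a drain. As a rough illustration only, the sketch below picks such PDBs out of the JSON form of a PDB list. The function name `blocking_pdbs` and the `example-healthy-pdb` entry are hypothetical and are not part of this report; the field names follow the `policy/v1` PodDisruptionBudget schema.

```python
def blocking_pdbs(pdb_list):
    """Return the sorted names of PDBs that currently allow zero disruptions.

    `pdb_list` is assumed to be the parsed JSON of a PodDisruptionBudgetList.
    A PDB with no status yet is treated as blocking, since it allows nothing.
    """
    blocked = []
    for item in pdb_list.get("items", []):
        status = item.get("status", {})
        if status.get("disruptionsAllowed", 0) == 0:
            blocked.append(item["metadata"]["name"])
    return sorted(blocked)


# Two PDBs from the listing above, trimmed to the relevant fields, plus one
# hypothetical healthy PDB (not from the report) to show the contrast.
sample = {
    "items": [
        {"metadata": {"name": "ovn-raft-quorum-guard"},
         "spec": {"minAvailable": 2},
         "status": {"disruptionsAllowed": 0}},
        {"metadata": {"name": "csi-snapshot-webhook-pdb"},
         "spec": {"maxUnavailable": 1},
         "status": {"disruptionsAllowed": 0}},
        {"metadata": {"name": "example-healthy-pdb"},  # hypothetical
         "spec": {"maxUnavailable": 1},
         "status": {"disruptionsAllowed": 1}},
    ]
}

print(blocking_pdbs(sample))
```

The input could come from `kubectl get pdb -n <namespace> -o json`, parsed with `json.loads`.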
Version-Release number of selected component (if applicable):
Observed throughout 4.12 hosted clusters
How reproducible:
100% reproducible if a hostedcontrolplane is deleted, but other resources remain
Steps to Reproduce:
1. Allow a hostedcontrolplane CR to completely delete
2. Cause the uninstall to fail due to an awsendpointservice that won't delete
3. In the meantime, the hostedcontrolplane namespace will have pods and PDBs left over that impact the ability of a management cluster to do MachineConfig rollouts
Actual results:
Expected results:
When a hostedcontrolplane is deleted, all PDBs are cleaned up early in the deletion process.
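Until cleanup happens earlier in the deletion, an operator might remove the leftover PDBs by hand. The sketch below only builds the `kubectl` command for that. It is a manual stopgap under the assumption that the hostedcontrolplane is already gone and nothing in the namespace still needs disruption protection; it is not how HyperShift itself handles deletion. The function name `pdb_cleanup_command` and the namespace `example-hcp-namespace` are hypothetical.

```python
def pdb_cleanup_command(namespace, dry_run=True):
    """Build a kubectl command that deletes every PDB in one namespace.

    Defaults to a client-side dry run so nothing is deleted until the
    caller explicitly passes dry_run=False.
    """
    if not namespace:
        raise ValueError("namespace must be non-empty")
    cmd = ["kubectl", "delete", "pdb", "--all", "-n", namespace]
    if dry_run:
        cmd.append("--dry-run=client")
    return cmd


# Hypothetical namespace name, for illustration only.
print(" ".join(pdb_cleanup_command("example-hcp-namespace")))
```

The returned list can be passed to `subprocess.run` once the dry-run output has been reviewed.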
Additional info:
This bug doesn't impact the management clusters as long as every hostedcluster/hostedcontrolplane deletion is eventually able to clean up the hostedcontrolplane namespace. I understand that if we have high confidence the uninstall bugs are ironed out, this bug itself may not present an issue. However, so far we are still in a state where new uninstall issues crop up fairly regularly.
- is related to: OCPBUGS-13056 awsendpointservice stuck deleting due to invalid STS token (Closed)