- Bug
- Resolution: Done
- Major
- ACM 2.7.3
- 2
- False
- None
- False
- GRC Sprint 2023-07
- Important
- No
Description of problem:
On an uninstall attempt of a HyperShift cluster, config-policy-controller-uninstaller was triggered but failed, locking up the hosted cluster uninstall process.
This bug was split from ACM-4854 after realizing the two similarly-symptomatic bugs had different root causes. This comment from jkulikau@redhat.com provides further context:
The config-policy-controller-uninstaller signals to the main config-policy-controller to clean up the finalizers, then it waits for the finalizers to be cleaned up. By design, the uninstaller does not remove the finalizers itself, because the main controller would just add them back. In the situations we've seen here, the main controller is not healthy, so it can never remove the finalizers itself.
The first time we saw this issue (and at least once more since then), the main config-policy-controller pod was not able to initialize because a secret it needs to mount was missing. The error in the pod describe was: `Warning FailedMount 2m21s (x156 over 5h3m) kubelet MountVolume.SetUp failed for volume "klusterlet-config" : secret "config-policy-controller-hub-kubeconfig" not found`.
How reproducible:
It is unclear what triggers the early removal of the required secret config-policy-controller-hub-kubeconfig, but the config-policy-controller bug can be reproduced artificially by deleting the secret before starting a ManagedCluster deletion.
Steps to Reproduce:
- Remove config-policy-controller's access to secret/config-policy-controller-hub-kubeconfig
- Trigger hosted cluster uninstall/ManagedCluster deletion
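Actual results:
The config-policy-controller-uninstall pod keeps waiting for the uninstall preparation to complete, and the hosted cluster uninstall hangs.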
Expected results:
HC able to successfully uninstall.
Additional info:
On the MC in the klusterlet namespace:
❯ k get po -n klusterlet-$CLUSTER_ID_REDACTED
NAME                                         READY   STATUS              RESTARTS        AGE
config-policy-controller-b66b8f78-qdqfq      0/2     ContainerCreating   0               17h
config-policy-controller-uninstall           1/1     Running             160 (91s ago)   18h
klusterlet-$CLUSTER_ID-registration-aqs7rs   1/1     Running             0               18h
klusterlet-$CLUSTER_ID-work-agent-7798mkk2   1/1     Running             0               18h
The config-policy-controller-uninstaller pod logs:
I0407 15:19:03.399787 1 triggeruninstall.go:107] The uninstall preparation is not complete. Sleeping two seconds before checking again.
I0407 15:19:05.400853 1 triggeruninstall.go:72] Checking if the uninstall preparation is complete
The above information is discussed in this Slack thread:
https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1680879204683799