  1. Red Hat Advanced Cluster Management
  2. ACM-5274

config-policy-controller fails to initialize due to missing secret


    • 2
    • False
    • None
    • False
    • GRC Sprint 2023-07
    • Important
    • No

      Description of problem:

      During an attempted uninstall of a HyperShift hosted cluster, config-policy-controller-uninstaller was triggered but failed, blocking the hosted cluster uninstall process.

      This bug was split from ACM-4854 after it became clear that the two bugs, despite similar symptoms, had different root causes. This comment from jkulikau@redhat.com provides further context:

      The config-policy-controller-uninstaller signals to the main config-policy-controller to clean up the finalizers, then it waits for the finalizers to be cleaned up. By design, the uninstaller does not remove the finalizers itself, because the main controller would just add them back. In the situations we've seen here, the main controller is not healthy, so it can never remove the finalizers itself.
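      The handshake described in this comment can be illustrated with a small simulation. This is a sketch under stated assumptions, not the controller's actual code: the `Policy` class, the finalizer name, the two controller functions, and the `max_polls` cutoff are all hypothetical and exist only so the example terminates. It shows why the wait never completes when the main controller is not running.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Stand-in for an object that carries a finalizer (hypothetical model)."""
    name: str
    # The finalizer name below is made up for illustration.
    finalizers: list = field(default_factory=lambda: ["example.io/cleanup"])


def healthy_controller(policies, uninstall_requested):
    """One reconcile pass: drop finalizers once uninstall has been signaled."""
    if uninstall_requested:
        for policy in policies:
            policy.finalizers.clear()


def stuck_controller(policies, uninstall_requested):
    """Models a controller pod that never started, so no reconcile ever runs."""
    return None


def simulate_uninstall(policies, controller, max_polls=5):
    """Signal the controller, then poll until every finalizer is gone.

    Returns the number of polls needed, or None if max_polls ran out.
    The uninstaller side only reads finalizers; it never removes them.
    """
    uninstall_requested = True  # the "signal" sent to the main controller
    for poll in range(1, max_polls + 1):
        controller(policies, uninstall_requested)  # controller's turn to act
        if all(not policy.finalizers for policy in policies):
            return poll
    return None


if __name__ == "__main__":
    print(simulate_uninstall([Policy("a"), Policy("b")], healthy_controller))  # prints 1
    print(simulate_uninstall([Policy("a")], stuck_controller))  # prints None
```

      With the stuck controller, the finalizers are still present after every poll, which corresponds to the uninstaller repeatedly reporting that the preparation is not complete.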

      The first time we saw this issue (and at least once more since then), the main config-policy-controller pod was not able to initialize because a secret it needs to mount was missing. The error in the pod describe was: `Warning FailedMount 2m21s (x156 over 5h3m) kubelet MountVolume.SetUp failed for volume "klusterlet-config" : secret "config-policy-controller-hub-kubeconfig" not found`.
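      When triaging similar cases, the missing secret can be pulled out of saved `describe` output mechanically. The following is a minimal sketch that assumes the event text has the same shape as the FailedMount message quoted above; the function name and the regular expression are illustrative and not part of any ACM tooling.

```python
import re

# Pattern modeled on the FailedMount message quoted in this issue.
MISSING_SECRET = re.compile(
    r'MountVolume\.SetUp failed for volume "(?P<volume>[^"]+)"\s*:\s*'
    r'secret "(?P<secret>[^"]+)" not found'
)


def find_missing_secrets(describe_output):
    """Return unique (volume, secret) pairs from FailedMount lines, in order."""
    found = []
    for line in describe_output.splitlines():
        if "FailedMount" not in line:
            continue
        match = MISSING_SECRET.search(line)
        if match:
            pair = (match.group("volume"), match.group("secret"))
            if pair not in found:
                found.append(pair)
    return found


if __name__ == "__main__":
    sample = (
        'Warning FailedMount 2m21s (x156 over 5h3m) kubelet '
        'MountVolume.SetUp failed for volume "klusterlet-config" : '
        'secret "config-policy-controller-hub-kubeconfig" not found'
    )
    print(find_missing_secrets(sample))
```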

      How reproducible:

      It is unclear what triggers the early removal of the required secret config-policy-controller-hub-kubeconfig, but the config-policy-controller failure can be reproduced artificially by deleting the secret before starting a ManagedCluster deletion.

      Steps to Reproduce:

      1. Remove config-policy-controller's access to secret/config-policy-controller-hub-kubeconfig
      2. Trigger hosted cluster uninstall/ManagedCluster deletion
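      The two steps above could be scripted roughly as follows. This is a hedged sketch: it assumes that deleting the secret is an acceptable way to remove access, that the namespace and ManagedCluster name are supplied by the caller, and that each command is run against the appropriate cluster context (the report does not state which cluster holds which resource). The helper names are hypothetical, and the script only prints the commands unless `dry_run=False` is passed.

```python
import subprocess

SECRET = "config-policy-controller-hub-kubeconfig"


def repro_commands(namespace, managed_cluster):
    """Build the kubectl invocations for the two reproduction steps."""
    return [
        # Step 1: remove the secret the controller needs to mount.
        ["kubectl", "delete", "secret", SECRET, "-n", namespace],
        # Step 2: start the ManagedCluster deletion without blocking on finalizers.
        ["kubectl", "delete", "managedcluster", managed_cluster, "--wait=false"],
    ]


def run(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder values; substitute the real namespace and cluster name.
    run(repro_commands("klusterlet-EXAMPLE", "EXAMPLE"))
```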

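      Actual results:

      The hosted cluster uninstall hangs: the config-policy-controller pod stays in ContainerCreating, and the uninstaller keeps waiting for the finalizers to be removed.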

      Expected results:

      The hosted cluster (HC) uninstalls successfully.

      Additional info:

      On the MC, in the klusterlet namespace:

      ❯ k get po -n klusterlet-$CLUSTER_ID_REDACTED
      NAME                                         READY   STATUS              RESTARTS        AGE
      config-policy-controller-b66b8f78-qdqfq      0/2     ContainerCreating   0               17h
      config-policy-controller-uninstall           1/1     Running             160 (91s ago)   18h
      klusterlet-$CLUSTER_ID-registration-aqs7rs   1/1     Running             0               18h
      klusterlet-$CLUSTER_ID-work-agent-7798mkk2   1/1     Running             0               18h

      The config-policy-controller-uninstaller pod logs:

      I0407 15:19:03.399787       1 triggeruninstall.go:107] The uninstall preparation is not complete. Sleeping two seconds before checking again.
      I0407 15:19:05.400853       1 triggeruninstall.go:72] Checking if the uninstall preparation is complete 

      The above information is discussed in this Slack thread:
      https://redhat-internal.slack.com/archives/C04EUL1DRHC/p1680879204683799

            zyin@redhat.com Zhiwei Yin
            dalong.openshift Dakota Long
            Derek Ho Derek Ho