OpenShift Bugs / OCPBUGS-13184

Uninstalled hostedcontrolplane leaves some PDBs behind


    • Moderate
    • No
    • Hypershift Sprint 242
    • 1
    • Rejected
    • False

      Description of problem:

      Even after a hostedcontrolplane is successfully deleted, some PDBs can remain behind:
      
      ❯ k get pdb -n ocm-production-23eq513k328ldsivorgobo1vbu9j670k-mhs-hyper 
      NAME                                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
      aws-ebs-csi-driver-controller-pdb   N/A             1                 0                     4d1h
      csi-snapshot-controller-pdb         N/A             1                 0                     4d1h
      csi-snapshot-webhook-pdb            N/A             1                 0                     4d1h
      ovn-raft-quorum-guard               2               N/A               0                     4d1h
      
      Sometimes this happens because the uninstall gets stuck on another component, for example an awsendpointservice as in OCPBUGS-13056. The leftover PDBs cause upgrades and/or MachineConfig rollouts to fail on the management cluster, because the machine-config-operator cannot drain nodes that are running the pods these PDBs cover. The state of the pods in the hostedcontrolplane namespace is:
      
      ❯ k get po -n ocm-production-23eq513k328ldsivorgobo1vbu9j670k-mhs-hyper                                      
      NAME                                                     READY   STATUS              RESTARTS   AGE
      audit-webhook-585fbf86d6-mmjx4                           0/2     Init:0/1            0          5h40m
      audit-webhook-585fbf86d6-q2rxl                           0/2     Init:0/1            0          9h
      audit-webhook-7b56b4ddbc-46rdh                           0/2     Init:0/1            0          22h
      aws-ebs-csi-driver-controller-bcb6bf86-k6vd5             0/7     ContainerCreating   0          9h
      aws-ebs-csi-driver-operator-5665898f58-f4bg8             0/1     ContainerCreating   0          9h
      capi-provider-5d77d577c4-bdbxw                           0/2     Init:0/1            0          5h40m
      cloud-network-config-controller-76f4ccf965-hjhrj         0/3     Init:0/1            0          5h40m
      cluster-api-fc4f9579-zcws2                               1/1     Running             0          5h41m
      control-plane-operator-598ffd9696-kssb2                  2/2     Running             0          5h40m
      csi-snapshot-controller-7d89bf444-khmbp                  0/1     ContainerCreating   0          9h
      csi-snapshot-webhook-8c945d4f9-dvmpw                     0/1     ContainerCreating   0          9h
      metrics-forwarder-deployment-65db577ff6-76jgl            1/1     Running             0          3h46m
      metrics-forwarder-secret-ensurer-28055191-zdqfk          0/1     Completed           0          2m6s
      metrics-forwarder-secret-ensurer-28055192-gwqh2          0/1     Completed           0          66s
      metrics-forwarder-secret-ensurer-28055193-27n9k          0/1     Completed           0          6s
      multus-admission-controller-7b6db8b49d-8ls5t             0/2     Init:0/1            0          5h40m
      multus-admission-controller-7b6db8b49d-kcsgw             0/2     Init:0/1            0          5h40m
      ovnkube-master-0                                         0/7     Init:0/1            0          3h45m
      ovnkube-master-1                                         0/7     Init:0/1            0          9h
      ovnkube-master-2                                         5/7     Running             0          4d1h
      package-operator-remote-phase-manager-7f54bf6c68-2sfnh   0/1     ContainerCreating   0          3h46m
      validating-webhook-patching-job-28055130-db729           0/1     Completed           0          63m
      validating-webhook-patching-job-28055160-88drq           0/1     Completed           0          33m
      validating-webhook-patching-job-28055190-hwcw6           0/1     Completed           0          3m6s
      validation-webhook-6bb4c7ddc5-qk8sf                      1/1     Running             0          3d2h
      validation-webhook-6bb4c7ddc5-rrvh5                      1/1     Running             0          5h40m
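
      For context on why every PDB above reports 0 allowed disruptions: Kubernetes computes that value as the number of currently healthy pods minus the number that must stay healthy, floored at zero. The following is a minimal Python sketch of that arithmetic for integer minAvailable/maxUnavailable values only (percentages are not handled). The function name and the pod counts in the usage lines are illustrative assumptions read off the listings above, not values taken from the cluster's PDB status.

```python
def allowed_disruptions(current_healthy, expected_pods,
                        min_available=None, max_unavailable=None):
    """Return how many pods a PDB would let be evicted right now.

    Sketch only: handles integer minAvailable / maxUnavailable,
    not percentage values.
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")

    if min_available is not None:
        # minAvailable directly states how many pods must stay healthy.
        desired_healthy = min_available
    else:
        # maxUnavailable is relative to the expected pod count.
        desired_healthy = max(0, expected_pods - max_unavailable)

    # Evictions are refused once this reaches zero.
    return max(0, current_healthy - desired_healthy)


# Assumption: 3 ovnkube-master pods expected, none fully ready (0/7, 0/7, 5/7).
print(allowed_disruptions(0, 3, min_available=2))
# Assumption: 1 csi-snapshot-controller pod expected, stuck in ContainerCreating.
print(allowed_disruptions(0, 1, max_unavailable=1))
```

      Because no covered pod in the leftover namespace becomes healthy, the result stays at zero and a node drain that needs to evict one of these pods cannot proceed until the PDB is removed.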

      Version-Release number of selected component (if applicable):

      Observed throughout 4.12 hosted clusters

      How reproducible:

      100% reproducible if a hostedcontrolplane is deleted, but other resources remain

      Steps to Reproduce:

      1. Allow a hostedcontrolplane CR to completely delete
      2. Cause the uninstall to fail due to an awsendpointservice that won't delete
      3. In the meantime, the hostedcontrolplane namespace still contains leftover pods and PDBs that prevent the management cluster from completing MachineConfig rollouts
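
      To check whether a namespace is in this state, the PDB listing can be scanned for entries that allow no disruptions. The following is a rough Python sketch that parses the plain-text table format shown in the description. The function name is hypothetical, and the parser assumes the five-column default output (NAME, MIN AVAILABLE, MAX UNAVAILABLE, ALLOWED DISRUPTIONS, AGE) with no spaces inside any value.

```python
def blocking_pdbs(table_text):
    """Return names of PDBs whose ALLOWED DISRUPTIONS column is 0.

    Sketch only: expects the default five-column text table, with the
    header on the first line and whitespace-separated values below it.
    """
    names = []
    for line in table_text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # name, min available, max unavailable, allowed disruptions, age
        if len(fields) == 5 and fields[3] == "0":
            names.append(fields[0])
    return names


# Sample copied from the listing in the description.
SAMPLE = """
NAME                                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
aws-ebs-csi-driver-controller-pdb   N/A             1                 0                     4d1h
csi-snapshot-controller-pdb         N/A             1                 0                     4d1h
csi-snapshot-webhook-pdb            N/A             1                 0                     4d1h
ovn-raft-quorum-guard               2               N/A               0                     4d1h
"""

for name in blocking_pdbs(SAMPLE):
    print(name)
```

      Any name this returns for a namespace whose hostedcontrolplane is already gone is a candidate for the leftover state described in this report.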
      

      Actual results:

       

      Expected results:

      When a hostedcontrolplane is deleted, all PDBs are cleaned up earlier on.

      Additional info:

      This bug doesn't impact the management clusters if every hostedcluster/hostedcontrolplane deletion is eventually able to clean up the hostedcontrolplane namespace. So I understand that if we have high confidence the uninstall bugs are ironed out, this bug itself may not present an issue. However, so far we are still in a state where new uninstall issues crop up fairly regularly.

              agarcial@redhat.com Alberto Garcia Lamela
              mshen.openshift Michael Shen
              Jie Zhao Jie Zhao