OpenShift Bugs / OCPBUGS-11686

[4.10] BareMetalHost CR fails to delete on cluster cleanup

Details

    • Important
    • No
    • 1
    • Metal Platform 234, Metal Platform 235
    • 2
    • Rejected
    • False
    • 1/3: Moving this to 4.12 POSTGA for Telco, not impacting Nokia (they use Billi, the impact here is on ZTP with ACM). Needs a release note for 4.12 GA.
      12/19: Yellow. Waiting on baremetal team to determine how to disable autocleaning on baremetal host
      12/15: pending next steps based on Telco findings re: ArgoCD
      12/12: Telco Engineering investigating re: possible to control the cleanup flow with ArgoCD
      12/8: Waiting on Telco to test various workarounds and answer questions.
      11/30: changed Telco rank/bucket to 2 since this currently looks like it'd be more of a lab issue (albeit w/ the potential to become a production thing, and albeit the availability of an ugly workaround)
      11/29: Waiting on bug owner to analyze data provided by Telco and debug live
      11/28: Waiting on info from Telco
      11/22: Waiting on info from Telco
      11/21: Yellow. Triage is in progress.
      11/9: proposing this be a Release Blocker for 4.12 since it impacts installation
      11/4: new to the Telco-Grade OCP 4.12 gating list
      Rel Note for Telco: Yes, proposed release note text added by Ian M in this jira

    Description

      This is a clone of issue OCPBUGS-11612. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-9955. The following is the description of the original issue:

      Description of problem:

      An OCP cluster (SNO) is installed using the assisted installer running on an ACM hub cluster.
      Hub cluster is OCP 4.10.33
      ACM is 2.5.4
      
      When a cluster fails to install, we remove the installation CRs and the cluster namespace from the hub cluster (to eventually redeploy). Termination of the namespace hangs indefinitely (14+ hours) because finalizers remain on resources in the namespace.
      
      To resolve the hang, we can remove the finalizers by editing both the secret referenced by BareMetalHost.spec.bmc.credentialsName and the BareMetalHost CR itself. Once these finalizers are removed, the namespace termination completes within a few seconds.
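
The manual workaround described above could be scripted roughly as follows. This is a minimal sketch, not a recommended fix: the function name and its two arguments (namespace and BareMetalHost name) are hypothetical, and it assumes the only blocking finalizers are on the BareMetalHost and its BMC credentials secret. Removing finalizers bypasses whatever cleanup the owning controller would otherwise perform.

```shell
#!/usr/bin/env bash
# Hypothetical helper: clear finalizers on a BareMetalHost and on the secret
# referenced by its .spec.bmc.credentialsName.
# Usage: clear_bmh_finalizers <namespace> <bmh-name>
clear_bmh_finalizers() {
  local ns="$1" bmh="$2" secret

  # Look up the BMC credentials secret referenced by the BareMetalHost.
  secret="$(oc get baremetalhost "$bmh" -n "$ns" \
    -o jsonpath='{.spec.bmc.credentialsName}')" || return 1

  # Drop all finalizers from the secret first (if one is referenced) ...
  if [ -n "$secret" ]; then
    oc patch secret "$secret" -n "$ns" --type=merge \
      -p '{"metadata":{"finalizers":null}}' || return 1
  fi

  # ... then from the BareMetalHost CR itself.
  oc patch baremetalhost "$bmh" -n "$ns" --type=merge \
    -p '{"metadata":{"finalizers":null}}'
}
```

For the namespace shown in this report, it would be called as `clear_bmh_finalizers cnfocto1 <bmh-name>`, where the host name has to be filled in.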

      Version-Release number of selected component (if applicable):

      OCP 4.10.33
      ACM 2.5.4

      How reproducible:

      Always

      Steps to Reproduce:

      1. Generate installation CRs (AgentClusterInstall, BMH, ClusterDeployment, InfraEnv, NMStateConfig, ...) with an invalid configuration parameter. Two scenarios validated to hit this issue:
        a. Invalid rootDeviceHints in BareMetalHost CR
        b. Invalid credentials in the secret referenced by BareMetalHost.spec.bmc.credentialsName
      2. Apply installation CRs to hub cluster
      3. Wait for cluster installation to fail
      4. Remove cluster installation CRs and namespace

      Actual results:

      Cluster namespace remains in terminating state indefinitely:
      $ oc get ns cnfocto1
      NAME       STATUS        AGE    
      cnfocto1   Terminating   17h
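
To see what is holding the namespace in Terminating, something like the sketch below may help. The function name is hypothetical. It prints the namespace status conditions (which normally report remaining content and finalizers) and then lists whatever objects are still present, one resource type at a time.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show why a namespace is stuck in Terminating.
# Usage: list_namespace_leftovers <namespace>
list_namespace_leftovers() {
  local ns="$1"

  # Namespace conditions usually name the resource kinds and finalizers
  # that are still pending.
  oc get namespace "$ns" \
    -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'

  # Walk every namespaced, listable resource type and print what is left.
  oc api-resources --verbs=list --namespaced -o name | while read -r kind; do
    oc get "$kind" -n "$ns" --show-kind --ignore-not-found
  done
}
```

In the scenario reported here, the expectation is that the BareMetalHost and its credentials secret show up as the remaining objects.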

      Expected results:

      Cluster namespace (and all installation CRs in it) are successfully removed.

      Additional info:

      The installation CRs are applied to and removed from the hub cluster using Argo CD. The CRs have the following sync waves assigned, which determine the creation order (lowest to highest) and the removal order (highest to lowest):
      Namespace: 0
      AgentClusterInstall: 1
      ClusterDeployment: 1
      NMStateConfig: 1
      InfraEnv: 1
      BareMetalHost: 1
      HostFirmwareSettings: 1
      ConfigMap: 1 (extra manifests)
      ManagedCluster: 2
      KlusterletAddonConfig: 2
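
The wave assignments listed above can be checked on the live objects. The sketch below assumes the waves are set through Argo CD's standard `argocd.argoproj.io/sync-wave` annotation; the function name and its arguments are hypothetical, and only namespaced kinds should be passed (ManagedCluster is cluster-scoped).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print kind/name and sync wave for the given kinds.
# Usage: show_sync_waves <namespace> <kind>[,<kind>...]
show_sync_waves() {
  local ns="$1" kinds="$2"

  # Dots inside the annotation key are escaped for jsonpath.
  oc get "$kinds" -n "$ns" \
    -o jsonpath='{range .items[*]}{.kind}{"/"}{.metadata.name}{"\t"}{.metadata.annotations.argocd\.argoproj\.io/sync-wave}{"\n"}{end}'
}
```

For example, `show_sync_waves cnfocto1 baremetalhost,infraenv,clusterdeployment` would be expected to report wave 1 for each object if the table above matches what was applied.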

       


            People

              janders@redhat.com Jacob Anders
              openshift-crt-jira-prow OpenShift Prow Bot
              Dmitry Dmitriev Dmitry Dmitriev