  1. Red Hat Advanced Cluster Management
  2. ACM-10266

[Troubleshoot topic] Workaround for Hypershift Destroy Cluster failing (Agent platform)


      Create an informative issue (See each section, incomplete templates/issues won't be triaged)

      Using the current documentation as a model, please complete the issue template. 

      Note: Doc team updates the current version and the two previous versions (n-2). For earlier versions, we will address only high-priority, customer-reported issues for releases in support.

      Prerequisite: Start with what we have

      Always look at the current documentation to describe the change that is needed. Use the source or portal link for Step 4:

       - Use the Customer Portal: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes

       - Use the GitHub link to find the staged docs in the repository: https://github.com/stolostron/rhacm-docs 

      Describe the changes in the doc and link to your dev story

      Provide info for the following steps:

      1. - [x] Mandatory Add the required version to the Fix version/s field.

      2. - [ ] Mandatory Choose the type of documentation change.

            - [x] New topic in an existing section or new section
            - [ ] Update to an existing topic

      3. - [ ] Mandatory for GA content:
                  
             - [] Add steps and/or other important conceptual information here: 
             
                  
             - [x] Add Required access level for the user to complete the task here: Admin
             

             - [x] Add verification at the end of the task, how does the user verify success (a command to run or a result to see?)
           Verify that all of the HyperShift custom resources (CRs) are removed:

       

      $ oc get hostedcluster -A
      No resources found
      $ oc get hostedcontrolplane -A
      No resources found
      $ oc get nodepool -A
      No resources found
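
The three checks above can be combined into one helper. This is only a sketch under stated assumptions: the function name `hypershift_crs_removed` and the `OC` variable (which defaults to `oc`) are hypothetical names introduced here for illustration, not part of the product.

```shell
#!/bin/sh
# OC is a hypothetical override for the client binary; it defaults to oc.
OC="${OC:-oc}"

# Returns 0 when no HostedCluster, HostedControlPlane, or NodePool CRs remain,
# 1 when at least one remains, and 2 when a query fails.
hypershift_crs_removed() {
  remaining=0
  for kind in hostedcluster hostedcontrolplane nodepool; do
    # -o name prints one line per resource and nothing when none exist.
    names=$("$OC" get "$kind" -A -o name) || return 2
    if [ -n "$names" ]; then
      echo "Remaining $kind resources:"
      echo "$names"
      remaining=1
    fi
  done
  return "$remaining"
}

# Usage (against a live cluster):
#   hypershift_crs_removed && echo "All HyperShift CRs are removed"
```

Returning a distinct code for a failed query keeps an unreachable cluster from being mistaken for a clean one.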

      Verify that all agents are ready to be reused:


      $ oc get agent -A -ojsonpath='{.items[*]}' | jq '{"agent name": .metadata.name, "agentMachineRef label": .metadata.labels["agentMachineRef"], "name": .status.inventory.hostname, "validations": [(.status.conditions[] | (select(.type == "Bound")), select(.type == "Validated") | .type + " " + .status )]}' 

      Example output:

       

       

      {
        "agent name": "173357da-436a-409c-ac41-ba913cee49b1",
        "agentMachineRef label": null,
        "name": "worker-0-2",
        "validations": [
          "Validated True",
          "Bound False"
        ]
      }

      For each agent, the agentMachineRef label should be null, Bound should be False, and Validated should be True.
           
             - [x] Add link to dev story here: https://issues.redhat.com/browse/OCPBUGS-29854

       

      4. - [ ] Mandatory for bugs: What is the diff? Clearly define what the problem is, what the change is, and link to the current documentation:

       

      This is a new troubleshooting topic for ACM 2.10 and might be added to this section: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.9/html/clusters/cluster_mce_overview#troubleshooting-mce

      Suggested title: Troubleshooting failure to destroy hosted control plane clusters (agent platform)

      Symptom

      Destroying a hosted control plane cluster on the agent platform fails with an error similar to the following:

       

      $ hcp destroy cluster agent --name hosted-0 --cluster-grace-period 20m0s
      2024-02-22T09:36:19-05:00    INFO    Found hosted cluster    {"namespace": "clusters", "name": "hosted-0"}
      2024-02-22T09:36:19-05:00    INFO    Deleting hosted cluster    {"namespace": "clusters", "name": "hosted-0"}
      2024-02-22T09:56:19-05:00    ERROR    HostedCluster deletion failed    {"namespace": "clusters", "name": "hosted-0", "error": "context deadline exceeded"}
      github.com/openshift/hypershift/cmd/cluster/core.waitForClusterDeletion
          /home/hypershift/cmd/cluster/core/destroy.go:268
      github.com/openshift/hypershift/cmd/cluster/core.DestroyCluster
          /home/hypershift/cmd/cluster/core/destroy.go:130
      github.com/openshift/hypershift/cmd/cluster/none.DestroyCluster
          /home/hypershift/cmd/cluster/none/destroy.go:47
      github.com/openshift/hypershift/cmd/cluster/agent.DestroyCluster
          /home/hypershift/cmd/cluster/agent/destroy.go:40
      github.com/openshift/hypershift/product-cli/cmd/cluster/agent.NewDestroyCommand.func1
          /home/hypershift/product-cli/cmd/cluster/agent/destroy.go:19
      github.com/spf13/cobra.(*Command).execute
          /home/hypershift/vendor/github.com/spf13/cobra/command.go:983
      github.com/spf13/cobra.(*Command).ExecuteC
          /home/hypershift/vendor/github.com/spf13/cobra/command.go:1115
      github.com/spf13/cobra.(*Command).Execute
          /home/hypershift/vendor/github.com/spf13/cobra/command.go:1039
      github.com/spf13/cobra.(*Command).ExecuteContext
          /home/hypershift/vendor/github.com/spf13/cobra/command.go:1032
      main.main
          /home/hypershift/product-cli/main.go:60
      runtime.main
          /home/hypershift/go/src/runtime/proc.go:267
      2024-02-22T09:56:19-05:00    ERROR    Failed to destroy cluster    {"error": "context deadline exceeded"}
      github.com/openshift/hypershift/product-cli/cmd/cluster/agent.NewDestroyCommand.func1
          /home/hypershift/product-cli/cmd/cluster/agent/destroy.go:20
      github.com/spf13/cobra.(*Command).execute
          /home/hypershift/vendor/github.com/spf13/cobra/command.go:983
      github.com/spf13/cobra.(*Command).ExecuteC
          /home/hypershift/vendor/github.com/spf13/cobra/command.go:1115
      github.com/spf13/cobra.(*Command).Execute
          /home/hypershift/vendor/github.com/spf13/cobra/command.go:1039
      github.com/spf13/cobra.(*Command).ExecuteContext
          /home/hypershift/vendor/github.com/spf13/cobra/command.go:1032
      main.main
          /home/hypershift/product-cli/main.go:60
      runtime.main
          /home/hypershift/go/src/runtime/proc.go:267
      Error: context deadline exceeded
      context deadline exceeded 

      In addition, one or more Machine CRs remain in the Deleting phase, but no AgentMachine CRs exist:

       

      $ oc get machine -A
      NAMESPACE           NAME             CLUSTER          NODENAME   PROVIDERID   PHASE      AGE   VERSION
      clusters-hosted-0   hosted-0-9gg8b   hosted-0-nhdbp                           Deleting   10h   4.15.0-rc.8
      
      $ oc get agentmachine -A
      No resources found
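
The two queries above can be folded into a single check for this symptom. This is a minimal sketch: `stuck_machine_symptom` and the `OC` variable (defaulting to `oc`) are hypothetical names introduced for illustration, and the check only compares whether each list is empty.

```shell
#!/bin/sh
# OC is a hypothetical override for the client binary; it defaults to oc.
OC="${OC:-oc}"

# Returns 0 when Machine CRs remain but no AgentMachine CRs exist,
# 1 when the symptom does not match, and 2 when a query fails.
stuck_machine_symptom() {
  machines=$("$OC" get machine -A -o name) || return 2
  agentmachines=$("$OC" get agentmachine -A -o name) || return 2
  if [ -n "$machines" ] && [ -z "$agentmachines" ]; then
    echo "Machine CRs remain without AgentMachine CRs:"
    echo "$machines"
    return 0
  fi
  return 1
}

# Usage (against a live cluster):
#   stuck_machine_symptom && echo "Symptom matches this topic"
```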

       

       

       

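      Resolving the problem: Manually remove the remaining Machine CRs

      1. Edit each remaining Machine CR and delete the entries under metadata.finalizers. Specify the Machine name and namespace shown in the oc get machine -A output:
      $ oc edit machine hosted-0-9gg8b -n clusters-hosted-0
      
      2. Re-run the hcp destroy cluster command:
      $ hcp destroy cluster agent --name hosted-0 --cluster-grace-period 20m0s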
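
The finalizer removal can also be done without an interactive editor. This is a minimal sketch, assuming that clearing every finalizer on the stuck Machine CR is acceptable. `remove_machine_finalizers` and the `OC` variable are hypothetical names, and the merge patch is a general `oc patch` technique, not something the original report prescribes.

```shell
#!/bin/sh
# OC is a hypothetical override for the client binary; it defaults to oc.
OC="${OC:-oc}"

# Clears metadata.finalizers on one Machine CR.
#   $1 = namespace of the Machine CR (for example, clusters-hosted-0)
#   $2 = name of the Machine CR (for example, hosted-0-9gg8b)
remove_machine_finalizers() {
  # A merge patch that sets finalizers to null removes the whole list.
  "$OC" patch machine "$2" -n "$1" --type=merge \
    -p '{"metadata":{"finalizers":null}}'
}

# Usage (against a live cluster):
#   remove_machine_finalizers clusters-hosted-0 hosted-0-9gg8b
```

Clearing finalizers skips whatever cleanup the owning controller was waiting to perform, so apply it only to Machine CRs that are already stuck in the Deleting phase.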

              sdudhgao@redhat.com Servesha Dudhgaonkar
              cchun@redhat.com Crystal Chun
              David Huynh David Huynh