  1. OpenShift Bugs
  2. OCPBUGS-35556

Cluster deployment hangs in Deprovisioning state while the cluster deployment is being deleted

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • 4.14.z
    • Hive
    • None
    • No
    • False

      Description of problem:

      They hit a problem while deleting the cluster deployment. The cluster deployment gets stuck in the Deprovisioning state and does not progress. When they checked the uninstall pod, it had gone into CrashLoopBackOff with the following logs:

      ~~~
      time="2024-06-14T06:19:12Z" level=debug msg="Couldn't find install logs provider environment variable. Skipping."
      time="2024-06-14T06:19:12Z" level=debug msg="no additional log fields found"
      time="2024-06-14T06:19:12Z" level=info msg="running file observer" files="[/.azure/osServicePrincipal.json]"
      I0614 06:19:12.068091 1 observer_polling.go:159] Starting file observer
      time="2024-06-14T06:19:12Z" level=info msg="Using loaded object" name=tst-we-int04a-azure-creds namespace=tst-we-int04a type="*v1.Secret"
      time="2024-06-14T06:19:12Z" level=fatal msg="Failed to write file" error="open /.azure/osServicePrincipal.json: permission denied" path=/.azure/osServicePrincipal.json
      ~~~
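To pick the failing entry out of a longer uninstall pod log, the lines can be filtered on `level=fatal`. The following is a minimal sketch, assuming the log is in the logrus text format shown above (space-separated `key=value` pairs with optional double quotes); the helper names `parse_logrus_line` and `fatal_entries` are hypothetical and not part of Hive.

```python
import shlex


def parse_logrus_line(line):
    """Split one logrus text-format line into a dict of its key=value fields.

    Lines that are not in key=value form (for example klog lines such as
    'I0614 ... Starting file observer') yield an empty dict.
    """
    try:
        tokens = shlex.split(line)  # keeps quoted values together, drops the quotes
    except ValueError:
        return {}  # unbalanced quotes: not a line this sketch can parse
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def fatal_entries(log_text):
    """Return the parsed fields of every line logged at level=fatal."""
    parsed = (parse_logrus_line(line) for line in log_text.splitlines())
    return [fields for fields in parsed if fields.get("level") == "fatal"]


# Three lines copied from the log excerpt in this report.
SAMPLE = r'''
time="2024-06-14T06:19:12Z" level=info msg="running file observer" files="[/.azure/osServicePrincipal.json]"
I0614 06:19:12.068091 1 observer_polling.go:159] Starting file observer
time="2024-06-14T06:19:12Z" level=fatal msg="Failed to write file" error="open /.azure/osServicePrincipal.json: permission denied" path=/.azure/osServicePrincipal.json
'''

if __name__ == "__main__":
    for entry in fatal_entries(SAMPLE):
        print(entry["time"], entry["msg"], "|", entry["error"])
```

On the excerpt above this reports a single fatal entry, the failed write to `/.azure/osServicePrincipal.json`.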

      • Both the install and uninstall jobs run with the same SCC, and they are able to create the same file under /.azure from a debug pod.
      • The pod kept restarting and the issue eventually resolved itself. They did not change anything. The event details show that after the "DeadlineExceeded" warning the job created another pod, the same thing happened 2-3 times, and finally the job completed, which took care of the uninstall (deprovisioning).
      • They used the following commands:

      ~~~
      #oc delete clusterdeployment -n CLUSTER_NAME CLUSTER_NAME
      #oc wait --for=delete -n CLUSTER_NAME clusterdeployment CLUSTER_NAME
      ~~~

      These are the events:

      ~~~
      168m Normal SuccessfulDelete job/tst-we-int04a-uninstall Deleted pod: tst-we-int04a-uninstall-kfqpm
      168m Warning DeadlineExceeded job/tst-we-int04a-uninstall Job was active longer than specified deadline
      168m Normal SuccessfulCreate job/tst-we-int04a-uninstall Created pod: tst-we-int04a-uninstall-p8knc
      108m Normal SuccessfulDelete job/tst-we-int04a-uninstall Deleted pod: tst-we-int04a-uninstall-p8knc
      108m Warning DeadlineExceeded job/tst-we-int04a-uninstall Job was active longer than specified deadline
      108m Normal SuccessfulCreate job/tst-we-int04a-uninstall Created pod: tst-we-int04a-uninstall-nsh2g
      50m Normal Completed job/tst-we-int04a-uninstall Job completed
      ~~~
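The retry pattern in these events can be summarised by counting the `DeadlineExceeded` warnings and the time between them. The sketch below assumes every age is given in whole minutes (`168m`, `108m`, ...) as in the excerpt; because these relative ages are rounded, the computed gaps are approximate. The function names are hypothetical.

```python
def parse_events(text):
    """Parse event lines of the form '<age>m <type> <reason> <object> <message>'."""
    events = []
    for line in text.splitlines():
        parts = line.split(None, 4)
        # Skip blank lines and anything whose age is not in whole minutes.
        if len(parts) < 4 or not parts[0].endswith("m") or not parts[0][:-1].isdigit():
            continue
        events.append({
            "age_min": int(parts[0][:-1]),
            "type": parts[1],
            "reason": parts[2],
            "object": parts[3],
            "message": parts[4] if len(parts) > 4 else "",
        })
    return events


def deadline_summary(events):
    """Return the ages of DeadlineExceeded events and the gaps between them."""
    ages = [e["age_min"] for e in events if e["reason"] == "DeadlineExceeded"]
    gaps = [older - newer for older, newer in zip(ages, ages[1:])]
    return ages, gaps


# Events copied from this report.
EVENTS = """
168m Normal SuccessfulDelete job/tst-we-int04a-uninstall Deleted pod: tst-we-int04a-uninstall-kfqpm
168m Warning DeadlineExceeded job/tst-we-int04a-uninstall Job was active longer than specified deadline
168m Normal SuccessfulCreate job/tst-we-int04a-uninstall Created pod: tst-we-int04a-uninstall-p8knc
108m Normal SuccessfulDelete job/tst-we-int04a-uninstall Deleted pod: tst-we-int04a-uninstall-p8knc
108m Warning DeadlineExceeded job/tst-we-int04a-uninstall Job was active longer than specified deadline
108m Normal SuccessfulCreate job/tst-we-int04a-uninstall Created pod: tst-we-int04a-uninstall-nsh2g
50m Normal Completed job/tst-we-int04a-uninstall Job completed
"""

if __name__ == "__main__":
    ages, gaps = deadline_summary(parse_events(EVENTS))
    print("DeadlineExceeded at (minutes ago):", ages)
    print("Gaps between them (minutes):", gaps)
```

For this excerpt the two `DeadlineExceeded` warnings are about 60 minutes apart, and the job completed roughly 58 minutes after the last one. That spacing would be consistent with a job deadline of around an hour, but this is an inference from rounded ages, not something the events state.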

      I don't see any errors in the hive-controller pod.
      The OCP version is 4.14 and the ACM version is 2.8.

              Unassigned
              rhn-support-mlele Mihir Lele
              Jianping Shu