Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-32091

CAPI-Installer leaks processes during unsuccessful installs


    • Critical
    • No
    • Rejected
    • False
    • Hide


    • Release Note Not Required
    • In Progress

      Description of problem:

      If the installer using cluster api exits before bootstrap destroy, it may leak processes which continue to run in the background of the host system. These processes may continue to reconcile cloud resources, so the cluster resources would be created and recreated even when you are trying to delete them. 
      This occurs because the installer runs kube-apiserver, etcd, and the capi provider binaries as subprocesses. If the installer exits without shutting down those subprocesses, due to an error or user interrupt, the processes will continue to run in the background.
      The processes can be identified with the ps command. pgrep and pkill are also useful.
      Brief discussion here of this occurring in PowerVS: https://redhat-internal.slack.com/archives/C05QFJN2BQW/p1712688922574429

      Version-Release number of selected component (if applicable):


      How reproducible:


      Steps to Reproduce:

          1. Run capi-based install (on any platform), by specifying fields below in the install config [0]
          2. Wait until CAPI controllers begin to run. This will be easy to identify because the terminal will fill with controller logs. Particularly you should see [1]
          3. Once the controllers are running interrupt with CTRL + C
      [0] Install config for capi install
      - ClusterAPIInstall=true
      featureSet: CustomNoUpgrade    
      [1] INFO Started local control plane with envtest     
      INFO Stored kubeconfig for envtest in: /c/auth/envtest.kubeconfig 
      INFO Running process: Cluster API with args [-v=2 --metrics-bind-addr=0 --


      Actual results:

          controllers will leak and continue to run. They can be viewed with ps or pgrep
      You may also see INFO Shutting down local Cluster API control plane... 
      That means the Shutdown started but did not complete.

      Expected results:

          The installer should shutdown gracefully and not leak processes, such as:
      ^CWARNING Received interrupt signal                    
      INFO Shutting down local Cluster API control plane... 
      INFO Stopped controller: Cluster API              
      INFO Stopped controller: aws infrastructure provider 
      ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create infrastructure manifest: Post "": unexpected EOF 
      INFO Local Cluster API system has completed operations 

      Additional info:


            padillon Patrick Dillon
            padillon Patrick Dillon
            Gaoyun Pei Gaoyun Pei
            0 Vote for this issue
            6 Start watching this issue