[CAPI Azure] CAPI processes are still running after the installer fails to start cluster-api-provider-azureaso and exits

    • Important
    • None
    • Proposed
    • False
      * Previously, some of the processes remained running after the Installer stopped due to setup failures. With this release, all installation processes stop when the Installer stops running. (link:https://issues.redhat.com/browse/OCPBUGS-36890[*OCPBUGS-36890*])
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-36378. The following is the description of the original issue:

      Description of problem:

      When creating a cluster with a service principal certificate, the installer exits with an error, as described in known issue OCPBUGS-36360.
      
      # ./openshift-install create cluster --dir ipi6 
      INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json" 
      WARNING Using client certs to authenticate. Please be warned cluster does not support certs and only the installer does. 
      INFO Consuming Install Config from target directory 
      WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. 
      INFO Creating infrastructure resources...         
      INFO Started local control plane with envtest     
      INFO Stored kubeconfig for envtest in: /tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig 
      WARNING Using client certs to authenticate. Please be warned cluster does not support certs and only the installer does. 
      INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig] 
      INFO Running process: azure infrastructure provider with args [-v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig] 
      INFO Running process: azureaso infrastructure provider with args [-v=0 -metrics-addr=0 -health-addr=127.0.0.1:45179 -webhook-port=37401 -webhook-cert-dir=/tmp/envtest-serving-certs-1364466879 -crd-pattern= -crd-management=none] 
      ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to run cluster api system: failed to run controller "azureaso infrastructure provider": failed to start controller "azureaso infrastructure provider": timeout waiting for process cluster-api-provider-azureaso to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready) 
      INFO Shutting down local Cluster API control plane... 
      INFO Local Cluster API system has completed operations 
      
      According to the output, the local Cluster API system was shut down. However, the process list shows that only the parent installer process exited; the CAPI-related processes are still running.
      
      While the local control plane is running:
      # ps -ef|grep cluster | grep -v grep
      root       13355    6900 39 08:07 pts/1    00:00:13 ./openshift-install create cluster --dir ipi6
      root       13365   13355  2 08:08 pts/1    00:00:00 ipi6/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:41341 --data-dir=ipi6/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:41341 --listen-peer-urls=http://127.0.0.1:34081 --unsafe-no-fsync=true
      root       13373   13355 55 08:08 pts/1    00:00:10 ipi6/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_50606349 --client-ca-file=/tmp/k8s_test_framework_50606349/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:41341 --secure-port=38483 --service-account-issuer=https://127.0.0.1:38483/ --service-account-key-file=/tmp/k8s_test_framework_50606349/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_50606349/sa-signer.key --service-cluster-ip-range=10.0.0.0/24
      root       13385   13355  0 08:08 pts/1    00:00:00 ipi6/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig
      root       13394   13355  6 08:08 pts/1    00:00:00 ipi6/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig
      
      After the installer exited:
      # ps -ef|grep cluster | grep -v grep
      root       13365       1  1 08:08 pts/1    00:00:01 ipi6/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:41341 --data-dir=ipi6/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:41341 --listen-peer-urls=http://127.0.0.1:34081 --unsafe-no-fsync=true
      root       13373       1 45 08:08 pts/1    00:00:35 ipi6/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_50606349 --client-ca-file=/tmp/k8s_test_framework_50606349/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:41341 --secure-port=38483 --service-account-issuer=https://127.0.0.1:38483/ --service-account-key-file=/tmp/k8s_test_framework_50606349/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_50606349/sa-signer.key --service-cluster-ip-range=10.0.0.0/24
      root       13385       1  0 08:08 pts/1    00:00:00 ipi6/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig
      root       13394       1  0 08:08 pts/1    00:00:00 ipi6/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig
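
The two listings above differ mainly in the PPID column: the CAPI processes move from parent 13355 (the installer) to parent 1 once the installer exits. A minimal sketch of how that check could be automated, assuming the `ps -ef` column layout shown above (PID in column 2, PPID in column 3, command in column 8). The function name `list_orphaned_capi` is hypothetical and is not part of openshift-install.

```shell
#!/bin/sh
# Hypothetical helper (not part of openshift-install): reads `ps -ef` output
# on stdin and prints "PID COMMAND" for every process that has been
# re-parented to PID 1 and whose binary path contains a cluster-api/ directory.
# Assumes the STIME column is a single field, as in the listings above.
list_orphaned_capi() {
  awk '$3 == 1 && index($8, "/cluster-api/") { print $2, $8 }'
}

# Usage against the live process table:
#   ps -ef | list_orphaned_capi
```

An empty result after the installer exits would match the expected behaviour; any line printed is a leftover process like the ones listed above.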
      
      
      In another scenario, the CAPI-based installer was run on a host with a small disk. The installer hung and did not exit until it was interrupted with <Ctrl> + C. Afterwards, all CAPI-related processes were still running; only the installer process had been killed.
      
      [root@jima09id-vm-1 jima]# ./openshift-install create cluster --dir ipi4
      INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json" 
      INFO Consuming Install Config from target directory 
      WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. 
      INFO Creating infrastructure resources...         
      INFO Started local control plane with envtest     
      INFO Stored kubeconfig for envtest in: /tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig 
      INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:42017 --webhook-port=41085 --webhook-cert-dir=/tmp/envtest-serving-certs-1774658110 --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig] 
      INFO Running process: azure infrastructure provider with args [-v=2 --health-addr=127.0.0.1:38387 --webhook-port=37783 --webhook-cert-dir=/tmp/envtest-serving-certs-1319713198 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig] 
      FATAL failed to extract "ipi4/cluster-api/cluster-api-provider-azureaso": write ipi4/cluster-api/cluster-api-provider-azureaso: no space left on device 
      ^CWARNING Received interrupt signal                    
      ^C[root@jima09id-vm-1 jima]#
      [root@jima09id-vm-1 jima]# ps -ef|grep cluster | grep -v grep
      root       12752       1  0 07:38 pts/1    00:00:00 ipi4/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:38889 --data-dir=ipi4/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:38889 --listen-peer-urls=http://127.0.0.1:38859 --unsafe-no-fsync=true
      root       12760       1  4 07:38 pts/1    00:00:09 ipi4/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_3790461974 --client-ca-file=/tmp/k8s_test_framework_3790461974/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:38889 --secure-port=44429 --service-account-issuer=https://127.0.0.1:44429/ --service-account-key-file=/tmp/k8s_test_framework_3790461974/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_3790461974/sa-signer.key --service-cluster-ip-range=10.0.0.0/24
      root       12769       1  0 07:38 pts/1    00:00:00 ipi4/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:42017 --webhook-port=41085 --webhook-cert-dir=/tmp/envtest-serving-certs-1774658110 --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig
      root       12781       1  0 07:38 pts/1    00:00:00 ipi4/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:38387 --webhook-port=37783 --webhook-cert-dir=/tmp/envtest-serving-certs-1319713198 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig
      root       12851    6900  1 07:41 pts/1    00:00:00 ./openshift-install destroy cluster --dir ipi4
       

      Version-Release number of selected component (if applicable):

         4.17 nightly build 

      How reproducible:

          Always

      Steps to Reproduce:

          1. Run the CAPI-based installer.
          2. The installer fails to start one of the CAPI processes and exits.
          3. Check the process list (ps -ef|grep cluster | grep -v grep).
          

      Actual results:

          The installer process exits, but the CAPI-related processes are still running.

      Expected results:

          The installer and all CAPI-related processes exit.
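
The expected behaviour can be illustrated with a small shell analogue. This is only a sketch of the general pattern (a parent that stops every child it started, on every exit path); it is not the installer's actual fix, which is not shown in this report. The function name `run_with_children` is hypothetical, and the `sleep` commands stand in for etcd, kube-apiserver, and the CAPI controllers.

```shell
#!/bin/sh
# Sketch only: a parent that always stops the background processes it
# started, whether it exits normally, fails, or is interrupted.
run_with_children() {
  (
    pids=""
    cleanup() {
      # Terminate every child that may still be alive, then reap them.
      for pid in $pids; do
        kill "$pid" 2>/dev/null || true
      done
      wait 2>/dev/null || true
    }
    trap cleanup EXIT          # runs on every exit path of this subshell
    trap 'exit 130' INT TERM   # turn an interrupt into a normal exit

    # Stand-ins for the local control plane and controllers.
    sleep 300 >/dev/null 2>&1 &
    pids="$pids $!"
    sleep 300 >/dev/null 2>&1 &
    pids="$pids $!"
    printf '%s\n' $pids        # report child PIDs so a caller can verify

    # Simulate a later setup step failing, as in the report.
    exit 1
  )
}
```

Calling `run_with_children` prints the two child PIDs and returns a non-zero status; checking those PIDs afterwards with `kill -0` should fail, which is the shell equivalent of an empty `ps -ef|grep cluster` listing.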

      Additional info:

       

       

              rdossant Rafael Fonseca dos Santos
              openshift-crt-jira-prow OpenShift Prow Bot
              Jinyun Ma Jinyun Ma