[OCPBUGS-36378] [CAPI Azure] capi processes are still running when installer failed to start cluster-api-provider-azureaso and exited - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.17
Component/s: Installer / openshift-installer
Labels:
None

Severity:
Important
Regression:
None
Epic Link:
Azure CAPI Install
Release Blocker:
Proposed
Blocked:
True
Blocked Reason:

None
Release Note Text:

* Previously, some processes could be left running if the installation program exited due to infrastructure provisioning failures. With this update, all installation-related processes are terminated when the installation program terminates. (link:https://issues.redhat.com/browse/OCPBUGS-36378[*OCPBUGS-36378*])

Release Note Type:
Bug Fix
Release Note Status:
Done
Target Version:

4.17.0
Target Backport Versions:

4.16.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

When creating a cluster with a service principal certificate, the installer exits with an error because of the known issue OCPBUGS-36360.

# ./openshift-install create cluster --dir ipi6 
INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json" 
WARNING Using client certs to authenticate. Please be warned cluster does not support certs and only the installer does. 
INFO Consuming Install Config from target directory 
WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. 
INFO Creating infrastructure resources...         
INFO Started local control plane with envtest     
INFO Stored kubeconfig for envtest in: /tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig 
WARNING Using client certs to authenticate. Please be warned cluster does not support certs and only the installer does. 
INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig] 
INFO Running process: azure infrastructure provider with args [-v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig] 
INFO Running process: azureaso infrastructure provider with args [-v=0 -metrics-addr=0 -health-addr=127.0.0.1:45179 -webhook-port=37401 -webhook-cert-dir=/tmp/envtest-serving-certs-1364466879 -crd-pattern= -crd-management=none] 
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to run cluster api system: failed to run controller "azureaso infrastructure provider": failed to start controller "azureaso infrastructure provider": timeout waiting for process cluster-api-provider-azureaso to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready) 
INFO Shutting down local Cluster API control plane... 
INFO Local Cluster API system has completed operations 

According to the output, the local Cluster API system was shut down. However, checking the process list shows that only the parent installer process exited; the CAPI-related processes are still running.

While the local control plane is running:
# ps -ef|grep cluster | grep -v grep
root       13355    6900 39 08:07 pts/1    00:00:13 ./openshift-install create cluster --dir ipi6
root       13365   13355  2 08:08 pts/1    00:00:00 ipi6/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:41341 --data-dir=ipi6/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:41341 --listen-peer-urls=http://127.0.0.1:34081 --unsafe-no-fsync=true
root       13373   13355 55 08:08 pts/1    00:00:10 ipi6/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_50606349 --client-ca-file=/tmp/k8s_test_framework_50606349/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:41341 --secure-port=38483 --service-account-issuer=https://127.0.0.1:38483/ --service-account-key-file=/tmp/k8s_test_framework_50606349/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_50606349/sa-signer.key --service-cluster-ip-range=10.0.0.0/24
root       13385   13355  0 08:08 pts/1    00:00:00 ipi6/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig
root       13394   13355  6 08:08 pts/1    00:00:00 ipi6/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig

After the installer exited:
# ps -ef|grep cluster | grep -v grep
root       13365       1  1 08:08 pts/1    00:00:01 ipi6/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:41341 --data-dir=ipi6/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:41341 --listen-peer-urls=http://127.0.0.1:34081 --unsafe-no-fsync=true
root       13373       1 45 08:08 pts/1    00:00:35 ipi6/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_50606349 --client-ca-file=/tmp/k8s_test_framework_50606349/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:41341 --secure-port=38483 --service-account-issuer=https://127.0.0.1:38483/ --service-account-key-file=/tmp/k8s_test_framework_50606349/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_50606349/sa-signer.key --service-cluster-ip-range=10.0.0.0/24
root       13385       1  0 08:08 pts/1    00:00:00 ipi6/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:36847 --webhook-port=38905 --webhook-cert-dir=/tmp/envtest-serving-certs-941163289 --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig
root       13394       1  0 08:08 pts/1    00:00:00 ipi6/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:44743 --webhook-port=35373 --webhook-cert-dir=/tmp/envtest-serving-certs-3807817663 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi6/.clusterapi_output/envtest.kubeconfig
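
In the listing above, the leftover processes show up with a parent PID of 1. As a minimal sketch of how such a check could be automated, the following Python helper filters `ps -ef` output for processes that have been reparented to PID 1. The function name `find_orphans` and the default `cluster-api/` path marker are illustrative assumptions, not part of the installer, and on hosts that use a subreaper the new parent PID may not be 1.

```python
def find_orphans(ps_output: str, marker: str = "cluster-api/") -> list[tuple[int, str]]:
    """Return (pid, command) for processes whose parent PID is 1
    and whose command line contains `marker`."""
    orphans = []
    for line in ps_output.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit() or not fields[2].isdigit():
            continue  # skip the header row and malformed lines
        pid, ppid, cmd = int(fields[1]), int(fields[2]), fields[7]
        if ppid == 1 and marker in cmd:
            orphans.append((pid, cmd))
    return orphans
```

Feeding it the captured output of `ps -ef` after the installer exits should return the etcd, kube-apiserver, cluster-api and cluster-api-provider-azure entries shown above, and an empty list once the processes are cleaned up correctly.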


In another scenario, the CAPI-based installer was run on a machine with a small disk. The installer failed with "no space left on device", then hung and did not exit until it was interrupted with <Ctrl> + C. Afterwards, all CAPI-related processes were still running; only the installer process had been killed.

[root@jima09id-vm-1 jima]# ./openshift-install create cluster --dir ipi4
INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json" 
INFO Consuming Install Config from target directory 
WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster. 
INFO Creating infrastructure resources...         
INFO Started local control plane with envtest     
INFO Stored kubeconfig for envtest in: /tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig 
INFO Running process: Cluster API with args [-v=2 --diagnostics-address=0 --health-addr=127.0.0.1:42017 --webhook-port=41085 --webhook-cert-dir=/tmp/envtest-serving-certs-1774658110 --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig] 
INFO Running process: azure infrastructure provider with args [-v=2 --health-addr=127.0.0.1:38387 --webhook-port=37783 --webhook-cert-dir=/tmp/envtest-serving-certs-1319713198 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig] 
FATAL failed to extract "ipi4/cluster-api/cluster-api-provider-azureaso": write ipi4/cluster-api/cluster-api-provider-azureaso: no space left on device 
^CWARNING Received interrupt signal                    
^C[root@jima09id-vm-1 jima]#
[root@jima09id-vm-1 jima]# ps -ef|grep cluster | grep -v grep
root       12752       1  0 07:38 pts/1    00:00:00 ipi4/cluster-api/etcd --advertise-client-urls=http://127.0.0.1:38889 --data-dir=ipi4/.clusterapi_output/etcd --listen-client-urls=http://127.0.0.1:38889 --listen-peer-urls=http://127.0.0.1:38859 --unsafe-no-fsync=true
root       12760       1  4 07:38 pts/1    00:00:09 ipi4/cluster-api/kube-apiserver --allow-privileged=true --authorization-mode=RBAC --bind-address=127.0.0.1 --cert-dir=/tmp/k8s_test_framework_3790461974 --client-ca-file=/tmp/k8s_test_framework_3790461974/client-cert-auth-ca.crt --disable-admission-plugins=ServiceAccount --etcd-servers=http://127.0.0.1:38889 --secure-port=44429 --service-account-issuer=https://127.0.0.1:44429/ --service-account-key-file=/tmp/k8s_test_framework_3790461974/sa-signer.crt --service-account-signing-key-file=/tmp/k8s_test_framework_3790461974/sa-signer.key --service-cluster-ip-range=10.0.0.0/24
root       12769       1  0 07:38 pts/1    00:00:00 ipi4/cluster-api/cluster-api -v=2 --diagnostics-address=0 --health-addr=127.0.0.1:42017 --webhook-port=41085 --webhook-cert-dir=/tmp/envtest-serving-certs-1774658110 --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig
root       12781       1  0 07:38 pts/1    00:00:00 ipi4/cluster-api/cluster-api-provider-azure -v=2 --health-addr=127.0.0.1:38387 --webhook-port=37783 --webhook-cert-dir=/tmp/envtest-serving-certs-1319713198 --feature-gates=MachinePool=false --kubeconfig=/tmp/jima/ipi4/.clusterapi_output/envtest.kubeconfig
root       12851    6900  1 07:41 pts/1    00:00:00 ./openshift-install destroy cluster --dir ipi4

Version-Release number of selected component (if applicable):

   4.17 nightly build

How reproducible:

    Always

Steps to Reproduce:

    1. Run the CAPI-based installer.
    2. The installer fails to start one of the CAPI processes and exits.
    3. Check the running processes with: ps -ef|grep cluster | grep -v grep

Actual results:

    The installer process exits, but the CAPI-related processes are still running.

Expected results:

    Both the installer and all CAPI-related processes exit.
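
One general way to get this behaviour is to put the cleanup in place before any child process is started, so that a failure partway through startup still stops the children that are already running. The following is a minimal Python sketch of that pattern only; it is not the installer's Go implementation, and the `ChildGroup` class and the commands passed to it are hypothetical.

```python
import subprocess


class ChildGroup:
    """Track child processes and stop all of them when the block exits,
    whether it exits normally, on an error, or on Ctrl+C."""

    def __init__(self, timeout: float = 5.0):
        self.procs = []
        self.timeout = timeout

    def __enter__(self):
        return self

    def start(self, cmd):
        # Popen raises OSError (e.g. FileNotFoundError) if the binary cannot be started.
        proc = subprocess.Popen(cmd)
        self.procs.append(proc)
        return proc

    def __exit__(self, exc_type, exc, tb):
        # Ask every running child to terminate (SIGTERM on POSIX).
        for proc in self.procs:
            if proc.poll() is None:
                proc.terminate()
        # Wait for each child, escalating to kill() if it does not exit in time.
        for proc in self.procs:
            try:
                proc.wait(timeout=self.timeout)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
        return False  # never swallow the original exception
```

Used as `with ChildGroup() as group:` followed by one `group.start([...])` call per process, a failed `start` for a later process still terminates the earlier ones before the error propagates. The sketch does not cover the parent being killed with SIGKILL, where no cleanup code can run.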

Additional info:

blocks

OCPBUGS-36890 [CAPI Azure] capi processes are still running when installer failed to start cluster-api-provider-azureaso and exited

Closed

is cloned by

OCPBUGS-36890 [CAPI Azure] capi processes are still running when installer failed to start cluster-api-provider-azureaso and exited

Closed

links to

openshift/installer#8693: OCPBUGS-36378: capi: start controllers after WaitGroup is created

RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update

Assignee: Rafael Fonseca dos Santos

Reporter: Jinyun Ma

QA Contact: Jinyun Ma

Votes: 0

Watchers: 4

Created: 2024/07/01 9:30 AM

Updated: 2024/10/01 5:38 PM

Resolved: 2024/10/01 5:38 PM
