  1. OpenShift Bugs
  2. OCPBUGS-26020

Control plane machine set operator (CPMSO) hangs while rolling out control plane nodes because the new etcd pod fails with Init:CrashLoopBackOff


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Normal
    • None
    • 4.13.z
    • Etcd
    • None
    • Important
    • No
    • False

      Description of problem:

      The control plane machine set operator (CPMSO) hangs while rolling out control plane nodes because the new etcd pod fails with Init:CrashLoopBackOff.
      
      The error comes from the init container "etcd-ensure-env-vars": the environment variables it checks for are missing from the deployed static pod YAML. For example:
      
      ~~~
        etcd-ensure-env-vars:
      ...
          Command:
            /bin/sh
            -c
            #!/bin/sh
            set -euo pipefail
            
            : "${NODE_rugouvei_cluster1_bkw7w_master_kxp8p_1_ETCD_URL_HOST?not set}"
            : "${NODE_rugouvei_cluster1_bkw7w_master_kxp8p_1_ETCD_NAME?not set}"
            : "${NODE_rugouvei_cluster1_bkw7w_master_kxp8p_1_IP?not set}"
      ...  
          State:      Terminated
            Reason:   Error
            Message:  /bin/sh: line 4: NODE_rugouvei_cluster1_bkw7w_master_kxp8p_1_ETCD_URL_HOST: not set
      ~~~
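
The init container relies on the shell expansion `${VAR?message}`, which aborts the script when `VAR` is unset. A minimal sketch of the same check follows. It assumes the variable prefix is derived from the node name by replacing hyphens with underscores, which matches the names in this report but is not confirmed against the operator source. The helper names `node_env_prefix` and `check_node_env` are hypothetical, and node names containing characters other than letters, digits, and hyphens are not handled.

```shell
#!/bin/sh
# Hypothetical helper: derive the NODE_* prefix from a node name by
# replacing every "-" with "_" (assumption based on the names in this report).
node_env_prefix() {
    printf 'NODE_%s\n' "$(printf '%s' "$1" | tr '-' '_')"
}

# Hypothetical helper: return 0 only if the three per-node variables
# checked by the init container are set in the current shell.
check_node_env() {
    prefix=$(node_env_prefix "$1")
    for suffix in ETCD_URL_HOST ETCD_NAME IP; do
        # ${name+x} expands to "x" only when the variable is set.
        eval "is_set=\${${prefix}_${suffix}+x}"
        if [ -z "$is_set" ]; then
            echo "${prefix}_${suffix}: not set" >&2
            return 1
        fi
    done
    return 0
}
```

Calling `check_node_env rugouvei-cluster1-bkw7w-master-kxp8p-1` in a shell that has only the variables for master-1 and master-2 fails on the first variable, `..._ETCD_URL_HOST`, which is the same one named in the container's termination message.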
      
      The error message is accurate. The only "NODE_*" variables defined in the pod environment are:
      
      ~~~
            NODE_rugouvei_cluster1_bkw7w_master_1_ETCD_NAME:      rugouvei-cluster1-bkw7w-master-1
            NODE_rugouvei_cluster1_bkw7w_master_1_ETCD_URL_HOST:  10.44.135.201
            NODE_rugouvei_cluster1_bkw7w_master_1_IP:             10.44.135.201
            NODE_rugouvei_cluster1_bkw7w_master_2_ETCD_NAME:      rugouvei-cluster1-bkw7w-master-2
            NODE_rugouvei_cluster1_bkw7w_master_2_ETCD_URL_HOST:  10.44.135.240
            NODE_rugouvei_cluster1_bkw7w_master_2_IP:             10.44.135.240
            NODE_IP:                                               (v1:status.podIP)
      ~~~
      
      The first machine/node to be deleted was "rugouvei-cluster1-bkw7w-master-0". The new node "rugouvei-cluster1-bkw7w-master-kxp8p-1" is not in the list, which causes the init container to fail.
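
The comparison above can be repeated mechanically. The sketch below prints every node that has no matching `NODE_<name>_ETCD_NAME` entry in a saved environment listing. The helper name `missing_etcd_env` is hypothetical, and the hyphen-to-underscore mapping is an assumption taken from the names in this report.

```shell
#!/bin/sh
# Hypothetical helper: print each node name whose NODE_<name>_ETCD_NAME
# variable does not appear in the environment listing stored in file $1.
# Node names are passed as the remaining arguments.
missing_etcd_env() {
    env_file=$1
    shift
    for node in "$@"; do
        var="NODE_$(printf '%s' "$node" | tr '-' '_')_ETCD_NAME"
        # Accept both "NAME: value" (describe output) and "NAME=value" lines.
        grep -q "^[[:space:]]*${var}[:=]" "$env_file" || printf '%s\n' "$node"
    done
}
```

One possible way to use it: save the output of `oc -n openshift-etcd describe pod <etcd-pod>` to a file, then pass that file along with the node names returned by `oc get nodes -l node-role.kubernetes.io/master= -o name` (with the `node/` prefix stripped).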
        

      Version-Release number of selected component (if applicable):

      Reproduced on a freshly installed VMware IPI cluster running 4.13.23 in a lab environment.

      Steps to Reproduce:

          1. Install IPI VMware cluster 4.13.23.
          2. Configure CPMSO following documentation: https://docs.openshift.com/container-platform/4.13/machine_management/control_plane_machine_management/cpmso-about.html
          3. Manually delete "master-0".

      Actual results:

      $ ./oc get machines -n openshift-machine-api -l machine.openshift.io/cluster-api-machine-role=master
      NAME                                     PHASE      TYPE   REGION   ZONE   AGE
      rugouvei-cluster1-bkw7w-master-1         Deleting                          5h9m
      rugouvei-cluster1-bkw7w-master-2         Running                           5h9m
      rugouvei-cluster1-bkw7w-master-gx9pd-0   Running                           34m
      rugouvei-cluster1-bkw7w-master-kxp8p-1   Running                           34m
      
      $ ./oc get nodes -l node-role.kubernetes.io/master=
      NAME                                     STATUS   ROLES                  AGE    VERSION
      rugouvei-cluster1-bkw7w-master-1         Ready    control-plane,master   5h7m   v1.26.9+636f2be
      rugouvei-cluster1-bkw7w-master-2         Ready    control-plane,master   5h7m   v1.26.9+636f2be
      rugouvei-cluster1-bkw7w-master-gx9pd-0   Ready    control-plane,master   30m    v1.26.9+636f2be
      rugouvei-cluster1-bkw7w-master-kxp8p-1   Ready    control-plane,master   30m    v1.26.9+636f2be
      
      $ ./oc -n openshift-etcd get pod -l app=etcd --show-labels
      NAME                                          READY   STATUS                  RESTARTS         AGE   LABELS
      etcd-rugouvei-cluster1-bkw7w-master-1         4/4     Running                 0                61m   app=etcd,etcd=true,k8s-app=etcd,revision=9
      etcd-rugouvei-cluster1-bkw7w-master-2         4/4     Running                 0                63m   app=etcd,etcd=true,k8s-app=etcd,revision=9
      etcd-rugouvei-cluster1-bkw7w-master-kxp8p-1   0/4     Init:CrashLoopBackOff   10 (2m42s ago)   29m   app=etcd,etcd=true,k8s-app=etcd,revision=9    

      Expected results:

      The new control plane nodes roll out successfully.

       

            dwest@redhat.com Dean West
            rugouvei@redhat.com Rui Gouveia
            ge liu ge liu