OpenShift Bugs: OCPBUGS-8038

4.12 Cluster Installation Failed with Possible Race Condition


    • Critical
    • No
    • Rejected
    • False



      Description of problem:

      A customer installation of a 4.12.4 cluster failed, possibly because of a race condition. Bootstrapping appears to have failed while the control-plane-machine-set operator (CPMS) was still progressing.

      Version-Release number of selected component (if applicable):

      4.12.4


      A CPD was identified with no immediate cause determined. I used break-glass access to the cluster and checked the cluster operators; all were up and running.

      [dalong@dalong ~]$ oc get co
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.12.4    True        False         False      5h38m   
      baremetal                                  4.12.4    True        False         False      6h20m   
      cloud-controller-manager                   4.12.4    True        False         False      6h24m   
      cloud-credential                           4.12.4    True        False         False      6h24m   
      cluster-autoscaler                         4.12.4    True        False         False      6h20m   
      config-operator                            4.12.4    True        False         False      6h21m   
      console                                    4.12.4    True        False         False      5h39m   
      control-plane-machine-set                  4.12.4    True        False         False      6h18m   
      csi-snapshot-controller                    4.12.4    True        False         False      6h21m   
      dns                                        4.12.4    True        False         False      6h20m   
      etcd                                       4.12.4    True        False         False      5h52m   
      image-registry                             4.12.4    True        False         False      5h45m   
      ingress                                    4.12.4    True        False         False      5h45m   
      insights                                   4.12.4    True        False         False      6h14m   
      kube-apiserver                             4.12.4    True        False         False      5h41m   
      kube-controller-manager                    4.12.4    True        False         False      6h18m   
      kube-scheduler                             4.12.4    True        False         False      6h17m   
      kube-storage-version-migrator              4.12.4    True        False         False      6h21m   
      machine-api                                4.12.4    True        False         False      5h46m   
      machine-approver                           4.12.4    True        False         False      6h20m   
      machine-config                             4.12.4    True        False         False      6h10m   
      marketplace                                4.12.4    True        False         False      6h20m   
      monitoring                                 4.12.4    True        False         False      5h43m   
      network                                    4.12.4    True        False         False      6h23m   
      node-tuning                                4.12.4    True        False         False      6h20m   
      openshift-apiserver                        4.12.4    True        False         False      5h41m   
      openshift-controller-manager               4.12.4    True        False         False      6h16m   
      openshift-samples                          4.12.4    True        False         False      5h47m   
      operator-lifecycle-manager                 4.12.4    True        False         False      6h21m   
      operator-lifecycle-manager-catalog         4.12.4    True        False         False      6h21m   
      operator-lifecycle-manager-packageserver   4.12.4    True        False         False      5h49m   
      service-ca                                 4.12.4    True        False         False      6h21m   
      storage                                    4.12.4    True        False         False      6h20m   
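
      For a listing this long, a small filter makes it easier to confirm that nothing is unhealthy. The following is a minimal sketch, not part of the original investigation: `unhealthy_operators` is a hypothetical helper name, and it assumes the column order shown above (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED).

```shell
# Hypothetical helper: print the names of cluster operators that are not
# Available=True, Progressing=False, Degraded=False.
# Reads rows in the `oc get co` column layout on stdin.
unhealthy_operators() {
  awk '
    $1 == "NAME" { next }   # skip the header row if it is present
    $3 != "True" || $4 != "False" || $5 != "False" { print $1 }
  '
}
```

      It could be used as `oc get co | unhealthy_operators`. For the output above it would print nothing, since every operator reports True/False/False.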

      Network-Verifier was also run against the cluster and passed.

      The machines all looked healthy as well.

      [dalong@dalong ~]$ oc get machines -A
      NAMESPACE               NAME                                       PHASE     TYPE         REGION      ZONE         AGE
      openshift-machine-api   rosa-t7zkl-gslmr-master-1                  Running   m5.2xlarge   eu-west-1   eu-west-1a   6h42m
      openshift-machine-api   rosa-t7zkl-gslmr-master-2                  Running   m5.2xlarge   eu-west-1   eu-west-1a   6h42m
      openshift-machine-api   rosa-t7zkl-gslmr-master-mqx8c-0            Running   m5.2xlarge   eu-west-1   eu-west-1a   6h36m
      openshift-machine-api   rosa-t7zkl-gslmr-worker-eu-west-1a-cl7lf   Running   m5.xlarge    eu-west-1   eu-west-1a   6h37m
      openshift-machine-api   rosa-t7zkl-gslmr-worker-eu-west-1a-hll6m   Running   m5.xlarge    eu-west-1   eu-west-1a   6h37m 
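
      One detail in this listing may be relevant: `rosa-t7zkl-gslmr-master-mqx8c-0` has an extra name segment and is a few minutes younger than the other two masters. The sketch below picks out such machines. It rests on an assumption the report does not confirm: that control plane machines created by CPMS are named `<prefix>-master-<random>-<index>`, while installer-created ones are named `<prefix>-master-<index>`. `cpms_created_masters` is a hypothetical helper name.

```shell
# Hypothetical helper: print control plane machine names that carry an extra
# segment between "-master-" and the trailing index.
# ASSUMPTION: that naming pattern marks machines created by CPMS.
# Reads rows in the `oc get machines -A` column layout on stdin (NAME is column 2).
cpms_created_masters() {
  awk '$2 ~ /-master-[a-z0-9]+-[0-9]+$/ { print $2 }'
}
```

      It could be used as `oc get machines -A | cpms_created_masters`. If the naming assumption holds, a match here would suggest that a control plane replacement was already under way during installation.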


      When bootstrapping failed, CPMS was still progressing, which seems to have caused the failed install.

      time="2023-02-28T09:21:13Z" level=info msg="Cluster operator control-plane-machine-set Progressing is True with ExcessReplicas: Waiting for 1 old replica(s) to be removed" 
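
      To find every operator that was still progressing when the installer gave up, the log can be filtered for lines of this shape. This is a minimal sketch: `progressing_operators` is a hypothetical helper name, and the pattern assumes other operators are logged in the same format as the single line quoted above.

```shell
# Hypothetical helper: extract operator name, reason, and message from
# installer log lines of the form
#   msg="Cluster operator <name> Progressing is True with <reason>: <message>"
# Reads the log on stdin and prints "name | reason | message" per match.
progressing_operators() {
  sed -n 's/.*msg="Cluster operator \([^ ]*\) Progressing is True with \([^:]*\): \(.*\)".*/\1 | \2 | \3/p'
}
```

      Run against the installer log on stdin, the line quoted above would come out as `control-plane-machine-set | ExcessReplicas | Waiting for 1 old replica(s) to be removed`.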

            ddonati@redhat.com Damiano Donati
            dalong.openshift Dakota Long
            Gaoyun Pei Gaoyun Pei
            Dakota Long, Sandhya Dasu
