OCP Upgrade from 4.11 to 4.12 Fails Due to etcd Operator Issues


      Description of problem:

      During the upgrade multijob for OCP, which starts from version 4.10 with the OVNKubernetes network type on OSP 16.2, the upgrade from 4.11 to 4.12 failed. The etcd cluster operator became unavailable and reports that only 2 of 4 members are available. The node ostest-ttvx4-master-2 is in SchedulingDisabled status. In the openshift-etcd namespace, the etcd-ostest-ttvx4-master-0 pod is in Error state and keeps restarting. Its log lists an unstarted member with an empty name and ends with a message that this member's dataDir has been destroyed and the member must be removed from the cluster.

      Version-Release number of selected component (if applicable):

      OCP 4.11.50 to 4.12.36

      How reproducible:


      Steps to Reproduce:

      1. Begin the OCP upgrade process starting from version 4.10
      2. Upgrade from 4.10 to 4.11
      3. Upgrade from 4.11 to 4.12

      Actual results:

      The upgrade fails between versions 4.11 and 4.12 because of the etcd operator. The operator reports Available=False and flags specific etcd members: one member has not started and ostest-ttvx4-master-0 is unhealthy.

      Expected results:

      The upgrade from 4.11 to 4.12 completes without any issues.

      Additional info:

      $ oc get co
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.12.36   True        False         True       4h17m   APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
      baremetal                                  4.12.36   True        False         False      8h      
      csi-snapshot-controller                    4.12.36   True        False         False      8h      
      dns                                        4.12.36   True        False         False      8h      
      etcd                                       4.12.36   False       True          True       4h33m   EtcdMembersAvailable: 2 of 4 members are available, NAME-PENDING- has not started, ostest-ttvx4-master-0 is unhealthy
      machine-approver                           4.12.36   True        False         False      8h      
      machine-config                             4.11.50   True        True          True       6h14m   Unable to apply 4.12.36: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]
      marketplace                                4.12.36   True        False         False      8h      
      monitoring                                 4.12.36   True        False         False      4h15m   
      network                                    4.12.36   True        False         False      8h      
      node-tuning                                4.12.36   True        False         False      5h15m   
      openshift-apiserver                        4.12.36   True        False         True       4h19m   APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
      operator-lifecycle-manager-packageserver   4.12.36   True        False         False      8h      
      service-ca                                 4.12.36   True        False         False      8h      
      storage                                    4.12.36   True        False         False      8h   
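
To make the unhealthy operators in the table above easier to spot, the output can be filtered. The following is only a minimal sketch: the function name `unhealthy_operators` is hypothetical, and it assumes the default column order of `oc get co` shown above (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED).

```shell
# Hypothetical helper: reads `oc get co` output on stdin and prints the name
# of every operator that is not Available or is Degraded.
# Assumes the default column order: NAME VERSION AVAILABLE PROGRESSING DEGRADED.
unhealthy_operators() {
  awk '
    $1 == "NAME" { next }                  # skip the header row
    NF >= 5 && ($3 != "True" || $5 != "False") { print $1 }
  '
}

# Example usage against a live cluster (not run here):
#   oc get co | unhealthy_operators
```

Applied to the table above, this would be expected to list authentication, etcd, machine-config, and openshift-apiserver.
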

      $ oc get pods -n openshift-etcd
      NAME                                       READY   STATUS      RESTARTS         AGE
      etcd-guard-ostest-ttvx4-master-0           0/1     Running     0                4h33m
      etcd-guard-ostest-ttvx4-master-1           1/1     Running     0                4h22m
      etcd-guard-ostest-ttvx4-master-2           1/1     Running     0                5h40m
      etcd-ostest-ttvx4-master-0                 3/4     Error       58 (5m10s ago)   4h25m
      etcd-ostest-ttvx4-master-1                 4/4     Running     0                4h25m
      etcd-ostest-ttvx4-master-2                 4/4     Running     2 (4h28m ago)    4h48m
      installer-25-ostest-ttvx4-master-0         0/1     Completed   0                4h47m
      installer-26-ostest-ttvx4-master-0         0/1     Completed   0                4h44m
      installer-27-ostest-ttvx4-master-0         0/1     Completed   0                4h34m
      revision-pruner-25-ostest-ttvx4-master-0   0/1     Completed   0                4h47m
      revision-pruner-26-ostest-ttvx4-master-0   0/1     Completed   0                4h44m
      revision-pruner-26-ostest-ttvx4-master-1   0/1     Completed   0                4h34m
      revision-pruner-27-ostest-ttvx4-master-0   0/1     Completed   0                4h34m
      revision-pruner-27-ostest-ttvx4-master-1   0/1     Completed   0                4h34m

      $ oc logs etcd-ostest-ttvx4-master-0 -n openshift-etcd
      1a4f2630e5f2296f, unstarted, ,, , true
      2f6c4ca331daa2de, started, ostest-ttvx4-master-2,,, false
      752ca6c9953eff21, started, ostest-ttvx4-master-1,,, false
      a6d1d802202a55e3, started, ostest-ttvx4-master-0,,, false
      #### attempt 0
            member={name="", peerURLs=[}, clientURLs=[]
            member={name="ostest-ttvx4-master-2", peerURLs=[}, clientURLs=[]
            member={name="ostest-ttvx4-master-1", peerURLs=[}, clientURLs=[]
            member={name="ostest-ttvx4-master-0", peerURLs=[}, clientURLs=[]
            target={name="ostest-ttvx4-master-0", peerURLs=[}, clientURLs=[]
      member "" dataDir has been destroyed and must be removed from the cluster
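
The member list at the top of this log can be scanned mechanically for the unstarted member. The sketch below is illustrative only: the function name `unstarted_members` is hypothetical, and it assumes the comma-separated layout seen in the log above, with the member ID in the first field and the status in the second.

```shell
# Hypothetical helper: reads the etcd pod log on stdin and prints the ID of
# every member whose status field is "unstarted".
# Assumes lines of the form "<hex ID>, <status>, <name>, ..." as in the log above.
unstarted_members() {
  awk -F',' '
    {
      id = $1; status = $2
      gsub(/^[ \t]+|[ \t]+$/, "", id)      # trim surrounding whitespace
      gsub(/^[ \t]+|[ \t]+$/, "", status)
      # Member IDs are hexadecimal; this check skips unrelated log lines.
      if (id ~ /^[0-9a-f]+$/ && status == "unstarted") print id
    }
  '
}

# Example usage against a live cluster (not run here):
#   oc logs etcd-ostest-ttvx4-master-0 -n openshift-etcd | unstarted_members
```

For the log above, this would be expected to print 1a4f2630e5f2296f, the member with an empty name that the final log line says must be removed from the cluster.
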

