OpenShift Bugs / OCPBUGS-20122

Failure on OCP Upgrade from 4.11 to 4.12 Due to etcd Operator Issues


Details

    • Critical
    • ShiftStack Sprint 244, ShiftStack Sprint 245, ShiftStack Sprint 246
    • Upgrade to 4.12 on the OpenStack platform could fail when the master nodes were attached to additional networks, due to a known race condition when switching from the in-tree cloud provider to the external cloud provider: during the upgrade there is a short window in which both providers are active at the same time and can report different node IPs. The fix adds an annotation that causes both providers to report the same primary node IP, preventing node IP flapping.
    • Bug Fix
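
      One illustrative way to check for the node IP flapping described in the release note above is to watch a master node's reported addresses and annotations while both cloud providers are active. The node name below is taken from this report and the commands are plain read-only oc calls, not part of the fix itself:

      $ oc get node ostest-ttvx4-master-0 -o jsonpath='{.status.addresses}{"\n"}'
      $ oc get node ostest-ttvx4-master-0 -o jsonpath='{.metadata.annotations}{"\n"}'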

    Description

      Description of problem:

      During the OCP upgrade multijob starting from version 4.10, with the OVNKubernetes network type on OSP 16.2, the upgrade failed on the 4.11 to 4.12 step: the etcd cluster operator became unavailable. The node ostest-ttvx4-master-2 is in SchedulingDisabled status, and in the openshift-etcd namespace the etcd-ostest-ttvx4-master-0 pod is repeatedly reporting errors. The logs point to problems with etcd members and their data directories.
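
      The symptoms above can be confirmed with standard oc commands; the node and pod names are the ones from this report, and the container name assumes the errors come from the etcd container of the static pod:

      $ oc get nodes                                      # ostest-ttvx4-master-2 shows SchedulingDisabled
      $ oc get co etcd                                    # etcd operator reports Available=False
      $ oc get pods -n openshift-etcd                     # etcd-ostest-ttvx4-master-0 is in Error
      $ oc logs -n openshift-etcd etcd-ostest-ttvx4-master-0 -c etcd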

      Version-Release number of selected component (if applicable):

      OCP 4.11.50 to 4.12.36
      RHOS-16.2-RHEL-8-20230510.n.1

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Begin the OCP upgrade process starting from version 4.10.
      2. Upgrade from 4.10 to 4.11.
      3. Upgrade from 4.11 to 4.12 (a sketch of the oc commands follows).
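
      For reference, a minimal sketch of steps 2 and 3 as oc commands, assuming the exact versions from this report; the CI multijob may drive the upgrade differently:

      $ oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.11"}}'
      $ oc adm upgrade --to=4.11.50
      # wait for the 4.11 upgrade to complete, then:
      $ oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.12"}}'
      $ oc adm upgrade --to=4.12.36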
      

      Actual results:

      The upgrade fails on the 4.11 to 4.12 step, pointing to issues with the etcd operator: the operator reports that it is unavailable and that specific etcd members are unhealthy.

      Expected results:

      Smooth upgrade from 4.11 to 4.12 without any issues.

      Additional info:

      $ oc get co
      NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
      authentication 4.12.36 True False True 4h17m APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
      baremetal 4.12.36 True False False 8h
      ...
      ...
      csi-snapshot-controller 4.12.36 True False False 8h
      dns 4.12.36 True False False 8h
      etcd 4.12.36 False True True 4h33m EtcdMembersAvailable: 2 of 4 members are available, NAME-PENDING-172.17.5.228 has not started, ostest-ttvx4-master-0 is unhealthy
      .....
      machine-approver 4.12.36 True False False 8h
      machine-config 4.11.50 True True True 6h14m Unable to apply 4.12.36: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]
      marketplace 4.12.36 True False False 8h
      monitoring 4.12.36 True False False 4h15m
      network 4.12.36 True False False 8h
      node-tuning 4.12.36 True False False 5h15m
      openshift-apiserver 4.12.36 True False True 4h19m APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
      ...
      operator-lifecycle-manager-packageserver 4.12.36 True False False 8h
      service-ca 4.12.36 True False False 8h
      storage 4.12.36 True False False 8h 
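      When the etcd ClusterOperator reports missing members as above, more detail is usually available from the operator resources themselves; these are generic read-only checks, not specific to this bug:

      $ oc describe clusteroperator etcd
      $ oc get etcd cluster -o yaml    # operator.openshift.io etcd CR; status conditions describe member health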
      $ oc get pods -n openshift-etcd
      NAME READY STATUS RESTARTS AGE
      etcd-guard-ostest-ttvx4-master-0 0/1 Running 0 4h33m
      etcd-guard-ostest-ttvx4-master-1 1/1 Running 0 4h22m
      etcd-guard-ostest-ttvx4-master-2 1/1 Running 0 5h40m
      etcd-ostest-ttvx4-master-0 3/4 Error 58 (5m10s ago) 4h25m
      etcd-ostest-ttvx4-master-1 4/4 Running 0 4h25m
      etcd-ostest-ttvx4-master-2 4/4 Running 2 (4h28m ago) 4h48m
      installer-25-ostest-ttvx4-master-0 0/1 Completed 0 4h47m
      installer-26-ostest-ttvx4-master-0 0/1 Completed 0 4h44m
      installer-27-ostest-ttvx4-master-0 0/1 Completed 0 4h34m
      revision-pruner-25-ostest-ttvx4-master-0 0/1 Completed 0 4h47m
      revision-pruner-26-ostest-ttvx4-master-0 0/1 Completed 0 4h44m
      revision-pruner-26-ostest-ttvx4-master-1 0/1 Completed 0 4h34m
      revision-pruner-27-ostest-ttvx4-master-0 0/1 Completed 0 4h34m
      revision-pruner-27-ostest-ttvx4-master-1 0/1 Completed 0 4h34m
      $ oc logs etcd-ostest-ttvx4-master-0 -n openshift-etcd
      1a4f2630e5f2296f, unstarted, , https://172.17.5.228:2380, , true
      2f6c4ca331daa2de, started, ostest-ttvx4-master-2, https://10.196.2.249:2380, https://10.196.2.249:2379, false
      752ca6c9953eff21, started, ostest-ttvx4-master-1, https://10.196.1.187:2380, https://10.196.1.187:2379, false
      a6d1d802202a55e3, started, ostest-ttvx4-master-0, https://10.196.2.93:2380, https://10.196.2.93:2379, false
      #### attempt 0
            member={name="", peerURLs=[https://172.17.5.228:2380}, clientURLs=[]
            member={name="ostest-ttvx4-master-2", peerURLs=[https://10.196.2.249:2380}, clientURLs=[https://10.196.2.249:2379]
            member={name="ostest-ttvx4-master-1", peerURLs=[https://10.196.1.187:2380}, clientURLs=[https://10.196.1.187:2379]
            member={name="ostest-ttvx4-master-0", peerURLs=[https://10.196.2.93:2380}, clientURLs=[https://10.196.2.93:2379]
            target={name="ostest-ttvx4-master-0", peerURLs=[https://10.196.2.93:2380}, clientURLs=[https://10.196.2.93:2379]
      member "https://10.196.2.93:2380" dataDir has been destroyed and must be removed from the cluster
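
      The last log line indicates that the member whose data directory was destroyed has to be removed from the cluster before the node can rejoin. Below is a minimal sketch of the documented unhealthy-member removal, run from one of the healthy etcd pods; the pod name and member ID are taken from the listing above purely as an illustration, not as a statement of how this cluster was actually recovered:

      $ oc rsh -n openshift-etcd etcd-ostest-ttvx4-master-1
      sh-4.4# etcdctl member list -w table
      sh-4.4# etcdctl member remove a6d1d802202a55e3    # ID of the ostest-ttvx4-master-0 member flagged above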

            People

              Assignee: Martin André (maandre@redhat.com)
              Reporter: Yaakov Khodorkovski (ykhodork)
