OCPBUGS-27359

Spurious "wait has exceeded 40 minutes" when etcd operator briefly goes degraded in late upgrade

    • Release Note Not Required

      This is a clone of issue OCPBUGS-25862. The following is the description of the original issue:

      Description of problem:

      At 17:26:09, the cluster is happily upgrading nodes:

      An update is in progress for 57m58s: Working towards 4.14.1: 734 of 859 done (85% complete), waiting on machine-config
      

      At 17:26:54, the upgrade starts rebooting master nodes and cluster operators (COs) get noisy (this one specifically is OCPBUGS-20061):

      An update is in progress for 58m50s: Unable to apply 4.14.1: the cluster operator control-plane-machine-set is not available
      

      ~Two minutes later, at 17:29:07, CVO starts to shout about waiting on operators for over 40 minutes, despite not having indicated that anything was wrong earlier:

      An update is in progress for 1h1m2s: Unable to apply 4.14.1: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver
      

      This is only because these operators go briefly degraded during the master reboot (which they shouldn't, but that is a different story). CVO computes its 40-minute threshold against the time when it first started to upgrade the given operator, so it (see the sketch after this list):

      1. Upgrades etcd / KAS very early in the upgrade, noting the time when it started to do that
      2. These two COs upgrade successfully and the upgrade proceeds
      3. Eventually the cluster starts rebooting masters and etcd/KAS go degraded
      4. CVO compares the current time against the noted time, discovers that more than 40 minutes have passed, and starts warning about it.
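
      The following is a minimal, self-contained Go sketch of the timing behaviour described above. The names (operatorWaitStart, noteUpgradeStart, reportSlowOperators) and the threshold handling are hypothetical illustrations, not the actual cluster-version-operator code; the point is only that the 40-minute comparison is made against when the operator's upgrade first started, not against how long the operator has currently been unhealthy.

      // Hypothetical sketch of the reported behaviour; not the real CVO implementation.
      package main

      import (
          "fmt"
          "time"
      )

      // operatorWaitStart records, in memory only, when CVO first started
      // working on each cluster operator during this upgrade.
      var operatorWaitStart = map[string]time.Time{}

      // noteUpgradeStart is called once per operator, early in the upgrade.
      func noteUpgradeStart(name string, now time.Time) {
          if _, ok := operatorWaitStart[name]; !ok {
              operatorWaitStart[name] = now
          }
      }

      // reportSlowOperators reproduces the spurious warning: any operator that
      // is currently unhealthy is compared against the time noted when its
      // upgrade started, not against how long it has been unhealthy right now.
      func reportSlowOperators(unhealthy []string, now time.Time) string {
          var slow []string
          for _, name := range unhealthy {
              if start, ok := operatorWaitStart[name]; ok && now.Sub(start) > 40*time.Minute {
                  slow = append(slow, name)
              }
          }
          if len(slow) == 0 {
              return ""
          }
          return fmt.Sprintf("wait has exceeded 40 minutes for these operators: %v", slow)
      }

      func main() {
          start := time.Now()
          // etcd and kube-apiserver are upgraded very early and finish quickly...
          noteUpgradeStart("etcd", start)
          noteUpgradeStart("kube-apiserver", start)
          // ...but an hour later a master reboot makes them briefly degraded,
          // and the check fires immediately because an hour > 40 minutes.
          later := start.Add(time.Hour)
          fmt.Println(reportSlowOperators([]string{"etcd", "kube-apiserver"}, later))
      }

      Because the noted times in this sketch live only in process memory, restarting CVO discards them, which is why reproduction condition 2 below requires rebooting a master that does not host CVO.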

      Version-Release number of selected component (if applicable):

      all

      How reproducible:

      Not entirely deterministic:

      1. the upgrade must run for more than 40 minutes between upgrading etcd and starting to upgrade nodes
      2. the upgrade must reboot a master that is not running CVO (otherwise a new CVO instance starts without the saved times; they are only kept in memory)

      Steps to Reproduce:

      1. Watch oc adm upgrade during the upgrade

      Actual results:

      A spurious "waiting for over 40m" message pops up out of the blue

      Expected results:

      CVO simply says "waiting up to 40m on" and this eventually goes away as the node comes back up and etcd stops being degraded.
