  1. OpenShift Bugs
  2. OCPBUGS-56445

Boot Image Controller should not degrade when golden configmap is slow to update

    • Low
    • None
    • 1
    • MCO Sprint 271
    • 1
    • False
    • Release Note Not Required
    • In Progress

      This is a clone of issue OCPBUGS-56211. The following is the description of the original issue:

      This was noticed during the upgrade (4.19.ec-4 to 4.19.ec-5) of a build cluster in tech preview mode (slack thread). The boot image controller (MSBIC) currently waits up to 15 minutes for at least one master node to be updated; that update is signaled by the golden configmap recording the current MCO hash and release version. The wait was originally put in place so that the MCO does not update the boot images before the new MCO images have been rolled out, which could result in "backward" pivots if a node is scaled during an upgrade.

      During the upgrade of the build cluster, a CRD apply failed (unrelated to the boot image updates), causing the MSBIC to time out before the new MCO controllers had even been rolled out. The MCO's operator pod is responsible for rolling out the new controller pods. However, because of the degrade set by the MSBIC, that rollout did not take place, and the operator was stuck in a never-ending loop.

      To rectify this, instead of waiting up to 15 minutes, the MSBIC will exit the sync non-fatally and attempt a follow-up sync when the golden configmap is updated. The boot images will still be updated, but a slow start to the cluster upgrade will no longer cause the controller to degrade.

              djoshy David Joshy
              openshift-crt-jira-prow OpenShift Prow Bot
              Sergio Regidor de la Rosa Sergio Regidor de la Rosa
