Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-24705

Opting into on-cluster builds sometimes does not respect maxUnavailable setting


      Description of problem:

      When opting into on-cluster builds on both the worker and control plane MachineConfigPools, the maxUnavailable value on the MachineConfigPools is not respected when the newly built image is rolled out to all of the nodes in a given pool.

      Version-Release number of selected component (if applicable):


      How reproducible:

      Sometimes reproducible. I'm still working on figuring out what conditions need to be present for this to occur.

      Steps to Reproduce:

          1. Opt an OpenShift cluster in on-cluster builds by following these instructions: https://github.com/openshift/machine-config-operator/blob/master/docs/OnClusterBuildInstructions.md
          2. Ensure that both the worker and control plane MachineConfigPools are opted in.    

      Actual results:

      Multiple nodes in both the control plane and worker MachineConfigPools are drained and cordoned simultaneously, irrespective of the maxUnavailable value. This is particularly problematic for control plane nodes since draining more than one control plane node at a time can cause etcd issues, in addition to PDBs (Pod Disruption Budgets) which can make the config change take substantially longer or block completely.
      I've mostly seen this issue affect control plane nodes, but I've also seen it impact both control plane and worker nodes.

      Expected results:

      I would have expected the new OS image to be rolled out in a similar fashion as new MachineConfigs are rolled out. In other words, a single node (or nodes up to maxUnavailable for non-control-plane nodes) is cordoned, drained, updated, and uncordoned at a time.

      Additional info:

      I suspect the bug may be someplace within the NodeController since that's the part of the MCO that controls which nodes update at a given time. That said, I've had difficulty reliably reproducing this issue, so finding a root cause could be more involved. This also seems to be mostly confined to the initial opt-in process. Subsequent updates seem to follow the original "rules" more closely.

            zzlotnik@redhat.com Zack Zlotnik
            zzlotnik@redhat.com Zack Zlotnik
            Sergio Regidor de la Rosa Sergio Regidor de la Rosa
            0 Vote for this issue
            5 Start watching this issue