During upgrade tests, the MCO will become temporarily degraded with the following events showing up in the event log:
This seems to be occurring with some frequency as indicated by its prevalence in CI search:
The MCO should not become degraded during an upgrade unless it cannot proceed with the upgrade. In these failures, I think we're timing out partway through the node reboots: one or two of the control plane nodes are ready while the third is still unready. The MCO eventually requeues the syncRequiredMachineConfigPools step, the remaining node(s) reboot, and the MCO clears the Degraded status.
Indeed, looking at the event breakdown, one can see that control plane nodes take ~21 minutes to roll out their new config with OS upgrades. By comparison, the worker nodes take ~15 minutes.
Meanwhile, the portion of the MCO that performs this sync (the syncRequiredMachineConfigPools function) has a hard-coded timeout of 10 minutes. Additionally, to my understanding, there is a further 10-minute grace period before the MCO marks itself Degraded. Since the control plane nodes take ~21 minutes to fully reboot and roll out their new configs, we exceed that combined 20-minute budget. With this in mind, I propose a path forward:
- Figure out why control plane nodes are taking more than 20 minutes to perform OS upgrades. My initial guess is that it has to do with etcd reestablishing quorum before proceeding to the next control plane node, a delay the worker nodes don't incur.
- If we conclude that OS upgrades simply take longer for control plane nodes, then maybe we could bump the timeout. Ideally, we would bump it only for the control plane nodes, but that may require some refactoring.