We're seeing a slight uptick in how long upgrades take. We are not 100% sure of the cause, but it looks like it started with 4.11 rc.7. There are no obvious culprits in the diff.
Looking at some of the jobs, the gap between kube-scheduler being updated and machine-api following appears to be getting longer. Example job run showing 10+ minutes waiting for it.
TRT had a debugging session, and we have two suggestions:
- Add logging around when CVO sees an operator's version change
- Instead of polling at a fixed 5-minute interval (which is what we think CVO is doing), would it be possible to trigger on changes to the ClusterOperator (CO) to know when to look again? We think this could save substantial time on upgrades.