Let's say we start with a 4.10 cluster where some nodes have both the infra and worker roles (or the master and worker roles) while others have only the worker role. The infra nodes are managed by their own infra MachineConfigPool, which inherits the default worker MachineConfigs (in the supported way).
In such a cluster, if we upgrade to 4.11, the upgrade completes but keepalived-monitor never performs the migration, because it mistakenly believes the upgrade is still in progress. This can be confirmed by log messages like the following:
time="2023-06-23T14:40:00Z" level=info msg="Failed to retrieve upgrade status or Upgrade still running" err="<nil>" upgradeRunning=true
If we temporarily remove one of the roles, so that each node carries a single role, the migration eventually runs and completes successfully.
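The fact that dual-role nodes block the check while single-role nodes do not suggests a per-role tally that double-counts nodes carrying two roles. The following is a hypothetical illustration of that failure mode — it is not the actual keepalived-monitor code, and the real cause may differ; node names, the `upgrade_running` helper, and the data layout are all invented for the sketch.

```python
# Hypothetical sketch of a naive "upgrade finished?" check that tallies
# nodes once per role. A node with two roles contributes two entries to
# the total, so the count of upgraded nodes can never reach it and the
# check reports the upgrade as still running forever.

def upgrade_running(nodes, target_version):
    """Return True while the naive check thinks the upgrade is ongoing."""
    # BUG (illustrative): summing role memberships double-counts
    # dual-role nodes in the expected total.
    total = sum(len(n["roles"]) for n in nodes)
    upgraded = sum(1 for n in nodes if n["version"] == target_version)
    return upgraded < total

nodes = [
    {"name": "node-a", "roles": ["infra", "worker"], "version": "4.11"},
    {"name": "node-b", "roles": ["worker"], "version": "4.11"},
]

# Both nodes are already on 4.11, yet the naive tally sees 2 upgraded
# nodes against 3 role memberships and keeps reporting upgradeRunning=true.
print(upgrade_running(nodes, "4.11"))
```

With single-role nodes the two counts line up and the check terminates, which matches the observed workaround of temporarily stripping the extra role from each node.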