-
Story
-
Resolution: Done
Currently, the Machine Config Server (MCS) serves a newly built layered image to scaling nodes as soon as the MachineOSBuild (MOSB) succeeds. This differs from the existing MCS behavior for non-layered nodes, which only serves the new rendered config if `UpdatedMachineCount > 0` (i.e., at least one node has successfully updated to it).
For improved safety during node scale-up while updates are happening, we should enhance the logic in `resolveDesiredImageForPool()` to only serve a new layered image build if:
1. The MachineOSBuild is successful, AND
2. At least one node has already updated to that build
Current Behavior:
- MCS serves new layered image immediately when MOSB status is successful
- New nodes scaling up during a rollout get the new image right away
Proposed Behavior:
- MCS serves new layered image only after at least one node has successfully updated to it
- New nodes scaling up during early rollout get the old image, then update after joining the cluster
- This matches the existing MCS rollout safety model for non-layered nodes
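The proposed gating could look roughly like the sketch below. It is illustrative only: `poolState` and `imageToServe` are hypothetical names, not the real types or helpers used by `resolveDesiredImageForPool()`, and the fields are a simplified stand-in for whatever the MCS actually reads from the pool and the MachineOSBuild.

```go
package main

// poolState is a hypothetical, simplified view of the data the MCS would
// consult. It is NOT the real MachineConfigPool or MachineOSBuild API.
type poolState struct {
	CurrentImage        string // image the pool's existing nodes already run
	NewImage            string // image produced by the latest MachineOSBuild
	BuildSucceeded      bool   // condition 1: the MachineOSBuild is successful
	UpdatedMachineCount int    // nodes that have finished updating to NewImage
}

// imageToServe returns the image a scaling node should receive.
// The new image is served only when the build succeeded AND at least one
// node has already updated to it; otherwise the current image is served.
func imageToServe(p poolState) string {
	if p.NewImage != "" && p.BuildSucceeded && p.UpdatedMachineCount > 0 {
		return p.NewImage
	}
	return p.CurrentImage
}
```

With this shape, a node that scales up while `UpdatedMachineCount` is still 0 boots with the current image and picks up the new one later through the normal update path, which is the extra reboot noted under Trade-offs.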
Trade-offs:
- Pro: Better safety - ensures at least one node has proven the new build works before serving to new nodes
- Con: More reboots - nodes scaling during updates will boot with old image, then reboot to new image after joining
- Pro: Matches existing MCS behavior patterns, making the system more predictable
Investigation Required:
- Confirm that MCS can silently skip serving new image without blocking node provisioning
- Verify that nodes will retry requests if MCS doesn't respond (rather than timing out)
- Test behavior when scaling nodes during active MOSB updates
Code Location:
pkg/server/cluster_server.go - resolveDesiredImageForPool() function (around lines 311-345)