Machine Config Operator / MCO-1940

Enhance MCS layered image serving safety during node scale-up by requiring node validation

    • Type: Story
    • Resolution: Done
    • Sprint: MCO Sprint 279, MCO Sprint 280

       Currently, the MCS (Machine Config Server) serves a newly built layered image to scaling nodes as soon as the MachineOSBuild (MOSB) succeeds. This differs from the existing MCS behavior for non-layered nodes, which serves the new rendered config only if UpdatedMachineCount > 0 (i.e., at least one node has successfully updated to it).

      For improved safety during node scale-up while updates are happening, we should enhance the logic in `resolveDesiredImageForPool()` to only serve a new layered image build if:
        1. The MachineOSBuild is successful, AND
        2. At least one node has already updated to that build
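The two conditions above can be sketched as a single gating check. This is a minimal illustration only: `poolState`, its fields, and `imageToServe` are hypothetical names invented for the example, not the actual types or signature used by `resolveDesiredImageForPool()`.

```go
package main

import "fmt"

// poolState is a hypothetical, simplified view of a layered pool.
// The real MachineConfigPool and MachineOSBuild objects carry this
// information in a different shape.
type poolState struct {
	CurrentImage    string // image the existing nodes are running
	NewImage        string // image produced by the latest MachineOSBuild
	BuildSucceeded  bool   // condition 1: the MachineOSBuild is successful
	NodesOnNewImage int    // condition 2: nodes already updated to the new build
}

// imageToServe returns the image a newly provisioned node should boot with.
// The new image is served only when both conditions hold; otherwise the
// node gets the current image and updates after it joins the cluster.
func imageToServe(p poolState) string {
	if p.BuildSucceeded && p.NodesOnNewImage > 0 {
		return p.NewImage
	}
	return p.CurrentImage
}

func main() {
	scenarios := []poolState{
		{CurrentImage: "old", NewImage: "new", BuildSucceeded: false, NodesOnNewImage: 0},
		{CurrentImage: "old", NewImage: "new", BuildSucceeded: true, NodesOnNewImage: 0},
		{CurrentImage: "old", NewImage: "new", BuildSucceeded: true, NodesOnNewImage: 1},
	}
	for _, s := range scenarios {
		fmt.Printf("built=%v updated=%d -> serve %q\n",
			s.BuildSucceeded, s.NodesOnNewImage, imageToServe(s))
	}
}
```

Falling back to the current image, rather than returning an error, is an assumption made here so that node provisioning is never blocked; whether the MCS can do this silently is one of the open investigation items.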

        Current Behavior:
        - MCS serves new layered image immediately when MOSB status is successful
        - New nodes scaling up during a rollout get the new image right away

        Proposed Behavior:
        - MCS serves new layered image only after a node has validated it
        - New nodes scaling up during early rollout get the old image, then update after joining cluster
        - This matches the existing MCS rollout safety model for non-layered nodes

        Trade-offs:
        - Pro: Better safety - ensures at least one node has proven the new build works before serving to new nodes
        - Con: More reboots - nodes scaling during updates will boot with old image, then reboot to new image after joining
        - Pro: Matches existing MCS behavior patterns, making the system more predictable

        Investigation Required:
        - Confirm that MCS can silently skip serving new image without blocking node provisioning
        - Verify that nodes will retry requests if MCS doesn't respond (rather than timing out)
        - Test behavior when scaling nodes during active MOSB updates

      Code Location:
        pkg/server/cluster_server.go - resolveDesiredImageForPool() function (around lines 311-345)

              dkhater@redhat.com Dalia Khater
