  1. OpenShift Bugs
  2. OCPBUGS-76610

MCD Validation Discrepancy during Interrupted rpm-ostree Finalization

    • Bug
    • Resolution: Unresolved
    • Major
    • 4.19.z
    • RHCOS
    • Critical

      Summary: MCD enters a Degraded loop after `rpm-ostree` finalization fails, because the on-disk `currentconfig` and the Node's `currentConfig` annotation disagree

      Issue Type: Bug
      Priority: High (Severity 2)
      Component: Machine Config Operator (MCO), RHCOS
      Labels: shift_sno, mco, rpm-ostree

      Description :

      In OpenShift 4.18/4.19, the Machine Config Daemon (MCD) can enter a persistent "Degraded" state if a node reboot is interrupted or if `ostree-finalize-staged.service` fails to complete (e.g., due to a timeout).

      During this failure mode, the MCD writes the "new" configuration to the on-disk `currentconfig` file before the kernel has successfully pivoted into it. Upon reboot, the MCD prioritizes the on-disk file over the Node's `currentConfig` annotation, leading to a validation failure because the actual running state (kernel arguments, etc.) matches the old configuration, not the "new" one recorded on disk.

      Steps to Reproduce : 

      1. Trigger a Configuration Change: Apply a `MachineConfig` that requires a reboot and additional kernel arguments (e.g., `audit=1`, `page_poison=1`).
      2. Simulate/Induce Finalization Failure: Interrupt the `ostree-finalize-staged.service` during the shutdown/reboot process so it exits with a 'timeout' or failure result.
      3. Boot into Old Deployment: Allow the node to reboot. Because finalization failed, the node remains on the previous RHCOS deployment/kernel version.
      4. Observe MCD Logs: Check the logs of the `machine-config-daemon` pod.
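To confirm step 3, the running kernel command line can be compared with the arguments the new `MachineConfig` requested. The following is a minimal sketch, not MCO code: the helper name `missing_kargs` is hypothetical, and it assumes the command line text was obtained from `/proc/cmdline` on the node and that the arguments contain no quoted whitespace.

```python
def missing_kargs(cmdline: str, expected: list[str]) -> list[str]:
    """Return the expected kernel arguments that are absent from cmdline.

    cmdline is the raw text of /proc/cmdline; arguments are assumed to be
    whitespace-separated tokens (no quoted values containing spaces).
    """
    present = set(cmdline.split())
    # Preserve the order of `expected` so the report is easy to read.
    return [arg for arg in expected if arg not in present]
```

For example, passing the contents of `/proc/cmdline` together with `["audit=1", "page_poison=1"]` returns both arguments when the node is still running the old deployment, and an empty list once the new deployment has been booted.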

      Actual Results : 

      • Validation Error: MCD detects that the on-disk file overrides the node annotation:
        `"Disk currentConfig \"rendered-master-NEW\" overrides node's currentConfig annotation \"rendered-master-OLD\""`.
      • State Mismatch: MCD validates the system against the NEW config but finds the OLD state, resulting in a Degraded status:
        `"unexpected on-disk state validating against rendered-master-NEW: missing expected kernel arguments: [...]"`.
      • Persistent Loop: The node remains Degraded and fails to reconcile because it is stuck validating against a configuration it never successfully booted.
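The two log signatures above can be detected mechanically. This is a rough sketch that scans MCD log text for both messages; the function name `find_stale_currentconfig` and the returned dictionary layout are hypothetical, and the patterns are derived only from the two messages quoted in this report, so other log formats may need adjusting.

```python
import re

# A double quote, optionally backslash-escaped as it appears in quoted log lines.
_Q = r'\\?"'
_NAME = r'([^"\\]+)'

OVERRIDE_RE = re.compile(
    "Disk currentConfig " + _Q + _NAME + _Q
    + " overrides node's currentConfig annotation " + _Q + _NAME + _Q
)
MISMATCH_RE = re.compile(r"unexpected on-disk state validating against (\S+?):")


def find_stale_currentconfig(log_text: str):
    """Return details of the override/mismatch pair, or None if no override is logged."""
    override = OVERRIDE_RE.search(log_text)
    if override is None:
        return None
    disk, annotation = override.groups()
    mismatch = MISMATCH_RE.search(log_text)
    validated = mismatch.group(1) if mismatch else None
    return {
        "disk": disk,
        "annotation": annotation,
        "validated_against": validated,
        # The failure mode in this report: the disk config differs from the
        # annotation and validation against the disk config failed.
        "stale": disk != annotation and validated == disk,
    }
```

A result with `"stale": True` indicates that the daemon validated against the configuration recorded on disk while the node annotation still names the previous one.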

      Expected Results :

      • The MCD should verify that the booted deployment matches the configuration defined in the on-disk `currentconfig` before using it for validation.
      • If a discrepancy exists (i.e., the pivot failed), the MCD should fall back to the Node's `currentConfig` annotation or automatically clean up the stale on-disk file to allow for a clean reconciliation.
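The expected behavior can be illustrated with a small model. This is a sketch under stated assumptions, not the MCD's implementation: `RenderedConfig`, `booted_state_matches`, and `select_validation_config` are hypothetical names, and the booted state is reduced to kernel arguments only, whereas a real check would presumably also cover the booted OS image and other on-disk state.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class RenderedConfig:
    name: str
    kernel_args: Tuple[str, ...]


def booted_state_matches(config: RenderedConfig, booted_kargs: Sequence[str]) -> bool:
    """True if every kernel argument the config expects is present at boot."""
    return set(config.kernel_args) <= set(booted_kargs)


def select_validation_config(
    disk_config: Optional[RenderedConfig],
    annotation_config: RenderedConfig,
    booted_kargs: Sequence[str],
) -> Tuple[RenderedConfig, bool]:
    """Pick the config to validate against; the bool flags a stale disk file."""
    # No on-disk file, or both sources agree: nothing to arbitrate.
    if disk_config is None or disk_config.name == annotation_config.name:
        return annotation_config, False
    # The sources disagree: trust the disk config only if the node really booted it.
    if booted_state_matches(disk_config, booted_kargs):
        return disk_config, False
    # The pivot did not happen: fall back to the annotation and mark the file stale.
    return annotation_config, True
```

With the scenario from this report (disk says `rendered-master-NEW`, annotation says `rendered-master-OLD`, and the booted kernel lacks the new arguments), the function returns the old config together with `True`, signalling that the on-disk file should be cleaned up.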

      Environment Information : 

      • OCP Version: 4.18 / 4.19
      • Platform: Single Node OpenShift (SNO), Disconnected, Bare Metal
      • OS: RHCOS
      • Disconnected, air-gapped, and secured environment

      Workaround : 

      1. Access the node via SSH.
      2. Back up the stale on-disk config (copy it elsewhere first), then delete it: `rm /etc/machine-config-daemon/currentconfig`.
      3. Reboot the node: `systemctl reboot`.
      4. After reboot, the MCD will correctly reconcile against the Node annotation.
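Step 2 only shows the delete command, so the order of operations (copy first, delete second) is sketched below. The helper `backup_and_remove` and the `.bak` suffix are hypothetical, and the backup directory is an assumption to be chosen by the operator. Python may not be available on an RHCOS host, so this is meant as an illustration of the sequence rather than something to run on the node as-is.

```python
import shutil
from pathlib import Path


def backup_and_remove(config_path, backup_dir) -> Path:
    """Copy config_path into backup_dir, then delete the original.

    Returns the path of the backup copy. The original is removed only
    after the copy has completed without raising.
    """
    src = Path(config_path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / (src.name + ".bak")
    shutil.copy2(src, dest)  # preserves timestamps and permissions
    src.unlink()
    return dest
```

On a node this would correspond to calling it with `/etc/machine-config-daemon/currentconfig` as the first argument, with sufficient privileges, before rebooting as in step 3.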

              Unassigned
              Novonil Choudhuri (rhn-support-nchoudhu)
              Sergio Regidor de la Rosa