Story
Resolution: Unresolved
Normal
Given how layering support was added to the Machine Config Daemon (MCD), there are bound to be edge cases in how layering interacts with the MCD's various components.
For example, one edge case I encountered involves the Config Drift Monitor. To reproduce it:
- A node has opted into layering and is currently booted into an on-cluster built layered OS image.
- One of the files specified in the MachineConfig is modified, which causes the Config Drift Monitor to fire and degrade the node.
- The file's contents are restored, which causes the on-disk state validation to succeed. However, because the node was previously degraded, it will attempt to force a sync.
- When the sync reaches the updateImage() part of the update, rpm-ostree fails as follows:
# rpm-ostree rebase 'ostree-unverified-registry:image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/os-image@sha256:5fb3e0a4735f3451b8c0e8e762dd7de2b224feb6070a3abd0b9e6d57b050bc87'
error: Old and new refs are equal: ostree-unverified-registry:image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/os-image@sha256:5fb3e0a4735f3451b8c0e8e762dd7de2b224feb6070a3abd0b9e6d57b050bc87
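One way to avoid issuing this rebase is to look up the booted image first. The following is a minimal Go sketch, assuming that `rpm-ostree status --json` reports a `deployments` array whose entries carry a `booted` flag and a `container-image-reference` string. Those field names are assumptions, `bootedImageRef` is a hypothetical helper (not an existing MCD function), and the sample JSON is hand-written rather than captured from a node.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// rpmOstreeStatus models only the subset of `rpm-ostree status --json`
// used here. The JSON field names are assumptions about that output.
type rpmOstreeStatus struct {
	Deployments []struct {
		Booted                  bool   `json:"booted"`
		ContainerImageReference string `json:"container-image-reference"`
	} `json:"deployments"`
}

// bootedImageRef (hypothetical helper) returns the container image reference
// of the booted deployment, or an error if the JSON is invalid or no
// deployment is marked as booted.
func bootedImageRef(statusJSON []byte) (string, error) {
	var status rpmOstreeStatus
	if err := json.Unmarshal(statusJSON, &status); err != nil {
		return "", fmt.Errorf("could not parse rpm-ostree status: %w", err)
	}
	for _, d := range status.Deployments {
		if d.Booted {
			return d.ContainerImageReference, nil
		}
	}
	return "", errors.New("no booted deployment found")
}

func main() {
	// Hand-written sample in the assumed shape; not real rpm-ostree output.
	sample := []byte(`{"deployments": [
		{"booted": false, "container-image-reference": "ostree-unverified-registry:example.com/os-image@sha256:bbb"},
		{"booted": true, "container-image-reference": "ostree-unverified-registry:example.com/os-image@sha256:aaa"}
	]}`)
	ref, err := bootedImageRef(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("booted image ref:", ref)
}
```

Comparing the returned ref against the target image before calling `rpm-ostree rebase` would let the update path skip a rebase that is guaranteed to fail.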
This failure makes sense: rpm-ostree cannot reapply the currently booted OS image, nor would it make sense to. If memory serves, with Config Drift on MachineConfigs, the MCD will cordon and drain the node, rewrite the files, skip the reboot, then undrain and uncordon the node. If the forcefile is used, the MCD forces a reboot before undraining and uncordoning the node.
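The recovery sequence recalled above can be written down as an ordered list of steps. This is only a sketch of that recollection: `recoverFromDrift` and the step names are hypothetical labels, not MCD functions, and the only branch modeled is whether the forcefile is present.

```go
package main

import "fmt"

// recoverFromDrift returns the ordered steps of the config-drift recovery
// sequence as recalled in this issue. Step names are hypothetical labels.
func recoverFromDrift(forcefilePresent bool) []string {
	steps := []string{"cordon", "drain", "rewrite files"}
	if forcefilePresent {
		// With the forcefile, a reboot happens before the node is returned to service.
		steps = append(steps, "reboot")
	}
	// In both cases the node is undrained and uncordoned at the end.
	return append(steps, "undrain", "uncordon")
}

func main() {
	fmt.Printf("%q\n", recoverFromDrift(false))
	fmt.Printf("%q\n", recoverFromDrift(true))
}
```

Notably, neither path in this sequence involves touching the OS image, which is why reaching updateImage() during a forced sync is surprising.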
Ideally, we should detect that the config drift has been resolved, check whether rpm-ostree already has the same ref on disk, and no-op if so, transitioning the node from Degraded -> Done in the process.
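The proposed check can be sketched as a small decision function. It assumes that the booted and desired image refs are already available as strings. `decideImageUpdate`, `nodeState`, and `action` are hypothetical names, not existing MCD types. The function returns what to do with the OS image and the node state to move to.

```go
package main

import "fmt"

// nodeState is a hypothetical stand-in for the MCD's node states.
type nodeState string

const (
	stateDegraded nodeState = "Degraded"
	stateDone     nodeState = "Done"
)

// action is what the (hypothetical) update path does with the OS image.
type action string

const (
	actionNoOp   action = "no-op"
	actionRebase action = "rebase"
)

// decideImageUpdate sketches the proposal: if rpm-ostree already has the
// desired ref, skip the rebase, and if the node was degraded only because of
// config drift that has since been resolved, move it to Done.
func decideImageUpdate(current nodeState, driftResolved bool, bootedRef, desiredRef string) (action, nodeState) {
	if bootedRef != desiredRef {
		// The image really differs, so a rebase is still required.
		return actionRebase, current
	}
	// Rebasing onto the booted ref fails with "Old and new refs are equal",
	// so the image update is skipped entirely.
	if current == stateDegraded && driftResolved {
		return actionNoOp, stateDone
	}
	return actionNoOp, current
}

func main() {
	ref := "ostree-unverified-registry:example.com/os-image@sha256:aaa"
	act, next := decideImageUpdate(stateDegraded, true, ref, ref)
	fmt.Println(act, next) // prints "no-op Done"
}
```

Keeping the comparison separate from the drift check means a genuinely new image still gets rebased even if the node is degraded, while the scenario in this issue resolves to a no-op.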