Type: Story
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Labels: None
Blocked: False
Ready: False
Epic Link: On Cluster Layering Enhancements
Original story points: 5
WSJF: 0

Because of how on-cluster layering support was added to the Machine Config Daemon (MCD), there are bound to be edge cases in how layering interacts with the MCD's other components.

For example, one edge case I encountered involves the Config Drift Monitor. To reproduce the situation:

1. A node has opted into layering and is currently booted into an on-cluster built layered OS image.
2. One of the files identified in the MachineConfig is mutated, which causes the Config Drift Monitor to fire and degrade the node.
3. The file's contents are restored, which causes the on-disk state validation to succeed. However, because the node was previously degraded, the MCD attempts to force a sync.
4. When the sync reaches the updateImage() part of the update, rpm-ostree fails as follows:

# rpm-ostree rebase 'ostree-unverified-registry:image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/os-image@sha256:5fb3e0a4735f3451b8c0e8e762dd7de2b224feb6070a3abd0b9e6d57b050bc87'
error: Old and new refs are equal: ostree-unverified-registry:image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/os-image@sha256:5fb3e0a4735f3451b8c0e8e762dd7de2b224feb6070a3abd0b9e6d57b050bc87

It makes sense that rpm-ostree fails in this way: it cannot reapply the OS image it is already running, nor would doing so accomplish anything. If memory serves, when config drift is resolved for a MachineConfig-managed (non-layered) node, the MCD cordons and drains the node, rewrites the files, skips the reboot, then uncordons the node. If one uses the forcefile, the MCD forces a reboot before uncordoning the node.

Ideally, the MCD should detect that the config drift has been resolved, check whether rpm-ostree already has the same ref on disk, and no-op if that is the case, transitioning the node from Degraded -> Done in the process.
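
To make the proposed check concrete, here is a rough sketch in Go of the "same ref, so no-op" decision. It is not the MCD's actual code: the names deployment, status, bootedImageRef, and shouldSkipRebase are hypothetical, and the sketch assumes that the output of rpm-ostree status --json exposes a booted flag and a container-image-reference field for each deployment, and that this field uses the same form (including the ostree-unverified-registry: prefix) as the ref passed to rpm-ostree rebase. If the forms differ, the caller would need to normalize them before comparing.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// deployment holds only the fields this sketch needs. The JSON field names
// are assumed from the output of `rpm-ostree status --json`.
type deployment struct {
	Booted                  bool   `json:"booted"`
	ContainerImageReference string `json:"container-image-reference"`
}

// status is the (assumed) top-level shape of `rpm-ostree status --json`.
type status struct {
	Deployments []deployment `json:"deployments"`
}

// bootedImageRef (hypothetical helper) returns the container image reference
// of the deployment marked as booted.
func bootedImageRef(statusJSON []byte) (string, error) {
	var s status
	if err := json.Unmarshal(statusJSON, &s); err != nil {
		return "", fmt.Errorf("parsing rpm-ostree status: %w", err)
	}
	for _, d := range s.Deployments {
		if d.Booted {
			return d.ContainerImageReference, nil
		}
	}
	return "", errors.New("no booted deployment found")
}

// shouldSkipRebase (hypothetical helper) reports whether the booted
// deployment already matches desiredRef, in which case a rebase would fail
// with "Old and new refs are equal" and can be treated as a no-op.
func shouldSkipRebase(statusJSON []byte, desiredRef string) (bool, error) {
	current, err := bootedImageRef(statusJSON)
	if err != nil {
		return false, err
	}
	// An empty reference means the booted deployment is not a container
	// image deployment, so never skip in that case.
	return current != "" && current == desiredRef, nil
}
```

A caller would feed shouldSkipRebase the bytes captured from rpm-ostree status --json along with the desired image ref, and only invoke the rebase when it returns false. Keeping the decision in a pure function that takes bytes rather than shelling out makes it straightforward to unit test against captured status output.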

Assignee: Unassigned
Reporter: Zack Zlotnik
Votes: 0
Watchers: 1
Created: 2023/08/24 3:44 PM
Updated: 2025/03/30 3:53 PM
