- Story
- Resolution: Obsolete
As a follow-up to Zack's work at https://github.com/openshift/machine-config-operator/pull/2795, there are many open questions about how we handle validation and forced updates. To summarize our old behaviour:
The on-disk validation used to happen ONLY when dn.booting=true, which in theory is during boot of an MCD pod (on every restart, for example). In an upgrade flow, this in theory happens at only one point: when the MCD itself gets updated (the daemonset rolling out), which should happen before the node attempts an upgrade.
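To make the old gating concrete, here is a minimal sketch. It is not the MCD's actual code: the `daemon` type and both functions are hypothetical, and only the `booting` flag corresponds to something named in this ticket.

```go
package main

// daemon is a hypothetical stand-in for the MCD's state. Only the booting
// flag corresponds to something named in this ticket (dn.booting).
type daemon struct {
	booting bool
}

// shouldValidateOnDisk models the old behaviour: on-disk validation is gated
// entirely on the booting flag.
func (dn *daemon) shouldValidateOnDisk() bool {
	return dn.booting
}

// countValidations replays a sequence of syncs, where each entry says whether
// the daemon considered itself "booting" for that sync, and returns how many
// of those syncs would have validated on-disk state.
func countValidations(bootingPerSync []bool) int {
	n := 0
	for _, b := range bootingPerSync {
		dn := &daemon{booting: b}
		if dn.shouldValidateOnDisk() {
			n++
		}
	}
	return n
}
```

Under this model, a pod start followed by any number of later syncs validates exactly once, which is the gap the investigation points below are about.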
Investigation point 1: What actually constitutes the "booting" condition? Are we ok with non-reboot updates? I've seen the MCO sometimes preserve logs from the previous run and sometimes not. Does that affect anything?
In terms of the forcefile, we've (perhaps wrongly) called it a "force skip validation". That's not entirely wrong, but also not entirely correct. We can actually force an update from config A to A. This isn't necessarily desirable behaviour, since we don't have a correct concept of "update from A to A".
Investigation point 2: What do we consider the forcefile to be? Is it forcing a validation skip? Is it forcing an update to the desired config? How do we actually ensure that a forced update is correct, given our hybrid non-idempotent update schema (e.g. kargs are not applied, but a file can be forced)?
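The A-to-A problem can be illustrated with a toy model. Everything here is hypothetical (the `config` and `node` types, `forceUpdate`, `matches`), and it assumes files are rewritten unconditionally while kargs are only applied as a difference between the old and new config; it is a sketch of the concern, not the MCO's implementation.

```go
package main

// config is a hypothetical, stripped-down rendered config.
type config struct {
	files map[string]string // path -> contents
	kargs []string
}

// node is a hypothetical view of on-disk/booted state.
type node struct {
	files map[string]string
	kargs map[string]bool
}

// addedKargs returns the kargs present in next but not in prev.
func addedKargs(prev, next []string) []string {
	seen := make(map[string]bool, len(prev))
	for _, k := range prev {
		seen[k] = true
	}
	var out []string
	for _, k := range next {
		if !seen[k] {
			out = append(out, k)
		}
	}
	return out
}

// forceUpdate models the assumed hybrid behaviour: files are written
// unconditionally (idempotent), kargs only as a prev -> next difference.
func forceUpdate(n *node, prev, next config) {
	for path, contents := range next.files {
		n.files[path] = contents
	}
	for _, k := range addedKargs(prev.kargs, next.kargs) {
		n.kargs[k] = true
	}
}

// matches reports whether the node state satisfies the config.
func matches(n *node, c config) bool {
	for path, contents := range c.files {
		if n.files[path] != contents {
			return false
		}
	}
	for _, k := range c.kargs {
		if !n.kargs[k] {
			return false
		}
	}
	return true
}
```

With a node whose file has drifted and whose karg is missing, `forceUpdate(n, a, a)` repairs the file but leaves the karg unapplied, so `matches(n, a)` is still false after the "forced update".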
Investigation point 3: Our validation today is very limited. It really just validates files, units and the OS image. Many other commonly used subfields are not validated at all. There are even fields that we don't consider at all during an update, e.g. Tang keys. See: https://bugzilla.redhat.com/show_bug.cgi?id=1964701
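One way to track the coverage gap is to list which config fields validation actually looks at and report the rest. The sketch below is hypothetical: the field names are labels chosen for illustration (only files, units, OS image and Tang keys come from this ticket), and `unvalidatedFields` is not an existing MCO function.

```go
package main

import "sort"

// validated lists the field groups that validation covers today, per this
// ticket. The string labels themselves are illustrative.
var validated = map[string]bool{
	"files":   true,
	"units":   true,
	"osImage": true,
}

// unvalidatedFields returns, in sorted order, the fields a config uses that
// validation never checks.
func unvalidatedFields(used []string) []string {
	var out []string
	for _, f := range used {
		if !validated[f] {
			out = append(out, f)
		}
	}
	sort.Strings(out)
	return out
}
```

For a config that uses files, units and Tang keys, this reports only the Tang keys as unchecked.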
Investigation point 4: Gating a lot of our checks behind dn.booting is a bit weird, and I am not sure why this was done initially. Assuming our booting assumptions are correct (first run), why are we not checking for pending configs on future updates, after that boot loop has finished? Should we be pulling a lot of that into the main sync logic?
Investigation point 5: Could we even have an "enforcing mode", given that I've seen users sometimes take advantage of the lack of checking to make manual changes? What is our stance on control of MCO-managed parameters?
- relates to: MCO-68 Check for config drift regularly (Closed)