-
Spike
-
Resolution: Won't Do
-
Minor
-
OCPSTRAT-763 - [TechPreview]Disconnected Cluster Update and Boot without local image registry - phase 1
Background:
The MachineConfig pool currently has a condition of "updating" that depends on whether or not a machine happens to be cordoned (among other things).
The logic that decides this is here: https://github.com/openshift/machine-config-operator/blob/5cc821eb953c85764c2a092d53aaae34e1f1ac17/pkg/controller/node/status.go#L77
allUpdated := updatedMachineCount == machineCount && readyMachineCount == machineCount && unavailableMachineCount == 0
And if you chase all those states back through the code you end up with more or less:
state | Per-Node Logic That Decides When We're In This State | Notes |
---|---|---|
done | currentConfig == desiredConfig AND MCD state is "Done" | |
updated | done AND currentConfig == pool.Spec.Configuration.Name | |
ready | NodeReady AND !NodeDiskPressure AND !NodeNetworkUnavailable AND !Unschedulable | disk pressure doesn't really surface anywhere, so it's kind of sneaky |
unavailable | !ready OR ( !done AND (Degraded OR Unreconcilable) ) | |
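To make the table easier to reason about, here is a minimal Go sketch of the four per-node predicates. It is not the MCO's actual code: the `Node` struct, its field names, and the helper function names are hypothetical simplifications, and the MCD state is modelled as a plain string.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of the per-node facts used in the table.
// The real MCO reads these from node annotations and node conditions.
type Node struct {
	CurrentConfig      string
	DesiredConfig      string
	MCDState           string // assumed values: "Done", "Working", "Degraded", "Unreconcilable"
	NodeReady          bool
	DiskPressure       bool
	NetworkUnavailable bool
	Unschedulable      bool
}

// isDone: currentConfig == desiredConfig AND MCD state is "Done".
func isDone(n Node) bool {
	return n.CurrentConfig == n.DesiredConfig && n.MCDState == "Done"
}

// isUpdated: done AND currentConfig matches the pool's target config.
func isUpdated(n Node, poolConfig string) bool {
	return isDone(n) && n.CurrentConfig == poolConfig
}

// isReady: NodeReady AND no disk pressure AND network available AND schedulable.
func isReady(n Node) bool {
	return n.NodeReady && !n.DiskPressure && !n.NetworkUnavailable && !n.Unschedulable
}

// isUnavailable: !ready OR ( !done AND (Degraded OR Unreconcilable) ).
func isUnavailable(n Node) bool {
	failed := n.MCDState == "Degraded" || n.MCDState == "Unreconcilable"
	return !isReady(n) || (!isDone(n) && failed)
}

func main() {
	// A fully updated node that someone cordoned by hand.
	cordoned := Node{
		CurrentConfig: "rendered-worker-a",
		DesiredConfig: "rendered-worker-a",
		MCDState:      "Done",
		NodeReady:     true,
		Unschedulable: true,
	}
	fmt.Println(isDone(cordoned), isUpdated(cordoned, "rendered-worker-a"),
		isReady(cordoned), isUnavailable(cordoned))
	// prints: true true false true
}
```

The cordoned node is done and updated, yet it counts as not ready and therefore unavailable.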
So if a node becomes "Unschedulable" for any reason (even if the MCO was not the one that cordoned it):
- The MCO regards the node as "unavailable"
- allUpdated is no longer true
- The MCO sets the pool condition of "Updating" to true
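The chain above can be sketched at the pool level. This is a simplified illustration, not the operator's implementation: `poolCounts` and its fields are hypothetical names, and "Updating" is modelled simply as the negation of `allUpdated`, which ignores any other branches in the real status code.

```go
package main

import "fmt"

// poolCounts is a hypothetical holder for the per-pool machine counts.
type poolCounts struct {
	machine     int
	updated     int
	ready       int
	unavailable int
}

// allUpdated mirrors the expression quoted from status.go.
func allUpdated(c poolCounts) bool {
	return c.updated == c.machine && c.ready == c.machine && c.unavailable == 0
}

// updating is an assumed simplification: the pool reports Updating
// whenever allUpdated is false.
func updating(c poolCounts) bool {
	return !allUpdated(c)
}

func main() {
	// Four machines, all on the target config, nothing cordoned.
	healthy := poolCounts{machine: 4, updated: 4, ready: 4, unavailable: 0}

	// Same pool after one node is cordoned by hand: it is still updated,
	// but no longer ready, so it also counts as unavailable.
	oneCordoned := poolCounts{machine: 4, updated: 4, ready: 3, unavailable: 1}

	fmt.Println("healthy:     updating =", updating(healthy))
	fmt.Println("one cordoned: updating =", updating(oneCordoned))
}
```

In the second case every machine already runs the target config, so no update is in progress, yet the pool still reports Updating.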
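This is customer-facing: it shows up in the "oc" output for a machineconfig pool, and "Updating" is also set as a condition on the pool, so we look like we're updating when we're really not. In the example below, the second pool has all 4 machines updated but only 3 ready, and reports UPDATING as True: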
UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
True      False      False      3              3                   3                     0                      3d22h
False     True       False      4              3                   4                     0                      3d22h
We need to find a way to tell the truth.
Goal:
- What should a pool state of "updating" mean in this context?
Should it mean:
- At least one machine-config-daemon is working?
- A new desiredConfig has been applied to at least one node and it hasn't been reconciled?
- There is at least one machine in the pool that hasn't been updated completely yet?
- "Everything isn't done yet, so therefore I should be updating" (vs. "I am updating")
- Other?
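To compare the first three candidate meanings, here is a rough Go sketch that expresses each as a predicate over the nodes in a pool. Everything in it is an assumption made for illustration: the `node` struct, the function names, and the use of "Working" as the MCD state that means a daemon is busy. It is not a proposal for the final design.

```go
package main

import "fmt"

// node is a hypothetical, minimal view of one machine in the pool.
type node struct {
	current  string // currentConfig
	desired  string // desiredConfig
	mcdState string // assumed values: "Done", "Working", "Degraded", "Unreconcilable"
}

// Candidate 1: at least one machine-config-daemon is working.
func anyDaemonWorking(nodes []node) bool {
	for _, n := range nodes {
		if n.mcdState == "Working" {
			return true
		}
	}
	return false
}

// Candidate 2: a new desiredConfig has been applied to at least one node
// and has not been reconciled yet.
func anyDesiredNotReconciled(nodes []node) bool {
	for _, n := range nodes {
		if n.current != n.desired {
			return true
		}
	}
	return false
}

// Candidate 3: at least one machine has not been completely updated
// to the pool's target config.
func anyNotFullyUpdated(nodes []node, poolConfig string) bool {
	for _, n := range nodes {
		done := n.current == n.desired && n.mcdState == "Done"
		if !done || n.current != poolConfig {
			return true
		}
	}
	return false
}

func main() {
	const target = "rendered-worker-b"

	// Every node is on the target config; cordon state is deliberately
	// not part of any of these definitions.
	settled := []node{
		{current: target, desired: target, mcdState: "Done"},
		{current: target, desired: target, mcdState: "Done"},
	}

	// The pool target moved to rendered-worker-b, but no node has been
	// given the new desiredConfig yet.
	pending := []node{
		{current: "rendered-worker-a", desired: "rendered-worker-a", mcdState: "Done"},
	}

	fmt.Println(anyDaemonWorking(settled), anyDesiredNotReconciled(settled), anyNotFullyUpdated(settled, target))
	fmt.Println(anyDaemonWorking(pending), anyDesiredNotReconciled(pending), anyNotFullyUpdated(pending, target))
}
```

Under these assumptions none of the three candidates reports an update for a settled pool, whether or not a node is cordoned. They differ in the `pending` case, where only the third candidate is true.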