Machine Config Operator / MCO-710

RFE: Alert coverage for surprisingly-long desired!=current duration for a given node


    • Type: Story
    • Resolution: Unresolved

      The machine-config component occasionally rolls out node updates. Since 4.11's mco#3135 moved drain logic to the machine-config controller (MCC), the flow is basically as follows (a small Go sketch of the annotation handshake follows the list):

      1. MCC picks a node to drain, and bumps machineconfiguration.openshift.io/desiredConfig to point at the new target.
      2. Machine-config daemon (MCD) on that node notices, and decides whether a drain is needed.
        1. If a drain is needed, the MCD sets machineconfiguration.openshift.io/desiredDrain to request a drain.
        2. The MCC performs the drain and sets machineconfiguration.openshift.io/lastAppliedDrain to declare success.
        3. The MCD notices drain success.
      3. The MCD deploys the update, possibly by initiating a reboot.
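
      For concreteness, here is a minimal Go sketch of that annotation handshake from a reader's point of view. The machineconfiguration.openshift.io annotation keys are the ones named in the flow above (plus currentConfig); the package and helper names (nodeUpdatePending, drainAcknowledged) are hypothetical illustrations, not existing MCO code.

          package handshake

          import corev1 "k8s.io/api/core/v1"

          // Annotation keys from the flow above.
          const (
              currentConfigAnnotation    = "machineconfiguration.openshift.io/currentConfig"
              desiredConfigAnnotation    = "machineconfiguration.openshift.io/desiredConfig"
              desiredDrainAnnotation     = "machineconfiguration.openshift.io/desiredDrain"
              lastAppliedDrainAnnotation = "machineconfiguration.openshift.io/lastAppliedDrain"
          )

          // nodeUpdatePending is true from the moment the MCC bumps desiredConfig
          // (step 1) until the MCD finishes deploying the update and currentConfig
          // catches up (after step 3).
          func nodeUpdatePending(node *corev1.Node) bool {
              a := node.Annotations
              return a[desiredConfigAnnotation] != "" &&
                  a[desiredConfigAnnotation] != a[currentConfigAnnotation]
          }

          // drainAcknowledged is true once the MCC has completed the drain the MCD
          // requested (steps 2.1 through 2.3).
          func drainAcknowledged(node *corev1.Node) bool {
              a := node.Annotations
              return a[desiredDrainAnnotation] != "" &&
                  a[desiredDrainAnnotation] == a[lastAppliedDrainAnnotation]
          }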

      There has been drain-is-too-slow alert coverage for a long time, originally via MCDDrainErr, and since 4.13's mco#3424 via MCCDrainErr. But that only covers node updates that are slow in step 2.2. If there is a hung MCD, e.g. because of an rpm-ostree bug like OCPBUGS-2866, it's possible that we could hang before step 2, with the hung MCD oblivious to the fact that an update had been requested, and therefore never asking the MCC to drain the node. Or we could hang in step 3, e.g. if we crash trying to boot into the incoming operating system.

      I think we want some MCC-side coverage for desiredConfig != currentConfig lasting >= someDuration, so that alarms go off in these non-drain cases too, something like Deployment's progressDeadlineSeconds. Warning alerts should fire, although I'm agnostic about whether that's a new machine-config alert, or whether it bubbles up through MachineConfigPool > ClusterOperator Degraded=True to trigger ClusterOperatorDegraded. The cook-time should probably be longer than MCCDrainErr's hour, so that folks responding to the higher-level alert already have that low-level hint in place (for node updates where a slow drain is the actual sticking point).

      A possible implementation if you go the new-machine-config alert route would be to have the MCC serve a new machine_config_node_updating (or some such) metric whose value is the timestamp of the most recent desiredConfig bump, and then remove the metric when currentConfig catches up. Then the new alert could trigger if time() - machine_config_node_updating > 7200 or some such, if you wanted to set the threshold duration at 2h.
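
      If you did go that route, a minimal sketch with prometheus/client_golang could look like the following; the metric name machine_config_node_updating comes from the paragraph above, while the package layout and the recordDesiredConfigBump/clearNodeUpdating helpers are hypothetical, not existing MCO code.

          package metrics

          import (
              "time"

              "github.com/prometheus/client_golang/prometheus"
          )

          // nodeUpdating holds, per node, the Unix timestamp of the most recent
          // desiredConfig bump, and is deleted once currentConfig catches up, so
          // the series only exists while an update is in flight.
          var nodeUpdating = prometheus.NewGaugeVec(
              prometheus.GaugeOpts{
                  Name: "machine_config_node_updating",
                  Help: "Unix timestamp of the most recent desiredConfig bump for a node whose currentConfig has not yet caught up.",
              },
              []string{"node"},
          )

          func init() {
              prometheus.MustRegister(nodeUpdating)
          }

          // recordDesiredConfigBump would be called by the MCC when it bumps a
          // node's desiredConfig annotation (step 1 of the flow above).
          func recordDesiredConfigBump(nodeName string) {
              nodeUpdating.WithLabelValues(nodeName).Set(float64(time.Now().Unix()))
          }

          // clearNodeUpdating would be called once the MCC observes
          // currentConfig == desiredConfig for the node.
          func clearNodeUpdating(nodeName string) {
              nodeUpdating.DeleteLabelValues(nodeName)
          }

      A warning rule with an expression along the lines of time() - machine_config_node_updating > 7200 would then fire only for nodes that have sat with desiredConfig != currentConfig for longer than the chosen cook-time, regardless of whether the hang is before the drain request, during the drain itself, or in the post-drain deploy/reboot.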

            Assignee: Unassigned
            Reporter: W. Trevor King (trking)