Machine Config Operator / MCO-710

RFE: Alert coverage for surprisingly-long desired!=current duration for a given node

    • Type: Story
    • Resolution: Unresolved

      The machine-config component occasionally rolls out node updates. Since 4.11's mco#3135 moved drain logic to the machine-config controller (MCC), the flow is basically:

      1. MCC picks a node to update, and bumps machineconfiguration.openshift.io/desiredConfig to point at the new target.
      2. Machine-config daemon (MCD) on that node notices, and decides whether a drain is needed.
        1. If a drain is needed, the MCD sets machineconfiguration.openshift.io/desiredDrain to request a drain.
        2. The MCC performs the drain and sets machineconfiguration.openshift.io/lastAppliedDrain to declare success.
        3. The MCD notices drain success.
      3. The MCD deploys the update, possibly via initiating a reboot.
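
The flow above can be sketched as a small classifier over a node's annotations. This is only an illustration, not MCO code: update_phase and its return labels are hypothetical, the annotation values are compared as opaque strings, and it assumes currentConfig lives under the same machineconfiguration.openshift.io/ prefix as the other annotations.

```python
# Hypothetical sketch: classify where a node sits in the update flow,
# using only the annotations named in this ticket.
PREFIX = "machineconfiguration.openshift.io/"


def update_phase(annotations):
    """Return a coarse, made-up label for a node's update state."""
    desired = annotations.get(PREFIX + "desiredConfig")
    current = annotations.get(PREFIX + "currentConfig")
    if desired == current:
        return "up-to-date"

    desired_drain = annotations.get(PREFIX + "desiredDrain")
    last_drain = annotations.get(PREFIX + "lastAppliedDrain")
    if desired_drain != last_drain:
        # Steps 2.1-2.2: the MCD asked for a drain the MCC has not finished.
        return "drain-pending"

    # Either the MCD has not reacted yet (step 2), no drain was needed,
    # or the drain is done and the MCD is deploying (step 3). These
    # annotations alone cannot tell those cases apart.
    return "waiting-on-mcd"
```

Only the "drain-pending" state is tied to the drain itself; "waiting-on-mcd" lumps together every state in which the node is mid-update but no drain is outstanding.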

      There has been drain-is-too-slow alert coverage for a long time, originally via MCDDrainErr and, since 4.13's mco#3424, via MCCDrainErr. But that only covers node updates that are slow in step 2.2. If there is a hung MCD, e.g. because of an rpm-ostree bug like OCPBUGS-2866, we could hang before step 2, with the hung MCD oblivious to the fact that an update had been requested, and therefore never asking the MCC to drain the node. Or we could hang in step 3, e.g. if we crash trying to boot into the incoming operating system.

      I think we want some MCC-side coverage for desiredConfig != currentConfig for >= someDuration so that alarms go off in these non-drain cases too, something like Deployment's progressDeadlineSeconds. Warning alerts should fire, although I'm agnostic about whether that happens via a new machine-config alert or via the condition bubbling up through MachineConfigPool to ClusterOperator Degraded=True to trigger ClusterOperatorDegraded. The cook-time should probably be longer than MCCDrainErr's hour, so that folks responding to the high-level error already have that low-level hint in place (for node updates where slow drains are the current sticking point).

      A possible implementation if you go the new-machine-config alert route would be to have the MCC serve a new machine_config_node_updating (or some such) metric whose value is the timestamp of the most recent desiredConfig bump, and then remove the metric when currentConfig catches up. Then the new alert could trigger if time() - machine_config_node_updating > 7200 or some such, if you wanted to set the threshold duration at 2h.
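
A rough sketch of that metric lifecycle follows, using a plain dict as a stand-in for a real metrics registry. NodeUpdatingMetric, observe, and firing are hypothetical names for illustration; the real MCC is a Go controller and would expose the value through its existing metrics plumbing, with the comparison living in an alerting rule rather than in code.

```python
import time


class NodeUpdatingMetric:
    """Stand-in for the proposed machine_config_node_updating metric."""

    def __init__(self):
        # node name -> (desiredConfig value, timestamp of the bump)
        self._samples = {}

    def observe(self, node, desired, current, now=None):
        """Record one reconcile pass over a node's annotations."""
        now = time.time() if now is None else now
        if desired == current:
            # currentConfig caught up: remove the metric for this node.
            self._samples.pop(node, None)
            return
        previous = self._samples.get(node)
        if previous is None or previous[0] != desired:
            # New desiredConfig bump: (re)start the clock.
            self._samples[node] = (desired, now)

    def firing(self, now=None, threshold=7200):
        """Mimic time() - machine_config_node_updating > threshold."""
        now = time.time() if now is None else now
        return sorted(
            node
            for node, (_, bumped_at) in self._samples.items()
            if now - bumped_at > threshold
        )
```

Repeated observations of the same pending desiredConfig leave the timestamp alone, so the clock measures time since the bump rather than time since the last reconcile. A second bump before the node catches up restarts the clock, which is a design choice in this sketch and may or may not be what the alert wants.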

              Unassigned
              W. Trevor King (trking)