  Machine Config Operator / MCO-81

MCD: emit earlier events to warn about failing drains

    • Type: Story
    • Resolution: Done
    • Priority: Undefined
    • None
    • None
    • Sprints: MCO Sprint 263 (DevEx), MCO Sprint 264
    • 0.000

      In newer versions of OCP, we changed our draining mechanism to only fail after 1 hour. This also means that the event which captures the failing drain is now only emitted at that 1-hour mark, when the drain is declared failed.
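
      For illustration only, a minimal sketch of that current behavior (the function names, event reason, and retry interval are assumptions, not the actual MCD code): the drain is retried until a 1-hour deadline, and only when the deadline expires is a failure event recorded.

        package drainsketch

        import (
            "fmt"
            "time"
        )

        const drainTimeout = 1 * time.Hour

        // drainWithTimeout retries the drain until it succeeds or the 1-hour
        // deadline passes; note that the failure event is only emitted at the
        // deadline, never on the intermediate failed attempts.
        func drainWithTimeout(drainNode func() error, emitEvent func(reason, msg string)) error {
            start := time.Now()
            for {
                err := drainNode()
                if err == nil {
                    return nil // drain succeeded
                }
                if time.Since(start) > drainTimeout {
                    emitEvent("DrainFailed", fmt.Sprintf("failed to drain node after %s: %v", drainTimeout, err))
                    return err
                }
                time.Sleep(time.Minute) // retry until the deadline
            }
        }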

       

      Today, upgrade tests often fail with timeouts related to drain errors (PDBs or other causes). There is currently no good way to distinguish which pods are failing to drain and why, so we cannot easily aggregate this data in CI to tackle PDB-related issues and improve upgrade and CI pass rates.

       

      If the MCD, upon each failed drain attempt, emitted the failing pod and the reason (PDB, timeout) as an event, it would be easier to write a test that aggregates this data.
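
      As a rough sketch of that proposal (assuming client-go's event recorder; the helper name, event reasons, and PDB check are illustrative, not the MCD's actual API): on every failed drain attempt the daemon would record a Warning event on the node naming the blocking pod and a coarse reason.

        package drainsketch

        import (
            corev1 "k8s.io/api/core/v1"
            apierrors "k8s.io/apimachinery/pkg/api/errors"
            "k8s.io/client-go/tools/record"
        )

        // emitDrainFailure records a Warning event on the node for a single failed
        // drain attempt, naming the blocking pod so CI can aggregate failures
        // without scraping MCD logs.
        func emitDrainFailure(recorder record.EventRecorder, node *corev1.Node, pod *corev1.Pod, evictErr error) {
            reason := "DrainError"
            if apierrors.IsTooManyRequests(evictErr) {
                // The eviction API answers 429 when a PodDisruptionBudget blocks the eviction.
                reason = "DrainBlockedByPDB"
            }
            recorder.Eventf(node, corev1.EventTypeWarning, reason,
                "drain blocked by pod %s/%s: %v", pod.Namespace, pod.Name, evictErr)
        }

      A CI test could then watch for these per-attempt reasons during an upgrade run instead of waiting for the 1-hour failure.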

       

      Context in this thread: https://coreos.slack.com/archives/C01CQA76KMX/p1633635861184300 

            [MCO-81] MCD: emit earlier events to warn about failing drains


            Sinny Kumari added a comment - The drain failure event is currently emitted by the MCO only once the 1 hr timeout has been reached. deads@redhat.com If I understand correctly, the TRT team needs earlier feedback (an emitted event), i.e. whenever a drain fails and the MCO retries, we send an event about the drain failure? And is this useful mainly for control plane nodes?

            David Eads added a comment -

            TRT is working on making Azure upgrade tests blocking.  There is currently a problem on about 20% of runs that suggests that pods are failing to be scheduled at the same time that a node is failing to drain.

            In order to chase this problem, identification of cases is the first step.  Normally in CI, we do this by producing events during the run, reading those events after the fact, and producing a detailed test failure.  This approach allows our steps to be replicated by customers, CEE, and SRE when debugging similar issues in the field.

            Digging through logs isn't scalable enough to identify impact.

             

            cc rhn-engineering-dgoodwin
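
            As an illustration of that events-based approach, a minimal sketch (assuming client-go's clientset API and the hypothetical "Drain*" event reasons from the earlier sketch): after the run, list events and aggregate drain-failure messages so a test can fail with a detailed summary instead of requiring log digging.

              package drainsketch

              import (
                  "context"
                  "fmt"
                  "strings"

                  corev1 "k8s.io/api/core/v1"
                  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
                  "k8s.io/client-go/kubernetes"
              )

              // summarizeDrainFailures lists events across all namespaces and counts
              // drain-related Warning events, keyed by "reason: message", so a CI test
              // can print the summary when it fails.
              func summarizeDrainFailures(ctx context.Context, client kubernetes.Interface) (map[string]int, error) {
                  events, err := client.CoreV1().Events(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
                  if err != nil {
                      return nil, fmt.Errorf("listing events: %w", err)
                  }
                  summary := map[string]int{}
                  for _, ev := range events.Items {
                      if ev.Type == corev1.EventTypeWarning && strings.HasPrefix(ev.Reason, "Drain") {
                          summary[fmt.Sprintf("%s: %s", ev.Reason, ev.Message)]++
                      }
                  }
                  return summary, nil
              }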


              Assignee: djoshy (David Joshy)
              Reporter: jerzhang@redhat.com (Yu Qi Zhang)
              Votes: 0
              Watchers: 4

                Created:
                Updated:
                Resolved: