  Machine Config Operator / MCO-81

MCD: emit earlier events to warn about failing drains

    • Type: Story
    • Resolution: Done
    • Priority: Undefined
    • None
    • None
    • Sprints: MCO Sprint 263 (DevEx), MCO Sprint 264
    • 0.000

      In newer versions of OCP, we changed our draining mechanism to only fail after 1 hour. This also means that the event which captures the failing drain is now only emitted at that 1-hour mark, when the drain is declared failed.
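
      For illustration only, a minimal sketch of that current behavior (the function names, event reason, and retry interval are assumptions, not the actual MCD code): the drain is retried until a 1-hour deadline, and only when the deadline expires is a failure event recorded.

        package drainsketch

        import (
            "fmt"
            "time"
        )

        const drainTimeout = 1 * time.Hour

        // drainWithTimeout retries the drain until it succeeds or the 1-hour
        // deadline passes; note that the failure event is only emitted at the
        // deadline, never on the intermediate failed attempts.
        func drainWithTimeout(drainNode func() error, emitEvent func(reason, msg string)) error {
            start := time.Now()
            for {
                err := drainNode()
                if err == nil {
                    return nil // drain succeeded
                }
                if time.Since(start) > drainTimeout {
                    emitEvent("DrainFailed", fmt.Sprintf("failed to drain node after %s: %v", drainTimeout, err))
                    return err
                }
                time.Sleep(time.Minute) // retry until the deadline
            }
        }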

       

      Today, upgrade tests often fail with timeouts related to drain errors (PDBs or other causes). There is currently no good way to distinguish which pods are failing to drain and why, so we cannot easily aggregate this data in CI to tackle PDB-related issues and improve upgrade and CI pass rates.

       

      If the MCD, upon each failed drain attempt, emitted the failing pod and the reason (PDB, timeout) as an event, it would be easier to write a test that aggregates this data.
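
      As a rough sketch of that proposal (assuming client-go's event recorder; the helper name, event reasons, and PDB check are illustrative, not the MCD's actual API): on every failed drain attempt the daemon would record a Warning event on the node naming the blocking pod and a coarse reason.

        package drainsketch

        import (
            corev1 "k8s.io/api/core/v1"
            apierrors "k8s.io/apimachinery/pkg/api/errors"
            "k8s.io/client-go/tools/record"
        )

        // emitDrainFailure records a Warning event on the node for a single failed
        // drain attempt, naming the blocking pod so CI can aggregate failures
        // without scraping MCD logs.
        func emitDrainFailure(recorder record.EventRecorder, node *corev1.Node, pod *corev1.Pod, evictErr error) {
            reason := "DrainError"
            if apierrors.IsTooManyRequests(evictErr) {
                // The eviction API answers 429 when a PodDisruptionBudget blocks the eviction.
                reason = "DrainBlockedByPDB"
            }
            recorder.Eventf(node, corev1.EventTypeWarning, reason,
                "drain blocked by pod %s/%s: %v", pod.Namespace, pod.Name, evictErr)
        }

      A CI test could then watch for these per-attempt reasons during an upgrade run instead of waiting for the 1-hour failure.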

       

      Context in this thread: https://coreos.slack.com/archives/C01CQA76KMX/p1633635861184300 

            [MCO-81] MCD: emit earlier events to warn about failing drains


            Sinny Kumari added a comment - The drain failure event is currently emitted by the MCO only once the 1 hr timeout has been reached. deads@redhat.com If I understand correctly, the TRT team needs earlier feedback (an emitted event), i.e. whenever a drain fails and the MCO retries, we send an event about the drain failure? And is this useful mainly for control plane nodes?

            David Eads added a comment -

            TRT is working on making Azure upgrade tests blocking.  There is currently a problem on about 20% of runs that suggests that pods are failing to be scheduled at the same time that a node is failing to drain.

            In order to chase this problem, identification of cases is the first step.  Normally in CI, we do this by producing events during the run, reading those events after the fact, and producing a detailed test failure.  This approach allows our steps to be replicated by customers, CEE, and SRE when debugging similar issues in the field.

            Digging through logs isn't scalable enough to identify impact.

             

            cc rhn-engineering-dgoodwin
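
            As an illustration of that events-based approach, a minimal sketch (assuming client-go's clientset API and the hypothetical "Drain*" event reasons from the earlier sketch): after the run, list events and aggregate drain-failure messages so a test can fail with a detailed summary instead of requiring log digging.

              package drainsketch

              import (
                  "context"
                  "fmt"
                  "strings"

                  corev1 "k8s.io/api/core/v1"
                  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
                  "k8s.io/client-go/kubernetes"
              )

              // summarizeDrainFailures lists events across all namespaces and counts
              // drain-related Warning events, keyed by "reason: message", so a CI test
              // can print the summary when it fails.
              func summarizeDrainFailures(ctx context.Context, client kubernetes.Interface) (map[string]int, error) {
                  events, err := client.CoreV1().Events(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
                  if err != nil {
                      return nil, fmt.Errorf("listing events: %w", err)
                  }
                  summary := map[string]int{}
                  for _, ev := range events.Items {
                      if ev.Type == corev1.EventTypeWarning && strings.HasPrefix(ev.Reason, "Drain") {
                          summary[fmt.Sprintf("%s: %s", ev.Reason, ev.Message)]++
                      }
                  }
                  return summary, nil
              }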


              Assignee: djoshy (David Joshy)
              Reporter: jerzhang@redhat.com (Yu Qi Zhang)
              Votes: 0
              Watchers: 4

                Created:
                Updated:
                Resolved: