-
Story
-
Resolution: Done
In newer versions of OCP, we changed our draining mechanism to fail only after 1 hour. As a result, the event that captures the failing drain is now emitted only at the 1-hour mark.
Today, upgrade tests often fail with timeouts related to drain errors (PDB or other). There is no good way to tell which pods are failing to drain and for what reason, so we cannot easily aggregate this data in CI to tackle PDB-related issues and improve upgrade and CI pass rates.
If the MCD, upon a drain run failure, emits the failing pod and reason (PDB, timeout) as an event, it would be easier to write a test to aggregate this data.
Context in this thread: https://coreos.slack.com/archives/C01CQA76KMX/p1633635861184300
The drain failure event is currently emitted by the MCO only once the 1-hour timeout has been reached. deads@redhat.com If I understand correctly, the TRT team needs earlier feedback, i.e. an event emitted whenever a drain fails and the MCO retries? And is this mainly useful for control plane nodes?