-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
4.18
-
None
-
Moderate
-
None
-
False
-
Description of problem
Since machine-api-operator#986 introduced (name, namespace) aggregation, $labels.phase substitution in the summary has stopped working, and admins just see a message like:
machine build09-... is in phase:
when we want them to see:
machine build09-... is in phase: Provisioning
or similar.
Version-Release number of selected component
I haven't tracked machine-api-operator#986 down to a 4.y, but it's a 2022 pull request, so likely a bunch. Not sure a fix is worth backporting or not.
How reproducible
Every time.
Steps to Reproduce
1. Break machine provisioning, maybe by creating a Machine(Set) with a garbage instance type?
2. See the MachineWithNoRunningPhase alert pending or firing.
3. Check the summary on that alert.
Actual results
machine build09-... is in phase:
without a phase listed.
Expected results
machine ... is in phase: Provisioning
or similar, mentioning the actual phase.
Additional info
You can probably just include phase in the list of labels that you preserve in the sum by ....
You may also want to add $labels.namespace to the summary or description to help admins locate the troubled machine. You can also use with subqueries like the CVO does to give cluster admins oc and web-console hints about how to dig in.
You may also want to add runbook_url pointing at your alert docs, (which you could also move to the runbook repo), if you wanted to make that more accessible.
No need to boil the ocean here, I'm just floating some additional polish in case it's easier to handle more things in one pull. My personal definition of done for this ticket is limited to populating the phase in the summary.