Bug
Resolution: Obsolete
Normal
4.13
Quality / Stability / Reliability
Description of problem:
A customer has defined an alert on mcd_state in the "Degraded" state and noticed that the alert does not clear.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
I have not been able to reproduce this, but have captured the state in logs and command output.
Steps to Reproduce:
Curling the machine config daemonset pod from a Prometheus instance:
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k https://<$NODEIP>:9001/metrics -H "Authorization: Bearer $TOKEN" | grep mcd_state
# HELP mcd_state state of daemon on specified node
# TYPE mcd_state gauge
mcd_state{reason="",state="Done"} 1.708730448680743e+09
mcd_state{reason="",state="Working"} 1.7087304270475767e+09
mcd_state{reason="failed to drain node: $NODE after 1 hour. Please see machine-config-controller logs for more information",state="Degraded"} 1.7067804848773804e+09
We can see the metric reported three times: once with state "Done", once with "Working", and once with "Degraded", the latter including a reason.
Asking the customer to restart the machine config daemonset pod results in the metric showing just one state:
# HELP mcd_state state of daemon on specified node
# TYPE mcd_state gauge
mcd_state{reason="",state="Done"} 1.7091117838342226e+09
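As an explicit form of the restart workaround above, a sketch assuming the machine-config-daemon pods carry the label `k8s-app=machine-config-daemon` in the `openshift-machine-config-operator` namespace and that `<node>` is the affected node's name (verify both on your cluster before running):

```shell
# Delete the machine-config-daemon pod on the affected node; the daemonset
# recreates it, and the fresh pod exports only the current mcd_state series.
oc -n openshift-machine-config-operator delete pod \
  -l k8s-app=machine-config-daemon \
  --field-selector "spec.nodeName=<node>"
```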
Actual results:
An alert configured against `query: mcd_state{state="Degraded"}` never clears.
Expected results:
The metric reports a single, well-defined state per node; stale series such as an old "Degraded" sample are removed when the state changes, so the alert can clear.
Additional info: