Type: Bug
Resolution: Unresolved
Affects: 4.18.z
Component area: Quality / Stability / Reliability
Description of problem:
When a node is marked down for maintenance (for example, via the Node Maintenance Operator), the KubeDaemonSetMisScheduled and KubeDaemonSetRolloutStuck alerts begin firing for every platform "openshift-*" daemonset.
On a cluster with many Red Hat operators installed, this can reach 20 or more alerts of each type (40 in total).
https://github.com/openshift/cluster-monitoring-operator/blob/release-4.18/assets/control-plane/prometheus-rule.yaml#L116
https://github.com/openshift/cluster-monitoring-operator/blob/release-4.18/assets/control-plane/prometheus-rule.yaml#L167
When a node is put into maintenance, the operator adds a new taint:
{"effect":"NoSchedule","key":"medik8s.io/drain"}
The alerting rules should account for a node being in maintenance / cordoned / SchedulingDisabled, even while it is in Ready state.
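One possible shape for such a change, sketched as a PrometheusRule fragment. This is illustrative only, not the shipped rule: the `unless on ()` clause and the reliance on kube-state-metrics' kube_node_spec_unschedulable metric are assumptions about one coarse approach (suppressing the alert while any node is cordoned); the base expression mirrors the upstream KubeDaemonSetMisScheduled rule.

```yaml
# Hypothetical variant: stay silent while any node is cordoned/unschedulable.
- alert: KubeDaemonSetMisScheduled
  expr: |
    kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
    unless on ()
    (count(kube_node_spec_unschedulable{job="kube-state-metrics"} == 1) > 0)
  for: 15m
  labels:
    severity: warning
```

A finer-grained fix would exclude only the pods displaced by cordoned nodes rather than muting the alert cluster-wide, but that requires joining per-pod metrics to per-node state.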
Version-Release number of selected component (if applicable):
4.18.15
How reproducible:
Always
Steps to Reproduce:
1. OpenShift 4.18 Cluster
2. Install Node Maintenance Operator
3. Put a node into maintenance:
apiVersion: nodemaintenance.medik8s.io/v1beta1
kind: NodeMaintenance
metadata:
  name: nodemaintenance-cr
spec:
  nodeName: worker-0.ocp418shared.tamlab.brq2.redhat.com
  reason: "node maint"
Actual results:
Many extra alerts that provide no actionable information.
Expected results:
KubeDaemonSetMisScheduled and KubeDaemonSetRolloutStuck do not fire for daemonset pods displaced by a node that is cordoned / under maintenance.
Additional info: