- Bug
- Resolution: Done
- Undefined
- None
- 4.13
- None
- Moderate
- None
- MCO Sprint 230, MCO Sprint 231
- 2
- False
- N/A
- Bug Fix
- Done
Description of problem:
When an MCCDrainError alert is triggered, the alert's message reports the drain problem as happening on the wrong node.
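A quick way to see the mismatch (a sketch; the grep-based pod selection is an assumption rather than the pod's exact labels):

# Show which node the machine-config-controller pod itself runs on. In this
# reproducer that is the node the alert message names, even though a different
# worker node is the one that actually fails to drain.
$ oc -n openshift-machine-config-operator get pods -o wide | grep machine-config-controller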
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2022-12-22-120609   True        False         4h59m   Cluster version is 4.13.0-0.nightly-2022-12-22-120609
How reproducible:
Always
Steps to Reproduce:
1. Create a PodDisruptionBudget resource

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict

2. Create a pod matching the PodDisruptionBudget

$ oc run --restart=Never --labels app=dontevict --image=docker.io/busybox dont-evict-this-pod -- sleep 3h

3. Create a MC

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        mode: 420
        path: /etc/test

4. Wait 1 hour for the MCCDrainError alert to be triggered
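While waiting, the drain failure can be followed in the controller logs (a sketch; the deployment name is inferred from the pod name that appears in the alert below):

# Tail the machine-config-controller logs and watch for the eviction being
# blocked by the PodDisruptionBudget; these log lines name the node that is
# actually being drained.
$ oc -n openshift-machine-config-operator logs -f deployment/machine-config-controller \
    -c machine-config-controller | grep -i drain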
Actual results:
The alert is like:

$ curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" https://$(oc get route -n openshift-monitoring alertmanager-main -o jsonpath={.spec.host})/api/v1/alerts | jq
.....
{
  "activeAt": "2022-12-23T11:24:05.807925776Z",
  "annotations": {
    "message": "Drain failed on ip-10-0-193-114.us-east-2.compute.internal , updates may be blocked. For more details check MachineConfigController pod logs: oc logs -f -n openshift-machine-config-operator machine-config-controller-xxxxx -c machine-config-controller"
  },
  "labels": {
    "alertname": "MCCDrainError",
    "container": "oauth-proxy",
    "endpoint": "metrics",
    "instance": "10.130.0.10:9001",
    "job": "machine-config-controller",
    "namespace": "openshift-machine-config-operator",
    "node": "ip-10-0-193-114.us-east-2.compute.internal",
    "pod": "machine-config-controller-5468769874-44tnt",
    "service": "machine-config-controller",
    "severity": "warning"
  },
  "state": "firing",
  "value": "1e+00"
}

The alert message is wrong: the node reported in "Drain failed on ip-10-0-193-114.us-east-2.compute.internal , updates may....." is not the node where the drain problem happened, but the node running the controller pod.
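To isolate just this alert and compare its labels with the node that is actually stuck draining, something like the following can be used (a sketch; it reuses the route and token from the command above and assumes the API response lists alerts under .data):

# Print only the MCCDrainError alert's node label next to its message
# annotation, so the mismatch is easy to spot.
$ curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" \
    "https://$(oc get route -n openshift-monitoring alertmanager-main -o jsonpath={.spec.host})/api/v1/alerts" \
  | jq '.data[]? | select(.labels.alertname=="MCCDrainError") | {node: .labels.node, message: .annotations.message}'

# The node that really failed to drain is the one left cordoned (SchedulingDisabled).
$ oc get nodes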
Expected results:
The alert message should point to the node where the drain actually failed; naming the wrong node can mislead the user.
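One possible shape for the fix is to have the controller expose the failing node as a label on the metric that drives the alert, so the annotation can template that label instead of the scrape target's node. The rule sketch below is illustrative only; the metric name mcc_drain_err and the drained_node label are assumptions, not the actual implementation:

# Hypothetical alerting-rule sketch: the message interpolates a label the
# machine-config-controller sets to the node it failed to drain, rather than
# the `node` label Prometheus attaches from the scrape target.
- alert: MCCDrainError
  expr: mcc_drain_err > 0
  labels:
    severity: warning
  annotations:
    message: >-
      Drain failed on {{ $labels.drained_node }}, updates may be blocked.
      For more details check MachineConfigController pod logs:
      oc logs -f -n openshift-machine-config-operator machine-config-controller-xxxxx -c machine-config-controller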
Additional info:
- is related to: MCO-420 Move MCD drain alert into the MCC, revisit error modes (Closed)
- links to