- Bug
- Resolution: Done
- 4.13
- Quality / Stability / Reliability
- Moderate
- MCO Sprint 230, MCO Sprint 231
- 2
- Done
- Bug Fix
Description of problem:
When an MCCDrainError alert fires, the alert's message names the wrong node as the one where the drain problem is happening.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2022-12-22-120609   True        False         4h59m   Cluster version is 4.13.0-0.nightly-2022-12-22-120609
How reproducible:
Always
Steps to Reproduce:
1. Create a PodDisruptionBudget resource
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict
2. Create a pod matching the PodDisruptionBudget
$ oc run --restart=Never --labels app=dontevict --image=docker.io/busybox dont-evict-this-pod -- sleep 3h
3. Create a MC
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        mode: 420
        path: /etc/test
4. Wait 1 hour for the MCCDrainError alert to be triggered
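As a side-check (not a required reproduction step), the base64 Ignition payload in the MachineConfig above can be decoded to see what would be written to /etc/test; the file contents are arbitrary, since any worker MachineConfig change triggers the drain that the PodDisruptionBudget then blocks:

```shell
# Decode the MachineConfig file contents shown in step 3
echo 'c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK' | base64 -d
```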
Actual results:
The firing alert looks like this:
$ curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" https://$(oc get route -n openshift-monitoring alertmanager-main -o jsonpath={.spec.host})/api/v1/alerts | jq
.....
{
  "activeAt": "2022-12-23T11:24:05.807925776Z",
  "annotations": {
    "message": "Drain failed on ip-10-0-193-114.us-east-2.compute.internal , updates may be blocked. For more details check MachineConfigController pod logs: oc logs -f -n openshift-machine-config-operator machine-config-controller-xxxxx -c machine-config-controller"
  },
  "labels": {
    "alertname": "MCCDrainError",
    "container": "oauth-proxy",
    "endpoint": "metrics",
    "instance": "10.130.0.10:9001",
    "job": "machine-config-controller",
    "namespace": "openshift-machine-config-operator",
    "node": "ip-10-0-193-114.us-east-2.compute.internal",
    "pod": "machine-config-controller-5468769874-44tnt",
    "service": "machine-config-controller",
    "severity": "warning"
  },
  "state": "firing",
  "value": "1e+00"
}
The alert message is wrong: the node reported in "Drain failed on ip-10-0-193-114.us-east-2.compute.internal , updates may....." is not the node where the drain problem happened, but the node running the machine-config-controller pod.
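The mismatch is visible in the alert JSON itself: the "node" label matches the node hosting the controller pod rather than the drained node (in OpenShift monitoring that label typically comes from the Prometheus scrape target, not from the drain event). A minimal jq sketch over the labels shown above (the file path /tmp/mcc_alert.json is just for illustration):

```shell
# Reduce the example alert to the two labels that reveal the mismatch
cat > /tmp/mcc_alert.json <<'EOF'
{
  "labels": {
    "alertname": "MCCDrainError",
    "node": "ip-10-0-193-114.us-east-2.compute.internal",
    "pod": "machine-config-controller-5468769874-44tnt"
  }
}
EOF
jq -r '"alert \(.labels.alertname) blames node \(.labels.node), emitted by pod \(.labels.pod)"' /tmp/mcc_alert.json
```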
Expected results:
The alert message should name the node where the drain actually failed; pointing to the wrong node can mislead the user.
Additional info:
- is related to: MCO-420 Move MCD drain alert into the MCC, revisit error modes (Closed)