Type: Bug
Resolution: Done-Errata
Priority: Normal
Affects Version: 4.14
Severity: Moderate
Sprints: MCO Sprint 237, MCO Sprint 238
Description of problem:
When an MCCPoolAlert alert is fired and the problem that caused it is fixed, the alert is not removed.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-06-06-212044   True        False         114m    Cluster version is 4.14.0-0.nightly-2023-06-06-212044
How reproducible:
Always
Steps to Reproduce:
1. Create a custom MCP:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [master,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""

2. Label a master node so that it is included in the new custom MCP:

$ oc label node $(oc get nodes -l node-role.kubernetes.io/master -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/infra=""

3. Verify that the alert is fired:

alias thanosalerts='curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" https://$(oc get route -n openshift-monitoring thanos-querier -o jsonpath={.spec.host})/api/v1/alerts | jq'

$ thanosalerts | grep alertname
....
"alertname": "MCCPoolAlert",

4. Remove the label from the node to fix the problem:

$ oc label node $(oc get nodes -l node-role.kubernetes.io/master -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/infra-
Actual results:
The alert is not removed. Looking at the mcc_pool_alert metric, we find two series for the same pool that differ only in their "alert" label:

alias thanosquery='function __lgb() { unset -f __lgb; oc rsh -n openshift-monitoring prometheus-k8s-0 curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" --data-urlencode "query=$1" https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query | jq -c | jq; }; __lgb'

$ thanosquery mcc_pool_alert
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "mcc_pool_alert",
          "alert": "Applying custom label for pool",
          "container": "oauth-proxy",
          "endpoint": "metrics",
          "instance": "10.130.0.86:9001",
          "job": "machine-config-controller",
          "namespace": "openshift-machine-config-operator",
          "node": "ip-10-0-129-20.us-east-2.compute.internal",
          "pod": "machine-config-controller-76dbddff49-75ggr",
          "pool": "infra",
          "prometheus": "openshift-monitoring/k8s",
          "service": "machine-config-controller"
        },
        "value": [1686137977.158, "0"]
      },
      {
        "metric": {
          "__name__": "mcc_pool_alert",
          "alert": "Given both master and custom pools. Defaulting to master: custom infra",
          "container": "oauth-proxy",
          "endpoint": "metrics",
          "instance": "10.130.0.86:9001",
          "job": "machine-config-controller",
          "namespace": "openshift-machine-config-operator",
          "node": "ip-10-0-129-20.us-east-2.compute.internal",
          "pod": "machine-config-controller-76dbddff49-75ggr",
          "pool": "infra",
          "prometheus": "openshift-monitoring/k8s",
          "service": "machine-config-controller"
        },
        "value": [1686137977.158, "1"]
      }
    ]
  }
}
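The two series above differ only in the "alert" label, which suggests the controller creates a new label combination each time the pool state changes and only zeroes, rather than deletes, the previous one. A minimal, self-contained Go sketch (this is not MCO code; the map is a stand-in for a Prometheus GaugeVec keyed by the "alert" label) illustrates why a stale series can keep the alert firing:

```go
package main

import "fmt"

// gauge stands in for one pool's mcc_pool_alert series,
// keyed by the value of the "alert" label.
type gauge map[string]float64

// fire records an alert state. Like setting a GaugeVec with a new label
// value, it creates a series on first use but never removes old ones.
func fire(g gauge, alertLabel string, v float64) {
	g[alertLabel] = v
}

func main() {
	g := gauge{}
	fire(g, "Applying custom label for pool", 1)
	// Pool state changes: the old combination is only zeroed,
	// while a new combination is set to 1.
	fire(g, "Applying custom label for pool", 0)
	fire(g, "Given both master and custom pools. Defaulting to master: custom infra", 1)

	// Both series still exist, so any rule matching a mcc_pool_alert
	// series with value 1 keeps firing even after later state changes.
	fmt.Println(len(g)) // 2
	fmt.Println(g["Applying custom label for pool"])
	fmt.Println(g["Given both master and custom pools. Defaulting to master: custom infra"])
}
```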
Expected results:
The alert should be removed.
Additional info:
If we delete the MCO controller pod, the replacement pod generates the mcc_pool_alert metric with the right value and the stale series are removed. After this workaround the alert is resolved.
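The workaround works because deleting the pod discards the controller's in-memory metric state, so only the current condition is re-registered on restart. A durable fix would be for the controller to drop all existing mcc_pool_alert series for a pool before setting the new one (client_golang exposes this kind of cleanup as GaugeVec.Delete / DeletePartialMatch). A hedged sketch of that pattern, reusing the same map stand-in rather than the real Prometheus API:

```go
package main

import "fmt"

// gauge stands in for one pool's mcc_pool_alert series,
// keyed by the value of the "alert" label.
type gauge map[string]float64

// clearPool removes every series for the pool, so that no stale
// label combination survives a state change.
func clearPool(g gauge) {
	for k := range g {
		delete(g, k)
	}
}

// setExclusive replaces all series for the pool with a single new one,
// analogous to DeletePartialMatch on the pool label followed by Set(1).
func setExclusive(g gauge, alertLabel string, v float64) {
	clearPool(g)
	g[alertLabel] = v
}

func main() {
	g := gauge{}
	setExclusive(g, "Applying custom label for pool", 1)
	setExclusive(g, "Given both master and custom pools. Defaulting to master: custom infra", 1)
	fmt.Println(len(g)) // 1: the old combination was removed

	// Once the underlying problem is fixed, delete the pool's series
	// entirely so the alert resolves.
	clearPool(g)
	fmt.Println(len(g)) // 0
}
```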
Links to: RHSA-2023:5006 OpenShift Container Platform 4.14.z security update