Bug
Resolution: Done-Errata
Normal
4.14
Quality / Stability / Reliability
Moderate
MCO Sprint 237, MCO Sprint 238
Description of problem:
When an MCCPoolAlert is fired and the problem that caused it is fixed, the alert is not removed.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-06-06-212044   True        False         114m    Cluster version is 4.14.0-0.nightly-2023-06-06-212044
How reproducible:
Always
Steps to Reproduce:
1. Create a custom MCP
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [master,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""
2. Label a master node so that it is included in the new custom MCP
$ oc label node $(oc get nodes -l node-role.kubernetes.io/master -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/infra=""
3. Verify that the alert is fired
alias thanosalerts='curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" https://$(oc get route -n openshift-monitoring thanos-querier -o jsonpath={.spec.host})/api/v1/alerts | jq '
$ thanosalerts |grep alertname
....
"alertname": "MCCPoolAlert",
4. Remove the label from the node to fix the problem
$ oc label node $(oc get nodes -l node-role.kubernetes.io/master -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/infra-
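(Not part of the original report: before re-checking the alert, one way to confirm the label removal took effect is to list nodes carrying the infra label, which should now return nothing, and to check that the infra pool's machine count drops back to 0.)
$ oc get nodes -l node-role.kubernetes.io/infra
$ oc get mcp infra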
Actual results:
The alert is not removed.
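(Illustrative re-check, assuming the thanosalerts alias defined in step 3: querying the alerts again after the label has been removed still lists MCCPoolAlert.)
$ thanosalerts | grep MCCPoolAlert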
When we look at the mcc_pool_alert metric, we find two series with two different "alert" labels.
alias thanosquery='function __lgb() { unset -f __lgb; oc rsh -n openshift-monitoring prometheus-k8s-0 curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" --data-urlencode "query=$1" https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query | jq -c | jq; }; __lgb'
$ thanosquery mcc_pool_alert
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "mcc_pool_alert",
          "alert": "Applying custom label for pool",
          "container": "oauth-proxy",
          "endpoint": "metrics",
          "instance": "10.130.0.86:9001",
          "job": "machine-config-controller",
          "namespace": "openshift-machine-config-operator",
          "node": "ip-10-0-129-20.us-east-2.compute.internal",
          "pod": "machine-config-controller-76dbddff49-75ggr",
          "pool": "infra",
          "prometheus": "openshift-monitoring/k8s",
          "service": "machine-config-controller"
        },
        "value": [
          1686137977.158,
          "0"
        ]
      },
      {
        "metric": {
          "__name__": "mcc_pool_alert",
          "alert": "Given both master and custom pools. Defaulting to master: custom infra",
          "container": "oauth-proxy",
          "endpoint": "metrics",
          "instance": "10.130.0.86:9001",
          "job": "machine-config-controller",
          "namespace": "openshift-machine-config-operator",
          "node": "ip-10-0-129-20.us-east-2.compute.internal",
          "pod": "machine-config-controller-76dbddff49-75ggr",
          "pool": "infra",
          "prometheus": "openshift-monitoring/k8s",
          "service": "machine-config-controller"
        },
        "value": [
          1686137977.158,
          "1"
        ]
      }
    ]
  }
}
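(Illustrative query, assuming the thanosquery alias above: filtering on value 1 should isolate the stale series that keeps the alert firing.)
$ thanosquery 'mcc_pool_alert == 1'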
Expected results:
The alert should be removed.
Additional info:
If we delete the MCO controller pod, new mcc_pool_alert data is generated with the correct value and the stale series are removed. After applying this workaround, the alert is removed.
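(A minimal sketch of the workaround, assuming the controller pod carries the usual k8s-app=machine-config-controller label; deleting the pod by name works as well.)
$ oc delete pod -n openshift-machine-config-operator -l k8s-app=machine-config-controller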
Links to: RHSA-2023:5006 (OpenShift Container Platform 4.14.z security update)