Resolution: Done-Errata
4.11, 4.17
Description of problem:
The MultipleDefaultStorageClasses alert rule is incorrect: the alert does not deactivate as soon as the user fixes the cluster to have only one default storage class, but stays active for roughly another 5 minutes after the fix is applied.
Version-Release number of selected component (if applicable):
OCP 4.11+
How reproducible:
always (platform independent, reproducible with any driver and storage class)
Steps to Reproduce:
1. Set an additional storage class as default:
```
$ oc patch storageclass gp2-csi -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
storageclass.storage.k8s.io/gp2-csi patched
```
2. Check that the Prometheus metric is now > 1:
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s --data-urlencode "query=default_storage_class_count" http://localhost:9090/api/v1/query | jq -r '.data.result[0].value[1]'
2
```
3. Wait at least 5 minutes for the alert to become `pending`; after 10 minutes the alert starts `firing`:
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s http://localhost:9090/api/v1/alerts | jq -r '.data.alerts[] | select(.labels.alertname == "MultipleDefaultStorageClasses") | "\(.labels.alertname) - \(.state)"'
MultipleDefaultStorageClasses - firing
```
4. Annotate the storage class as non-default, making sure there is only one default now:
```
$ oc patch storageclass gp2-csi -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'
storageclass.storage.k8s.io/gp2-csi patched
```
5. The alert is still present for 5 minutes, but it should have disappeared immediately. This is the actual bug:
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s http://localhost:9090/api/v1/alerts | jq -r '.data.alerts[] | select(.labels.alertname == "MultipleDefaultStorageClasses") | "\(.labels.alertname) - \(.state)"'
MultipleDefaultStorageClasses - firing
```
6. After 5 minutes the alert is gone (the command prints nothing):
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s http://localhost:9090/api/v1/alerts | jq -r '.data.alerts[] | select(.labels.alertname == "MultipleDefaultStorageClasses") | "\(.labels.alertname) - \(.state)"'
```
Root cause: the alerting rule uses `max_over_time`, but it should use `min_over_time`. See: https://github.com/openshift/cluster-storage-operator/blob/7b4d8861d8f9364d63ad9a58347c2a7a014bff70/manifests/12_prometheusrules.yaml#L19
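To illustrate why `max_over_time` keeps the alert alive after the fix, here is a minimal simulation sketch. It is not the operator's code: it assumes a 5-minute window, one sample per minute, and a `> 1` threshold, and the sample series is made up. The helper `over_time` is only a rough stand-in for the PromQL `*_over_time` functions.

```python
from typing import Callable, List

WINDOW = 5  # assumption: one sample per minute, so 5 samples approximate a [5m] range


def over_time(samples: List[int], agg: Callable[[List[int]], int],
              window: int = WINDOW) -> List[int]:
    """Apply agg to the trailing window ending at each sample."""
    return [agg(samples[max(0, i - window + 1): i + 1]) for i in range(len(samples))]


# Hypothetical default_storage_class_count series:
# two default classes for 12 minutes, then fixed back to one.
count = [2] * 12 + [1] * 8
fix_index = 12  # index of the first sample taken after the fix

max_series = over_time(count, max)
min_series = over_time(count, min)

# Alert condition "> 1" evaluated from the moment of the fix onward.
max_condition = [v > 1 for v in max_series[fix_index:]]
min_condition = [v > 1 for v in min_series[fix_index:]]

print("max_over_time > 1 after fix:", max_condition)
print("min_over_time > 1 after fix:", min_condition)
```

In this sketch the `max` variant stays true until every pre-fix sample has left the window, while the `min` variant turns false on the first post-fix sample, which matches the behavior described in the reproduction steps.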
Additional info:
To verify the change, follow the same procedure and confirm that the alert is gone right after the configuration is fixed (that is, there is only one default storage class again). The change is tricky to test: on a live cluster, editing the Prometheus rule won't work because CSO reconciles it, and if CSO is scaled down to prevent that, the metrics are no longer collected. I'd suggest testing by editing the CSO code, scaling down CSO and CVO, and running CSO locally. See the README for instructions: https://github.com/openshift/cluster-storage-operator/blob/master/README.md
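When repeating the check during verification, the `jq` filter from the reproduction steps could also be expressed as a small helper. This is only a sketch: `alert_state` is a hypothetical function, and the two payloads are made-up examples shaped like the `/api/v1/alerts` response, not captured cluster output.

```python
import json
from typing import Optional

ALERT_NAME = "MultipleDefaultStorageClasses"


def alert_state(payload: str, name: str = ALERT_NAME) -> Optional[str]:
    """Return the state of the named alert from /api/v1/alerts JSON, or None if absent."""
    alerts = json.loads(payload).get("data", {}).get("alerts", [])
    for alert in alerts:
        if alert.get("labels", {}).get("alertname") == name:
            return alert.get("state")
    return None


# Hypothetical payloads: one with the alert firing, one with no alerts at all.
before_fix = json.dumps({"status": "success", "data": {"alerts": [
    {"labels": {"alertname": ALERT_NAME}, "state": "firing"}]}})
after_fix = json.dumps({"status": "success", "data": {"alerts": []}})

print(alert_state(before_fix))
print(alert_state(after_fix))
```

With the corrected rule, the expectation is that the helper returns `None` on the first check after the storage class is patched back to non-default.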
- links to
RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update