Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.11, 4.17
Component/s: Storage / Operators
Labels: None
Severity: Low
Regression: No
Blocked: False
Blocked Reason: None
Target Version: 4.17.0


Description of problem:

The MultipleDefaultStorageClasses alert rule is incorrect: after the user fixes the cluster so that it has only one default storage class, the alert does not deactivate right away but stays active for about another 5 minutes.

Version-Release number of selected component (if applicable):

OCP 4.11+

How reproducible:

always (platform independent, reproducible with any driver and storage class)

Steps to Reproduce:

Set additional storage class as default
```
$ oc patch storageclass gp2-csi -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
storageclass.storage.k8s.io/gp2-csi patched
```

Check that the Prometheus metric `default_storage_class_count` is now greater than 1
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s --data-urlencode "query=default_storage_class_count" http://localhost:9090/api/v1/query | jq -r '.data.result[0].value[1]'
2
```

Wait at least 5 minutes for the alert to become `pending`; after 10 minutes the alert starts `firing`
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s http://localhost:9090/api/v1/alerts | jq -r '.data.alerts[] | select(.labels.alertname == "MultipleDefaultStorageClasses") | "\(.labels.alertname) - \(.state)"'
MultipleDefaultStorageClasses - firing
```

Annotate the storage class as non-default, making sure there is only one default now
```
$ oc patch storageclass gp2-csi -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'
storageclass.storage.k8s.io/gp2-csi patched
```

The alert is still present for 5 minutes but should have disappeared immediately. This is the actual bug.
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s http://localhost:9090/api/v1/alerts | jq -r '.data.alerts[] | select(.labels.alertname == "MultipleDefaultStorageClasses") | "\(.labels.alertname) - \(.state)"'
MultipleDefaultStorageClasses - firing
```

After 5 minutes the alert is gone (the command prints nothing)
```
$ oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- curl -s http://localhost:9090/api/v1/alerts | jq -r '.data.alerts[] | select(.labels.alertname == "MultipleDefaultStorageClasses") | "\(.labels.alertname) - \(.state)"'
```

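Root cause: the alerting rule uses `max_over_time`, but it should use `min_over_time`. With `max_over_time`, the expression stays above 1 until every sample taken while two defaults existed has aged out of the lookback window. The rule is here: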
https://github.com/openshift/cluster-storage-operator/blob/7b4d8861d8f9364d63ad9a58347c2a7a014bff70/manifests/12_prometheusrules.yaml#L19
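
To illustrate the root cause, here is a minimal sketch that applies a sliding-window maximum and minimum to a made-up series of `default_storage_class_count` samples. The sample values, the one-sample-per-minute spacing, the window size of 5, and the helper name `over_window` are all assumptions chosen for illustration; this is not how Prometheus evaluates range vectors internally, only a model of the same idea.

```shell
#!/usr/bin/env bash
# Hypothetical samples of default_storage_class_count, one per minute.
# A second default class is added at minute 2 and removed at minute 8.
samples=(1 1 2 2 2 2 2 2 1 1 1 1 1 1)
window=5  # samples per lookback window; stands in for a 5-minute range (assumption)

# over_window OP INDEX
# Print the max or min of the last $window samples ending at INDEX.
over_window() {
  local op=$1 end=$2 start i result
  start=$(( end - window + 1 ))
  if (( start < 0 )); then start=0; fi
  result=${samples[start]}
  for (( i = start + 1; i <= end; i++ )); do
    if [[ $op == max ]]; then
      if (( samples[i] > result )); then result=${samples[i]}; fi
    else
      if (( samples[i] < result )); then result=${samples[i]}; fi
    fi
  done
  echo "$result"
}

# Print the raw count next to both windowed aggregates for every minute.
for (( t = 0; t < ${#samples[@]}; t++ )); do
  printf 'minute=%2d count=%d max=%d min=%d\n' \
    "$t" "${samples[t]}" "$(over_window max "$t")" "$(over_window min "$t")"
done
```

In this series the windowed maximum stays above 1 for several samples after the count returns to 1 at minute 8, while the windowed minimum drops back to 1 as soon as the first corrected sample arrives. The trade-off is that the minimum also rises later: it only exceeds 1 once the whole window has seen more than one default class.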

Additional info:

To verify the changes, follow the same procedure and confirm that the alert is gone right after the settings are fixed (meaning there is only 1 default storage class again).
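
The verification could be scripted roughly as follows. This is a sketch only: it reuses the `oc exec ... curl ... | jq` query from the reproduction steps, while the function names `alert_state` and `wait_until_cleared`, the polling interval, and the poll limit are hypothetical and not part of any existing tooling.

```shell
#!/usr/bin/env bash
# Sketch: measure how long MultipleDefaultStorageClasses lingers after the fix.

# Print the alert state (pending or firing), or nothing when the alert is absent.
# Caveat: a failed oc or curl call also prints nothing, so it looks like "cleared".
alert_state() {
  oc exec -c prometheus -n openshift-monitoring prometheus-k8s-0 -- \
    curl -s http://localhost:9090/api/v1/alerts |
    jq -r '.data.alerts[] | select(.labels.alertname == "MultipleDefaultStorageClasses") | .state'
}

# wait_until_cleared INTERVAL_SECONDS MAX_POLLS
# Poll until the alert is absent and print how many polls still saw it.
# Returns 1 if the alert is still present after MAX_POLLS polls.
wait_until_cleared() {
  local interval=$1 max_polls=$2 polls=0 state
  while (( polls < max_polls )); do
    state=$(alert_state)
    if [[ -z $state ]]; then
      echo "$polls"
      return 0
    fi
    polls=$(( polls + 1 ))
    sleep "$interval"
  done
  echo "$polls"
  return 1
}
```

Run right after patching the storage class back to non-default, for example as `wait_until_cleared 15 40`. A result of more than a poll or two would suggest the alert is still lingering; how many polls are acceptable depends on the rule evaluation interval, which this sketch does not know.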

The changes are tricky to test: on a live cluster, editing the Prometheus rule won't work because CSO reconciles it, but if CSO is scaled down to prevent this, metrics are not collected. I'd suggest testing this by editing the CSO code, scaling down CSO and CVO, and running CSO locally; see the README for instructions: https://github.com/openshift/cluster-storage-operator/blob/master/README.md

links to

openshift/cluster-storage-operator#483: OCPBUGS-36169: deactivate MultipleDefaultStorageClasses alert immediately after being fixed

RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update

Assignee: Richard Hrmo

Reporter: Roman Bednar

QA Contact: Rohit Patil

Created: 2024/06/25 2:37 PM

Updated: 2024/10/01 5:38 PM

Resolved: 2024/10/01 5:38 PM
