Bug
Resolution: Unresolved
Critical
odf-4.14
Created attachment 1987371: no-metrics-1
Description of problem (please be as detailed as possible and provide log snippets):
During the tests test_cephfs_capacity_workload_alerts and test_rbd_capacity_workload_alerts, the cluster is filled with data up to the Ceph full ratio to trigger the CephClusterNearFull and CephClusterCriticallyFull alerts.
During the test, Prometheus is queried every 3 seconds for the currently firing alerts.
The test spends on average 25 minutes filling the storage on a 100 GB cluster.
3 out of 5 tests fail because Prometheus crashes and stops responding. Prometheus recovers and starts responding again only after 4-5 minutes. During this time the UI does not show any metrics updates, alerts, or cluster utilization graphs (see attachments). A curl request to https://prometheus-k8s-openshift-monitoring.apps.dosypenk-39.qe.rh-ocs.com/api/v1/alerts receives Error 504.
17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - Request https://prometheus-k8s-openshift-monitoring.apps.dosypenk-39.qe.rh-ocs.com/api/v1/alerts?silenced=False&inhibited=False failed
17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - 504
17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
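For context, a minimal sketch of the polling loop the test performs against the alerts endpoint (the route, token handling, and helper name here are illustrative assumptions, not the actual ocs_ci code):

import time
import requests

# Illustrative values; the real test resolves the route and bearer token from the cluster.
ALERTS_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com/api/v1/alerts"
TOKEN = "<serviceaccount-token>"

def poll_alerts(interval=3):
    """Query the Prometheus alerts endpoint every `interval` seconds."""
    while True:
        resp = requests.get(
            ALERTS_URL,
            params={"silenced": "False", "inhibited": "False"},
            headers={"Authorization": f"Bearer {TOKEN}"},
            verify=False,  # QE clusters commonly use self-signed certificates
            timeout=30,
        )
        if resp.status_code != 200:
            # During the failure window this is where the 504 Gateway Time-out shows up
            print(f"Request {resp.url} failed: {resp.status_code}")
        else:
            print(f"{len(resp.json()['data']['alerts'])} active alerts")
        time.sleep(interval)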
Version of all relevant components (if applicable):
OC version:
Client Version: 4.13.4
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2023-09-02-132842
Kubernetes Version: v1.27.4+2c83a9f
OCS version:
ocs-operator.v4.14.0-125.stable OpenShift Container Storage 4.14.0-125.stable Succeeded
Cluster version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.0-0.nightly-2023-09-02-132842 True False 3d23h Cluster version is 4.14.0-0.nightly-2023-09-02-132842
Rook version:
rook: v4.14.0-0.194fd1e22bcb701c69a6d80f1b051f210ff89ee0
go: go1.20.5
Ceph version:
ceph version 17.2.6-115.el9cp (968b780fae1bced13d322da769a9d7223d701a01) quincy (stable)
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Create a CephFS or RBD PVC sized to 95% of the cluster capacity and attach it to a Deployment pod
2. Run a load command (dd or similar) on the pod to fill up the storage (see the sketch after this list)
3. Periodically request alerts via https://$ROUTE:443/api/v1/alerts
4. Open a browser and log in to the management console as the kubeadmin user to observe the ODF storage Overview page
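A rough sketch of steps 2-3 under assumptions (the namespace, pod name, mount path, and fill size below are illustrative; the actual tests create these objects through ocs-ci fixtures):

import subprocess

NAMESPACE = "fill-test"            # illustrative namespace
POD = "fill-deployment-pod"        # pod of the Deployment that mounts the PVC
MOUNT_PATH = "/mnt/data"           # assumed PVC mount path inside the pod

def fill_storage(size_gb=90):
    """Step 2: write `size_gb` GiB of zeros onto the mounted PVC via dd."""
    subprocess.run(
        [
            "oc", "-n", NAMESPACE, "rsh", POD,
            "dd", "if=/dev/zero", f"of={MOUNT_PATH}/fill.bin",
            "bs=1M", f"count={size_gb * 1024}",
        ],
        check=True,
    )

def get_alerts(route, token):
    """Step 3: curl-equivalent request for the currently firing alerts."""
    out = subprocess.run(
        ["curl", "-k", "-s", "-H", f"Authorization: Bearer {token}",
         f"https://{route}/api/v1/alerts"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout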
Actual results:
Prometheus returns response 504 for every request. The management console stops showing alerts / metrics / utilization on the Overview pages
Expected results:
Prometheus returns a response to any legitimate request, as per the API documentation. The management console shows alerts on the Alerts page, metrics on the Metrics page, and utilization on the Overview pages
Additional info:
The issue happens on ODF 4.12, 4.13, and 4.14 deployments
must-gather logs: https://drive.google.com/file/d/1i5Sf2T5XDKxdD4zKnf9AG78r-5x_brt2/view?usp=sharing - issue happened at 17:56:03
test log: http://pastebin.test.redhat.com/1108931
metrics:
ceph_cluster_total_used_bytes.json
{"status": "success", "data": {"resultType": "matrix", "result": [{"metric":
, "values": [[1694012693.494, "75253993472"], [1694012694.494, "75253993472"], [1694012695.494, "75253993472"], [1694012696.494, "75253993472"]]}]}}
cluster/memory_usage_bytes/sum.json
{"status": "success", "data": {"resultType": "matrix", "result": [{"metric":
, "values": [[1694012693.494, "53300436992"], [1694012694.494, "53300436992"], [1694012695.494, "53300436992"], [1694012696.494, "53300436992"]]}]}}