Project: Data Foundation Bugs
Issue: DFBUGS-500

[2237742] Prometheus stops responding. Error 504


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: odf-4.18, odf-4.14
    • Component: ceph-monitoring

      Created attachment 1987371: no-metrics-1

      Description of problem (please be as detailed as possible and provide log
      snippets):

      During the tests test_cephfs_capacity_workload_alerts and test_rbd_capacity_workload_alerts, the cluster is filled with data up to the Ceph full ratio to trigger the CephClusterNearFull and CephClusterCriticallyFull alerts.
      During the test, Prometheus is queried every 3 seconds for existing alerts.
      The test takes ~25 minutes on average to fill the storage on a 100 GB cluster.

      3 out of 5 tests fail because Prometheus crashes and stops responding. Prometheus recovers and starts responding only after 4-5 minutes. During this time the UI does not show any metrics updates, alerts, or cluster utilization graphs (see attachments). A curl request to https://prometheus-k8s-openshift-monitoring.apps.dosypenk-39.qe.rh-ocs.com/api/v1/alerts receives Error 504.

      17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - Request https://prometheus-k8s-openshift-monitoring.apps.dosypenk-39.qe.rh-ocs.com/api/v1/alerts?silenced=False&inhibited=False failed
      17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - 504
      17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - <html><body><h1>504 Gateway Time-out</h1>
      The server didn't respond in time.
      </body></html>
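For context, a minimal sketch of the kind of request the test loop issues and how the 504 above surfaces to it. The function names are illustrative, not the actual ocs_ci helpers:

```python
# Hypothetical sketch of the alerts query performed every 3 seconds.
# A 504 from the OpenShift router arrives as an HTTPError, which is
# returned to the caller so it can be logged as in the output above.
import urllib.error
import urllib.request


def build_alerts_url(route):
    # silenced/inhibited filters match the failing request in the log
    return f"https://{route}/api/v1/alerts?silenced=False&inhibited=False"


def fetch_alerts(route, timeout=10):
    url = build_alerts_url(route)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        # e.g. (504, "<html><body><h1>504 Gateway Time-out</h1>...")
        return err.code, err.read().decode()
```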

      Version of all relevant components (if applicable):

      OC version:
      Client Version: 4.13.4
      Kustomize Version: v4.5.7
      Server Version: 4.14.0-0.nightly-2023-09-02-132842
      Kubernetes Version: v1.27.4+2c83a9f

      OCS version:
      ocs-operator.v4.14.0-125.stable OpenShift Container Storage 4.14.0-125.stable Succeeded

      Cluster version
      NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
      version 4.14.0-0.nightly-2023-09-02-132842 True False 3d23h Cluster version is 4.14.0-0.nightly-2023-09-02-132842

      Rook version:
      rook: v4.14.0-0.194fd1e22bcb701c69a6d80f1b051f210ff89ee0
      go: go1.20.5

      Ceph version:
      ceph version 17.2.6-115.el9cp (968b780fae1bced13d322da769a9d7223d701a01) quincy (stable)

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Can this issue reproducible?

      Can this issue reproduce from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Create cephfs or RBD PVC with the size of 95% of the Cluster and attach it to a Deployment pod
      2. Run load command via dd or other way on a pod to fill up the storage
      3. Request periodically alerts via https://$ROUTE:443/api/v1/alerts
      4. Open browser, login to management console with kubeadmin user to observe ODF storage Overview page
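Steps 2-3 can be sketched as below. The pod name, mount path, and helper names are assumptions for illustration; dd exits non-zero once the volume is full, which is expected here:

```python
# Minimal reproduction sketch, assuming `oc` access and a pod with the
# PVC mounted at /mnt/data (both hypothetical).
import subprocess


def fill_storage(pod, namespace="openshift-storage"):
    # Step 2: fill the mounted PVC with zeros via dd inside the pod;
    # dd fails with ENOSPC when the volume is full, so check=False.
    subprocess.run(
        ["oc", "-n", namespace, "exec", pod, "--",
         "dd", "if=/dev/zero", "of=/mnt/data/fill.bin", "bs=1M"],
        check=False,
    )


def poll_schedule(duration_s, interval_s=3):
    # Step 3: the test queries the alerts API every 3 seconds; this
    # returns the offsets (in seconds) at which requests are sent.
    return list(range(0, duration_s, interval_s))
```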

      Actual results:

      Prometheus returns a 504 response to every request. The management console stops showing alerts, metrics, and utilization on the Overview pages.

      Expected results:

      Prometheus returns a response to every legitimate request, per the API documentation. The management console shows alerts on the Alerts page, metrics on the Metrics page, and utilization on the Overview pages.

      Additional info:
      Issue happens on ODF 4.12, 4.13, and 4.14 deployments

      must-gather logs: https://drive.google.com/file/d/1i5Sf2T5XDKxdD4zKnf9AG78r-5x_brt2/view?usp=sharing - issue happened at 17:56:03

      test log: http://pastebin.test.redhat.com/1108931

      metrics:
      ceph_cluster_total_used_bytes.json
      {"status": "success", "data": {"resultType": "matrix", "result": [{"metric": {"__name__": "ceph_cluster_total_used_bytes", "container": "mgr", "endpoint": "http-metrics", "instance": "10.131.0.24:9283", "job": "rook-ceph-mgr", "managedBy": "ocs-storagecluster", "namespace": "openshift-storage", "pod": "rook-ceph-mgr-a-68fb789468-6kqnf", "service": "rook-ceph-mgr"}, "values": [[1694012693.494, "75253993472"], [1694012694.494, "75253993472"], [1694012695.494, "75253993472"], [1694012696.494, "75253993472"]]}]}}

      cluster/memory_usage_bytes/sum.json
      {"status": "success", "data": {"resultType": "matrix", "result": [{"metric": {"__name__": "cluster:memory_usage_bytes:sum"}, "values": [[1694012693.494, "53300436992"], [1694012694.494, "53300436992"], [1694012695.494, "53300436992"], [1694012696.494, "53300436992"]]}]}}
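The metric samples above are matrix-type range-query results; a short sketch of extracting the newest sample (the inline JSON is abbreviated from the ceph_cluster_total_used_bytes payload in this report):

```python
# Pull the latest (timestamp, value) pair out of a Prometheus
# matrix result. Values come back as strings and must be cast
# before any arithmetic.
import json


def latest_value(range_result):
    """Return the newest [timestamp, value] sample of the first series."""
    series = range_result["data"]["result"][0]
    return series["values"][-1]


sample = json.loads(
    '{"status": "success", "data": {"resultType": "matrix",'
    ' "result": [{"metric": {"__name__": "ceph_cluster_total_used_bytes"},'
    ' "values": [[1694012693.494, "75253993472"],'
    ' [1694012696.494, "75253993472"]]}]}}'
)
ts, used = latest_value(sample)
used_bytes = int(used)
```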

              aruniiird (Arun Kumar Mohan)
              rh-ee-dosypenk (Daniel Osypenko)
              Harish Nallur Vittal Rao