Project: Data Foundation Bugs
Issue: DFBUGS-500

[2237742] Prometheus stops responding. Error 504


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: odf-4.18, odf-4.14
    • Component: ceph-monitoring

      Created attachment 1987371: no-metrics-1

      Description of problem (please be as detailed as possible and provide log
      snippets):

      During the tests test_cephfs_capacity_workload_alerts and test_rbd_capacity_workload_alerts, the cluster is filled with data up to the Ceph full ratio to trigger the CephClusterNearFull and CephClusterCriticallyFull alerts.
      During the test, Prometheus is queried every 3 seconds for existing alerts.
      The test takes ~25 minutes on average to fill the storage on a 100 GB cluster.

      3 out of 5 tests fail because Prometheus crashes and stops responding. Prometheus recovers and starts responding only after 4-5 minutes. During this time the UI does not show any metrics updates, alerts, or cluster utilization graphs (see attachments). A curl request to https://prometheus-k8s-openshift-monitoring.apps.dosypenk-39.qe.rh-ocs.com/api/v1/alerts receives Error 504.

      17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - Request https://prometheus-k8s-openshift-monitoring.apps.dosypenk-39.qe.rh-ocs.com/api/v1/alerts?silenced=False&inhibited=False failed
      17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - 504
      17:56:03 - Thread-2 - ocs_ci.utility.workloadfixture - ERROR - <html><body><h1>504 Gateway Time-out</h1>
      The server didn't respond in time.
      </body></html>
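For context, a minimal sketch of the kind of request the test loop issues and how the 504 above surfaces to it. The function names are illustrative, not the actual ocs_ci helpers:

```python
# Hypothetical sketch of the alerts query performed every 3 seconds.
# A 504 from the OpenShift router arrives as an HTTPError, which is
# returned to the caller so it can be logged as in the output above.
import urllib.error
import urllib.request


def build_alerts_url(route):
    # silenced/inhibited filters match the failing request in the log
    return f"https://{route}/api/v1/alerts?silenced=False&inhibited=False"


def fetch_alerts(route, timeout=10):
    url = build_alerts_url(route)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        # e.g. (504, "<html><body><h1>504 Gateway Time-out</h1>...")
        return err.code, err.read().decode()
```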

      Version of all relevant components (if applicable):

      OC version:
      Client Version: 4.13.4
      Kustomize Version: v4.5.7
      Server Version: 4.14.0-0.nightly-2023-09-02-132842
      Kubernetes Version: v1.27.4+2c83a9f

      OCS version:
      ocs-operator.v4.14.0-125.stable OpenShift Container Storage 4.14.0-125.stable Succeeded

      Cluster version
      NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
      version 4.14.0-0.nightly-2023-09-02-132842 True False 3d23h Cluster version is 4.14.0-0.nightly-2023-09-02-132842

      Rook version:
      rook: v4.14.0-0.194fd1e22bcb701c69a6d80f1b051f210ff89ee0
      go: go1.20.5

      Ceph version:
      ceph version 17.2.6-115.el9cp (968b780fae1bced13d322da769a9d7223d701a01) quincy (stable)

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Can this issue reproducible?

      Can this issue reproduce from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Create cephfs or RBD PVC with the size of 95% of the Cluster and attach it to a Deployment pod
      2. Run load command via dd or other way on a pod to fill up the storage
      3. Request periodically alerts via https://$ROUTE:443/api/v1/alerts
      4. Open browser, login to management console with kubeadmin user to observe ODF storage Overview page
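Steps 2-3 can be sketched as below. The pod name, mount path, and helper names are assumptions for illustration; dd exits non-zero once the volume is full, which is expected here:

```python
# Minimal reproduction sketch, assuming `oc` access and a pod with the
# PVC mounted at /mnt/data (both hypothetical).
import subprocess


def fill_storage(pod, namespace="openshift-storage"):
    # Step 2: fill the mounted PVC with zeros via dd inside the pod;
    # dd fails with ENOSPC when the volume is full, so check=False.
    subprocess.run(
        ["oc", "-n", namespace, "exec", pod, "--",
         "dd", "if=/dev/zero", "of=/mnt/data/fill.bin", "bs=1M"],
        check=False,
    )


def poll_schedule(duration_s, interval_s=3):
    # Step 3: the test queries the alerts API every 3 seconds; this
    # returns the offsets (in seconds) at which requests are sent.
    return list(range(0, duration_s, interval_s))
```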

      Actual results:

      Prometheus returns a 504 response to every request. The management console stops showing alerts, metrics, and utilization on the Overview pages.

      Expected results:

      Prometheus returns a response to every legitimate request, per the API documentation. The management console shows alerts on the Alerts page, metrics on the Metrics page, and utilization on the Overview pages.

      Additional info:
      Issue happens on ODF 4.12, 4.13, and 4.14 deployments

      must-gather logs: https://drive.google.com/file/d/1i5Sf2T5XDKxdD4zKnf9AG78r-5x_brt2/view?usp=sharing - issue happened at 17:56:03

      test log: http://pastebin.test.redhat.com/1108931

      metrics:
      ceph_cluster_total_used_bytes.json
      {"status": "success", "data": {"resultType": "matrix", "result": [{"metric": {"__name__": "ceph_cluster_total_used_bytes", "container": "mgr", "endpoint": "http-metrics", "instance": "10.131.0.24:9283", "job": "rook-ceph-mgr", "managedBy": "ocs-storagecluster", "namespace": "openshift-storage", "pod": "rook-ceph-mgr-a-68fb789468-6kqnf", "service": "rook-ceph-mgr"}, "values": [[1694012693.494, "75253993472"], [1694012694.494, "75253993472"], [1694012695.494, "75253993472"], [1694012696.494, "75253993472"]]}]}}

      cluster/memory_usage_bytes/sum.json
      {"status": "success", "data": {"resultType": "matrix", "result": [{"metric": {"__name__": "cluster:memory_usage_bytes:sum"}, "values": [[1694012693.494, "53300436992"], [1694012694.494, "53300436992"], [1694012695.494, "53300436992"], [1694012696.494, "53300436992"]]}]}}
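The metric samples above are matrix-type range-query results; a short sketch of extracting the newest sample (the inline JSON is abbreviated from the ceph_cluster_total_used_bytes payload in this report):

```python
# Pull the latest (timestamp, value) pair out of a Prometheus
# matrix result. Values come back as strings and must be cast
# before any arithmetic.
import json


def latest_value(range_result):
    """Return the newest [timestamp, value] sample of the first series."""
    series = range_result["data"]["result"][0]
    return series["values"][-1]


sample = json.loads(
    '{"status": "success", "data": {"resultType": "matrix",'
    ' "result": [{"metric": {"__name__": "ceph_cluster_total_used_bytes"},'
    ' "values": [[1694012693.494, "75253993472"],'
    ' [1694012696.494, "75253993472"]]}]}}'
)
ts, used = latest_value(sample)
used_bytes = int(used)
```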

              aruniiird (Arun Kumar Mohan)
              rh-ee-dosypenk (Daniel Osypenko)
              Harish Nallur Vittal Rao