Data Foundation Bugs / DFBUGS-308

[2292208] Missing Data Points in Ceph Health Status Metrics when one of the monitors is downscaled



      Description of problem (please be as detailed as possible and provide log
      snippets):

      This issue recurs when a Ceph monitor deployment is temporarily downscaled to 0 replicas, leaving 2 monitor pods running.

      There are missing data points in the ceph_health_status metric retrieved from the Prometheus query_range API. Samples are expected at 15-second intervals, but there is a 45-second gap between 1715325580.828 and 1715325625.828, which means two data points are missing within this range.

      2024-05-10 03:06:55,827 - MainThread - INFO - tests.functional.monitoring.conftest.measure_stop_ceph_mon.144 - Monitors to stop: ['rook-ceph-mon-c']
      2024-05-10 03:06:55,827 - MainThread - INFO - tests.functional.monitoring.conftest.measure_stop_ceph_mon.145 - Monitors left to run: ['rook-ceph-mon-a', 'rook-ceph-mon-b']
      ...
      2024-05-10 03:21:09,670 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster-prod/ocs-ci/ocs_ci/utility/prometheus.py.get.431 - params=

      {'query': 'ceph_health_status', 'start': 1715324815.827674, 'end': 1715325663.9413576, 'step': 15}

      2024-05-10 03:21:09,672 - MainThread - DEBUG - urllib3.connectionpool._new_conn.1019 - Starting new HTTPS connection (1): prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com:443
      2024-05-10 03:21:09,693 - MainThread - DEBUG - urllib3.connectionpool._make_request.474 - https://prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com:443 "GET /api/v1/query_range?query=ceph_health_status&start=1715324815.827674&end=1715325663.9413576&step=15 HTTP/1.1" 200 439
      2024-05-10 03:21:09,705 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster-prod/ocs-ci/ocs_ci/utility/prometheus.py.validate_status.304 - content value: {'status': 'success', 'data': {'resultType': 'matrix', 'result': [{'metric':

      {'__name__': 'ceph_health_status', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.128.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7ffddcf45f-nkb8z', 'service': 'rook-ceph-mgr'}

      , 'values': [[1715324815.828, '0'], [1715324830.828, '0'], [1715324845.828, '0'], [1715324860.828, '0'], [1715324875.828, '1'], [1715324890.828, '1'], [1715324905.828, '1'], [1715324920.828, '1'], [1715324935.828, '1'], [1715324950.828, '1'], [1715324965.828, '1'], [1715324980.828, '1'], [1715324995.828, '1'], [1715325010.828, '1'], [1715325025.828, '1'], [1715325040.828, '1'], [1715325055.828, '1'], [1715325070.828, '1'], [1715325085.828, '1'], [1715325100.828, '1'], [1715325115.828, '1'], [1715325130.828, '1'], [1715325145.828, '1'], [1715325160.828, '1'], [1715325175.828, '1'], [1715325190.828, '1'], [1715325205.828, '1'], [1715325220.828, '1'], [1715325235.828, '1'], [1715325250.828, '1'], [1715325265.828, '1'], [1715325280.828, '1'], [1715325295.828, '1'], [1715325310.828, '1'], [1715325325.828, '1'], [1715325340.828, '1'], [1715325355.828, '1'], [1715325370.828, '1'], [1715325385.828, '1'], [1715325400.828, '1'], [1715325415.828, '1'], [1715325430.828, '1'], [1715325445.828, '1'], [1715325460.828, '1'], [1715325475.828, '1'], [1715325490.828, '1'], [1715325505.828, '1'], [1715325520.828, '1'], [1715325535.828, '1'], [1715325550.828, '1'], [1715325565.828, '1'], [1715325580.828, '1'], [1715325625.828, '0'], [1715325640.828, '0'], [1715325655.828, '0']]}]}}
      2024-05-10 03:21:09,705 - MainThread - ERROR - /home/jenkins/workspace/qe-deploy-ocs-cluster-prod/ocs-ci/ocs_ci/utility/prometheus.py.query_range.597 - there are holes in prometheus data: result size is 55 while expected sample size is 56 +-1
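
      For context, a minimal sketch of the kind of gap and sample-count check that produces the error above (this is not the actual ocs-ci prometheus.py implementation; the helper names find_gaps and expected_sample_count are hypothetical):

      def find_gaps(values, step=15, tolerance=1.0):
          """Return (prev_ts, next_ts, missing) for every gap in a query_range series."""
          gaps = []
          timestamps = [ts for ts, _ in values]
          for prev, nxt in zip(timestamps, timestamps[1:]):
              delta = nxt - prev
              if delta > step + tolerance:
                  # e.g. 1715325625.828 - 1715325580.828 = 45 s -> 45/15 - 1 = 2 missing samples
                  gaps.append((prev, nxt, round(delta / step) - 1))
          return gaps

      def expected_sample_count(start, end, step=15):
          # rough expectation for a query_range window; the exact formula used by
          # ocs-ci may differ slightly (hence the "+-1" tolerance in the error above)
          return int((end - start) / step)

      Applied to the values above, find_gaps reports a single 45-second gap between 1715325580.828 and 1715325625.828 (two missing samples), and len(values) is 55 against expected_sample_count(1715324815.827674, 1715325663.9413576) == 56.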

      Test test_monitoring_shows_mon_down is failing on a variety of platforms.

      Version of all relevant components (if applicable):

      Cluster version 4.16.0-0.nightly-2024-05-08-222442
      ODF Operator 4.16.0-95
      Test run name OCS4-16-Downstream-OCP4-16-VSPHERE6-IPI-1AZ-RHCOS-VSAN-3M-3W-tier3

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?
      no

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?
      Yes

      Can this issue be reproduced from the UI?
      -

      If this is a regression, please provide more details to justify this:
      This is a regression; only 1 out of 10 test runs passes.

      Steps to Reproduce:
      1. Downscale a Ceph monitor deployment to 0 replicas, leaving 2 monitor pods running.
      2. Make a ranged Prometheus request similar to GET /api/v1/query_range?query=ceph_health_status&start=1715324815.827674&end=1715325663.9413576&step=15 against https://prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com:443 (see the request sketch below).
      3. Check the returned samples for gaps larger than the 15-second step.
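
      A minimal reproduction sketch of step 2, assuming network access to the openshift-monitoring Prometheus route and a token that is allowed to query it (the PROM route URL and the PROM_TOKEN environment variable are placeholders, not part of the test suite):

      import os
      import time

      import requests

      PROM = "https://prometheus-k8s-openshift-monitoring.apps.<cluster-domain>"
      TOKEN = os.environ["PROM_TOKEN"]  # e.g. output of `oc whoami -t`

      end = time.time()
      start = end - 15 * 60  # window covering the mon downscale
      resp = requests.get(
          f"{PROM}/api/v1/query_range",
          params={"query": "ceph_health_status", "start": start, "end": end, "step": 15},
          headers={"Authorization": f"Bearer {TOKEN}"},
          verify=False,  # QE clusters often use self-signed certificates
      )
      resp.raise_for_status()
      values = resp.json()["data"]["result"][0]["values"]

      # flag any adjacent samples more than one step apart
      holes = [(prev, nxt) for (prev, _), (nxt, _) in zip(values, values[1:]) if nxt - prev > 16]
      print(f"{len(values)} samples, holes: {holes}")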

      Actual results:
      data holes detected

      Expected results:
      no data holes detected

      Additional info:
      test logs http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-075vi1cs33-t3/j-075vi1cs33-t3_20240510T004136/logs/ocs-ci-logs-1715321166/by_outcome/failed/tests/functional/monitoring/prometheus/metrics/test_monitoring_negative.py/test_monitoring_shows_mon_down/logs

      must-gather logs OCS http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-075vi1cs33-t3/j-075vi1cs33-t3_20240510T004136/logs/failed_testcase_ocs_logs_1715321166/test_monitoring_shows_mon_down_ocs_logs/j-075vi1cs33-t3/ocs_must_gather/

      must-gather logs OCP http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-075vi1cs33-t3/j-075vi1cs33-t3_20240510T004136/logs/testcases_1715321166/j-075vi1cs33-t3/ocp_must_gather/

              dkamboj@redhat.com Divyansh Kamboj
              rh-ee-dosypenk Daniel Osypenko
              Harish Nallur Vittal Rao