-
Bug
-
Resolution: Unresolved
-
Critical
-
odf-4.16
Description of problem (please be as detailed as possible and provide log
snippets):
This issue recurs when one Ceph monitor pod is temporarily scaled down to 0 replicas, leaving 2 monitor pods running.
Data points are missing from the Ceph health status metrics returned by the Prometheus query range API. Samples are expected at 15-second intervals, but there is a 45-second gap between 1715325580.828 and 1715325625.828. This means two data points (1715325595.828 and 1715325610.828) are missing within this range.
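The gap can be located directly from the returned timestamps. The following is a minimal sketch, not the ocs-ci implementation; the function name `find_gaps` and the `tolerance` parameter are hypothetical and chosen only for illustration.

```python
def find_gaps(timestamps, step=15.0, tolerance=0.5):
    """Return (previous, next, missing_count) for every hole in a sample series.

    timestamps: sample times in seconds, sorted in ascending order.
    step: expected spacing between consecutive samples, in seconds.
    tolerance: slack in seconds before a spacing counts as a hole (assumed value).
    """
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta = curr - prev
        if delta > step + tolerance:
            # A spacing of n steps means n - 1 samples are absent in between.
            missing = round(delta / step) - 1
            gaps.append((prev, curr, missing))
    return gaps


# Timestamps around the hole, copied from the query_range response in this report.
samples = [1715325565.828, 1715325580.828, 1715325625.828, 1715325640.828]
for prev, curr, missing in find_gaps(samples):
    print(f"{missing} sample(s) missing between {prev} and {curr}")
```

Applied to the full `values` list from the response, this would report a single hole with two missing samples.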
2024-05-10 03:06:55,827 - MainThread - INFO - tests.functional.monitoring.conftest.measure_stop_ceph_mon.144 - Monitors to stop: ['rook-ceph-mon-c']
2024-05-10 03:06:55,827 - MainThread - INFO - tests.functional.monitoring.conftest.measure_stop_ceph_mon.145 - Monitors left to run: ['rook-ceph-mon-a', 'rook-ceph-mon-b']
...
2024-05-10 03:21:09,670 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster-prod/ocs-ci/ocs_ci/utility/prometheus.py.get.431 - params=
2024-05-10 03:21:09,672 - MainThread - DEBUG - urllib3.connectionpool._new_conn.1019 - Starting new HTTPS connection (1): prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com:443
2024-05-10 03:21:09,693 - MainThread - DEBUG - urllib3.connectionpool._make_request.474 - https://prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com:443 "GET /api/v1/query_range?query=ceph_health_status&start=1715324815.827674&end=1715325663.9413576&step=15 HTTP/1.1" 200 439
2024-05-10 03:21:09,705 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster-prod/ocs-ci/ocs_ci/utility/prometheus.py.validate_status.304 - content value: {'status': 'success', 'data': {'resultType': 'matrix', 'result': [{'metric':
, 'values': [[1715324815.828, '0'], [1715324830.828, '0'], [1715324845.828, '0'], [1715324860.828, '0'], [1715324875.828, '1'], [1715324890.828, '1'], [1715324905.828, '1'], [1715324920.828, '1'], [1715324935.828, '1'], [1715324950.828, '1'], [1715324965.828, '1'], [1715324980.828, '1'], [1715324995.828, '1'], [1715325010.828, '1'], [1715325025.828, '1'], [1715325040.828, '1'], [1715325055.828, '1'], [1715325070.828, '1'], [1715325085.828, '1'], [1715325100.828, '1'], [1715325115.828, '1'], [1715325130.828, '1'], [1715325145.828, '1'], [1715325160.828, '1'], [1715325175.828, '1'], [1715325190.828, '1'], [1715325205.828, '1'], [1715325220.828, '1'], [1715325235.828, '1'], [1715325250.828, '1'], [1715325265.828, '1'], [1715325280.828, '1'], [1715325295.828, '1'], [1715325310.828, '1'], [1715325325.828, '1'], [1715325340.828, '1'], [1715325355.828, '1'], [1715325370.828, '1'], [1715325385.828, '1'], [1715325400.828, '1'], [1715325415.828, '1'], [1715325430.828, '1'], [1715325445.828, '1'], [1715325460.828, '1'], [1715325475.828, '1'], [1715325490.828, '1'], [1715325505.828, '1'], [1715325520.828, '1'], [1715325535.828, '1'], [1715325550.828, '1'], [1715325565.828, '1'], [1715325580.828, '1'], [1715325625.828, '0'], [1715325640.828, '0'], [1715325655.828, '0']]}]}}
2024-05-10 03:21:09,705 - MainThread - ERROR - /home/jenkins/workspace/qe-deploy-ocs-cluster-prod/ocs-ci/ocs_ci/utility/prometheus.py.query_range.597 - there are holes in prometheus data: result size is 55 while expected sample size is 56 +-1
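As a cross-check on the sample count: Prometheus evaluates a range query at `start`, `start + step`, and so on up to `end`, so a series that is present at every step yields `floor((end - start) / step) + 1` samples. The sketch below applies that to the parameters from the request above. The helper name is hypothetical, and this is not the check that ocs-ci performs, which reports its own expectation of 56 +-1.

```python
import math


def expected_sample_count(start, end, step):
    """Number of evaluation points in a range query over [start, end]."""
    return math.floor((end - start) / step) + 1


# Parameters copied from the query_range request in the log.
start = 1715324815.827674
end = 1715325663.9413576
step = 15

expected = expected_sample_count(start, end, step)
observed = 55  # result size reported in the error message

print(f"expected={expected} observed={observed} missing={expected - observed}")
```

Under this assumption a complete series has 57 samples, so the 55 returned are two short, which matches the 45-second gap in the data.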
Test test_monitoring_shows_mon_down is failing on a variety of platforms.
Version of all relevant components (if applicable):
Cluster version 4.16.0-0.nightly-2024-05-08-222442
ODF Operator 4.16.0-95
Test run name OCS4-16-Downstream-OCP4-16-VSPHERE6-IPI-1AZ-RHCOS-VSAN-3M-3W-tier3
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Is there any workaround available to the best of your knowledge?
no
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
yes
Can this issue be reproduced from the UI?
-
If this is a regression, please provide more details to justify this:
Regression: only 1/10 of the tests are passing.
Steps to Reproduce:
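1. Scale one Ceph monitor pod (rook-ceph-mon-c in this run) down to 0 replicas, leaving two monitors running.
2. Send a Prometheus range query similar to https://prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com:443 "GET /api/v1/query_range?query=ceph_health_status&start=1715324815.827674&end=1715325663.9413576&step=15 HTTP/1.1".
3. Check the returned samples for gaps at the 15-second step.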
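The range query in the steps above can be rebuilt as follows. This is a sketch using only the standard library; the helper name `build_query_range_url` is hypothetical. Sending the request to the OpenShift monitoring route would also need authentication (typically a bearer token), which is left out here.

```python
from urllib.parse import urlencode


def build_query_range_url(base_url, query, start, end, step):
    """Assemble a Prometheus /api/v1/query_range URL from its parameters."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


# Host and parameters copied from the log; start and end are kept as strings
# so that the exact values from the failing run are preserved.
url = build_query_range_url(
    "https://prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com:443",
    "ceph_health_status",
    "1715324815.827674",
    "1715325663.9413576",
    15,
)
print(url)
```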
Actual results:
data holes detected
Expected results:
no data holes detected
must-gather logs OCP http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-075vi1cs33-t3/j-075vi1cs33-t3_20240510T004136/logs/testcases_1715321166/j-075vi1cs33-t3/ocp_must_gather/