Data Foundation Bugs / DFBUGS-308

[2292208] Missing Data Points in Ceph Health Status Metrics when one of the monitors is downscaled



      Description of problem (please be as detailed as possible and provide log
      snippets):

      This issue recurs when a Ceph monitor deployment is temporarily downscaled to 0 replicas, leaving 2 monitor pods running.

      There are missing data points in the ceph_health_status metric retrieved from the Prometheus query_range API. Samples are expected at 15-second intervals, but there is a 45-second gap between 1715325580.828 and 1715325625.828, which means two data points are missing within this range.

      2024-05-10 03:06:55,827 - MainThread - INFO - tests.functional.monitoring.conftest.measure_stop_ceph_mon.144 - Monitors to stop: ['rook-ceph-mon-c']
      2024-05-10 03:06:55,827 - MainThread - INFO - tests.functional.monitoring.conftest.measure_stop_ceph_mon.145 - Monitors left to run: ['rook-ceph-mon-a', 'rook-ceph-mon-b']
      ...
      2024-05-10 03:21:09,670 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster-prod/ocs-ci/ocs_ci/utility/prometheus.py.get.431 - params=

      {'query': 'ceph_health_status', 'start': 1715324815.827674, 'end': 1715325663.9413576, 'step': 15}

      2024-05-10 03:21:09,672 - MainThread - DEBUG - urllib3.connectionpool._new_conn.1019 - Starting new HTTPS connection (1): prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com:443
      2024-05-10 03:21:09,693 - MainThread - DEBUG - urllib3.connectionpool._make_request.474 - https://prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com:443 "GET /api/v1/query_range?query=ceph_health_status&start=1715324815.827674&end=1715325663.9413576&step=15 HTTP/1.1" 200 439
      2024-05-10 03:21:09,705 - MainThread - DEBUG - /home/jenkins/workspace/qe-deploy-ocs-cluster-prod/ocs-ci/ocs_ci/utility/prometheus.py.validate_status.304 - content value: {'status': 'success', 'data': {'resultType': 'matrix', 'result': [{'metric':

      {'__name__': 'ceph_health_status', 'container': 'mgr', 'endpoint': 'http-metrics', 'instance': '10.128.2.22:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-7ffddcf45f-nkb8z', 'service': 'rook-ceph-mgr'}

      , 'values': [[1715324815.828, '0'], [1715324830.828, '0'], [1715324845.828, '0'], [1715324860.828, '0'], [1715324875.828, '1'], [1715324890.828, '1'], [1715324905.828, '1'], [1715324920.828, '1'], [1715324935.828, '1'], [1715324950.828, '1'], [1715324965.828, '1'], [1715324980.828, '1'], [1715324995.828, '1'], [1715325010.828, '1'], [1715325025.828, '1'], [1715325040.828, '1'], [1715325055.828, '1'], [1715325070.828, '1'], [1715325085.828, '1'], [1715325100.828, '1'], [1715325115.828, '1'], [1715325130.828, '1'], [1715325145.828, '1'], [1715325160.828, '1'], [1715325175.828, '1'], [1715325190.828, '1'], [1715325205.828, '1'], [1715325220.828, '1'], [1715325235.828, '1'], [1715325250.828, '1'], [1715325265.828, '1'], [1715325280.828, '1'], [1715325295.828, '1'], [1715325310.828, '1'], [1715325325.828, '1'], [1715325340.828, '1'], [1715325355.828, '1'], [1715325370.828, '1'], [1715325385.828, '1'], [1715325400.828, '1'], [1715325415.828, '1'], [1715325430.828, '1'], [1715325445.828, '1'], [1715325460.828, '1'], [1715325475.828, '1'], [1715325490.828, '1'], [1715325505.828, '1'], [1715325520.828, '1'], [1715325535.828, '1'], [1715325550.828, '1'], [1715325565.828, '1'], [1715325580.828, '1'], [1715325625.828, '0'], [1715325640.828, '0'], [1715325655.828, '0']]}]}}
      2024-05-10 03:21:09,705 - MainThread - ERROR - /home/jenkins/workspace/qe-deploy-ocs-cluster-prod/ocs-ci/ocs_ci/utility/prometheus.py.query_range.597 - there are holes in prometheus data: result size is 55 while expected sample size is 56 +-1
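
      For context, a minimal sketch of the kind of gap and sample-count check that produces the error above (this is not the actual ocs-ci prometheus.py implementation; the helper names find_gaps and expected_sample_count are hypothetical):

      def find_gaps(values, step=15, tolerance=1.0):
          """Return (prev_ts, next_ts, missing) for every gap in a query_range series."""
          gaps = []
          timestamps = [ts for ts, _ in values]
          for prev, nxt in zip(timestamps, timestamps[1:]):
              delta = nxt - prev
              if delta > step + tolerance:
                  # e.g. 1715325625.828 - 1715325580.828 = 45 s -> 45/15 - 1 = 2 missing samples
                  gaps.append((prev, nxt, round(delta / step) - 1))
          return gaps

      def expected_sample_count(start, end, step=15):
          # rough expectation for a query_range window; the exact formula used by
          # ocs-ci may differ slightly (hence the "+-1" tolerance in the error above)
          return int((end - start) / step)

      Applied to the values above, find_gaps reports a single 45-second gap between 1715325580.828 and 1715325625.828 (two missing samples), and len(values) is 55 against expected_sample_count(1715324815.827674, 1715325663.9413576) == 56.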

      Test test_monitoring_shows_mon_down is failing on a variety of platforms.

      Version of all relevant components (if applicable):

      Cluster version 4.16.0-0.nightly-2024-05-08-222442
      ODF Operator 4.16.0-95
      Test run name OCS4-16-Downstream-OCP4-16-VSPHERE6-IPI-1AZ-RHCOS-VSAN-3M-3W-tier3

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)?

      Is there any workaround available to the best of your knowledge?
      no

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?
      Yes

      Can this issue be reproduced from the UI?
      -

      If this is a regression, please provide more details to justify this:
      This is a regression; only 1 out of 10 test runs passes.

      Steps to Reproduce:
      1. Downscale a Ceph monitor deployment to 0 replicas, leaving 2 monitor pods running.
      2. Make a ranged Prometheus request similar to GET /api/v1/query_range?query=ceph_health_status&start=1715324815.827674&end=1715325663.9413576&step=15 against https://prometheus-k8s-openshift-monitoring.apps.j-075vi1cs33-t3.qe.rh-ocs.com:443 (see the request sketch below).
      3. Check the returned samples for gaps larger than the 15-second step.
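
      A minimal reproduction sketch of step 2, assuming network access to the openshift-monitoring Prometheus route and a token that is allowed to query it (the PROM route URL and the PROM_TOKEN environment variable are placeholders, not part of the test suite):

      import os
      import time

      import requests

      PROM = "https://prometheus-k8s-openshift-monitoring.apps.<cluster-domain>"
      TOKEN = os.environ["PROM_TOKEN"]  # e.g. output of `oc whoami -t`

      end = time.time()
      start = end - 15 * 60  # window covering the mon downscale
      resp = requests.get(
          f"{PROM}/api/v1/query_range",
          params={"query": "ceph_health_status", "start": start, "end": end, "step": 15},
          headers={"Authorization": f"Bearer {TOKEN}"},
          verify=False,  # QE clusters often use self-signed certificates
      )
      resp.raise_for_status()
      values = resp.json()["data"]["result"][0]["values"]

      # flag any adjacent samples more than one step apart
      holes = [(prev, nxt) for (prev, _), (nxt, _) in zip(values, values[1:]) if nxt - prev > 16]
      print(f"{len(values)} samples, holes: {holes}")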

      Actual results:
      data holes detected

      Expected results:
      no data holes detected

      Additional info:
      test logs http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-075vi1cs33-t3/j-075vi1cs33-t3_20240510T004136/logs/ocs-ci-logs-1715321166/by_outcome/failed/tests/functional/monitoring/prometheus/metrics/test_monitoring_negative.py/test_monitoring_shows_mon_down/logs

      must-gather logs OCS http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-075vi1cs33-t3/j-075vi1cs33-t3_20240510T004136/logs/failed_testcase_ocs_logs_1715321166/test_monitoring_shows_mon_down_ocs_logs/j-075vi1cs33-t3/ocs_must_gather/

      must-gather logs OCP http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-075vi1cs33-t3/j-075vi1cs33-t3_20240510T004136/logs/testcases_1715321166/j-075vi1cs33-t3/ocp_must_gather/

              dkamboj@redhat.com Divyansh Kamboj
              rh-ee-dosypenk Daniel Osypenko
              Harish Nallur Vittal Rao