Bug
Resolution: Done-Errata
Undefined
odf-4.16
None
Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:
PrometheusDuplicateTimestamps alerts are generated for the rook-ceph-osd-key-rotation-X pods. These pods have the same toleration defined twice, which causes the kube_pod_tolerations metrics reported for them to be duplicated.
The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):
All
The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):
4.16.4
The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):
4.16.z
Does this issue impact your ability to continue to work with the product?
There is no direct impact, but the PrometheusDuplicateTimestamps alert keeps firing whenever the rook-ceph-osd-key-rotation-X pods complete.
Is there any workaround available to the best of your knowledge?
Deleting the completed jobs created by the rook-ceph-osd-key-rotation-X cronjobs fixes the problem.
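As an illustrative sketch of the workaround (exact job names vary per cluster; <job-name> is a placeholder):

    # List the completed key-rotation jobs in the ODF namespace
    oc get jobs -n openshift-storage | grep rook-ceph-osd-key-rotation

    # Delete a completed job; replace <job-name> with a job from the list above
    oc delete job <job-name> -n openshift-storage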
Can this issue be reproduced? If so, please provide the hit rate
100%
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Install and setup RHODF
2. Create a StorageSystem, enable encryption using Vault, and apply taints on the storage nodes as well.
3. Wait for the cronjobs rook-ceph-osd-key-rotation-X to be created.
4. Trigger jobs from these cronjobs so that new pods are spun up (see the sketch after these steps).
5. After a few minutes, the PrometheusDuplicateTimestamps alert starts firing.
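A minimal sketch for step 4, assuming the default openshift-storage namespace and a cronjob named rook-ceph-osd-key-rotation-0 (the numeric suffix varies per OSD):

    # List the key-rotation cronjobs created by ODF
    oc get cronjobs -n openshift-storage | grep rook-ceph-osd-key-rotation

    # Manually trigger a job from one of the cronjobs so a new pod is spun up
    oc create job test-key-rotation --from=cronjob/rook-ceph-osd-key-rotation-0 -n openshift-storage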
The exact date and time when the issue was observed, including timezone details:
Actual results:
The PrometheusDuplicateTimestamps alert is raised when pods are created by the rook-ceph-osd-key-rotation-X cronjobs. This happens because duplicate tolerations for the storage node are present in the associated cronjobs:
tolerations:
- effect: NoSchedule
  key: node.ocs.openshift.io/storage
  operator: Equal
  value: "true"
- effect: NoSchedule
  key: node.ocs.openshift.io/storage
  operator: Equal
  value: "true"
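The duplicate can be confirmed directly on a completed key-rotation pod; the pod name below is only an example taken from the Prometheus log further down:

    # Show the tolerations carried by a completed key-rotation pod
    oc get pod rook-ceph-osd-key-rotation-21-28903680-49xxl -n openshift-storage \
      -o jsonpath='{.spec.tolerations}'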
Expected results:
When the rook-ceph-osd-key-rotation-X cronjobs complete, they should not trigger the above alert, and the associated pods should carry the storage-node toleration only once:
tolerations:
- effect: NoSchedule
  key: node.ocs.openshift.io/storage
  operator: Equal
  value: "true"
Logs collected and log location:
Below are the logs seen in the prometheus-k8s pods:
2024-12-27T18:39:01.605719007Z ts=2024-12-27T18:39:01.605Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kube-state-metrics/0 target=https://<pod-ip>:8443/metrics msg="Duplicate sample for timestamp" series="kube_pod_tolerations{namespace=\"openshift-storage\",pod=\"rook-ceph-osd-key-rotation-21-28903680-49xxl\",uid=\"3a3c000a-1c8d-432c-a126-4f9291190902\",key=\"node.ocs.openshift.io/storage\",operator=\"Equal\",value=\"true\",effect=\"NoSchedule\"}"
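A sketch of how these debug lines can be pulled from Prometheus, assuming the usual prometheus-k8s-0 pod and prometheus container in the openshift-monitoring namespace:

    # Search the Prometheus container logs for the duplicate-sample messages
    oc logs prometheus-k8s-0 -c prometheus -n openshift-monitoring | grep "Duplicate sample for timestamp"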
Additional info:
Links to: RHBA-2024:138027 Red Hat OpenShift Data Foundation 4.18 security, enhancement & bug fix update