Type: Bug
Resolution: Unresolved
Priority: Critical
Affects Version/s: odf-4.18, odf-4.17
Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:
---------------------------------------------------------------------------------------
On OCP clusters with ODF, the PrometheusDuplicateTimestamps alert is firing:
$ TOKEN=$(oc whoami --show-token)
$ ROUTE=$(oc get routes -n openshift-monitoring -o jsonpath='{.items[?(@.metadata.name=="prometheus-k8s")].spec.host}')
$ curl -k -H "Authorization: Bearer $TOKEN" -X GET https://$ROUTE/api/v1/alerts | jq '.data.alerts[].labels.alertname' | grep -i duplicate
"PrometheusDuplicateTimestamps"
"PrometheusDuplicateTimestamps"
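For reference, the stock PrometheusDuplicateTimestamps alert is (to the best of my knowledge) driven by the prometheus_target_scrapes_sample_duplicate_timestamp_total counter, so the duplicates can also be quantified per scrape target by querying it directly (a sketch, reusing $TOKEN and $ROUTE from above):

$ curl -k -H "Authorization: Bearer $TOKEN" "https://$ROUTE/api/v1/query" --data-urlencode 'query=rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0' | jq '.data.result'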
By enabling debug logs and examining the prom pods, we found the source is the rook-ceph-osd-prepare-ocs-deviceset pods in the openshift-storage namespace:
# Enable debug logs
oc patch configmap cluster-monitoring-config -n openshift-monitoring --type merge -p '{"data":{"config.yaml":"prometheusK8s:\n logLevel: debug\n"}}'

# Confirm the change
oc logs prometheus-k8s-0 -n openshift-monitoring | grep level=debug &>/dev/null ; echo $?
0

oc logs prometheus-k8s-0 -n openshift-monitoring | grep "Duplicate sample for timestamp"
ts=2024-11-12T13:19:52.455Z caller=scrape.go:1859 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kube-state-metrics/0 target=https://10.128.2.13:8443/metrics msg="Duplicate sample for timestamp" series="kube_pod_tolerations{namespace=\"openshift-storage\",pod=\"rook-ceph-osd-prepare-ocs-deviceset-2-data-0vv7ps-jnwpg\",uid=\"7f3adda0-e147-4196-a239-5376721bc0b8\",key=\"node.ocs.openshift.io/storage\",operator=\"Equal\",value=\"true\",effect=\"NoSchedule\"}"
ts=2024-11-12T13:19:52.456Z caller=scrape.go:1859 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kube-state-metrics/0 target=https://10.128.2.13:8443/metrics msg="Duplicate sample for timestamp" series="kube_pod_tolerations{namespace=\"openshift-storage\",pod=\"rook-ceph-osd-prepare-ocs-deviceset-1-data-0ssgtq-5ghjs\",uid=\"5f3ae9ca-3e91-4de2-b0d3-4906b6f30694\",key=\"node.ocs.openshift.io/storage\",operator=\"Equal\",value=\"true\",effect=\"NoSchedule\"}"
ts=2024-11-12T13:19:52.460Z caller=scrape.go:1859 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kube-state-metrics/0 target=https://10.128.2.13:8443/metrics msg="Duplicate sample for timestamp" series="kube_pod_tolerations{namespace=\"openshift-storage\",pod=\"rook-ceph-osd-prepare-ocs-deviceset-0-data-0h7k8r-mpf2g\",uid=\"46165a5e-7318-4625-94ca-35714fec65c5\",key=\"node.ocs.openshift.io/storage\",operator=\"Equal\",value=\"true\",effect=\"NoSchedule\"}"
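Side note: the patch above overwrites config.yaml in cluster-monitoring-config, so to restore the default log level afterwards (a minimal sketch, assuming no other prometheusK8s settings need to be preserved in that configmap):

oc patch configmap cluster-monitoring-config -n openshift-monitoring --type merge -p '{"data":{"config.yaml":"prometheusK8s:\n logLevel: info\n"}}'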
Curling the target (10.128.2.13:8443/metrics) and searching for the series confirms the duplicates:
$ oc exec prometheus-k8s-1 -n openshift-monitoring -- curl -k -H "Authorization: Bearer $TOKEN" "https://10.128.2.13:8443/metrics" | grep kube_pod_tolerations | grep rook-ceph-osd-prepare-ocs-devicese
kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-2-data-0vv7ps-jnwpg",uid="7f3adda0-e147-4196-a239-5376721bc0b8",key="node.ocs.openshift.io/storage",operator="Equal",value="true",effect="NoSchedule"} 1
kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-2-data-0vv7ps-jnwpg",uid="7f3adda0-e147-4196-a239-5376721bc0b8",key="node.ocs.openshift.io/storage",operator="Equal",value="true",effect="NoSchedule"} 1
kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-2-data-0vv7ps-jnwpg",uid="7f3adda0-e147-4196-a239-5376721bc0b8",key="node.kubernetes.io/not-ready",operator="Exists",effect="NoExecute",toleration_seconds="300"} 1
kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-2-data-0vv7ps-jnwpg",uid="7f3adda0-e147-4196-a239-5376721bc0b8",key="node.kubernetes.io/unreachable",operator="Exists",effect="NoExecute",toleration_seconds="300"} 1
kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-1-data-0ssgtq-5ghjs",uid="5f3ae9ca-3e91-4de2-b0d3-4906b6f30694",key="node.ocs.openshift.io/storage",operator="Equal",value="true",effect="NoSchedule"} 1
kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-1-data-0ssgtq-5ghjs",uid="5f3ae9ca-3e91-4de2-b0d3-4906b6f30694",key="node.ocs.openshift.io/storage",operator="Equal",value="true",effect="NoSchedule"} 1
kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-1-data-0ssgtq-5ghjs",uid="5f3ae9ca-3e91-4de2-b0d3-4906b6f30694",key="node.kubernetes.io/not-ready",operator="Exists",effect="NoExecute",toleration_seconds="300"} 1
kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-1-data-0ssgtq-5ghjs",uid="5f3ae9ca-3e91-4de2-b0d3-4906b6f30694",key="node.kubernetes.io/unreachable",operator="Exists",effect="NoExecute",toleration_seconds="300"} 1
kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-0-data-0h7k8r-mpf2g",uid="46165a5e-7318-4625-94ca-35714fec65c5",key="node.ocs.openshift.io/storage",operator="Equal",value="true",effect="NoSchedule"} 1
kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-0-data-0h7k8r-mpf2g",uid="46165a5e-7318-4625-94ca-35714fec65c5",key="node.ocs.openshift.io/storage",operator="Equal",value="true",effect="NoSchedule"} 1
kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-0-data-0h7k8r-mpf2g",uid="46165a5e-7318-4625-94ca-35714fec65c5",key="node.kubernetes.io/not-ready",operator="Exists",effect="NoExecute",toleration_seconds="300"} 1
kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-0-data-0h7k8r-mpf2g",uid="46165a5e-7318-4625-94ca-35714fec65c5",key="node.kubernetes.io/unreachable",operator="Exists",effect="NoExecute",toleration_seconds="300"} 1
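Since kube-state-metrics generates kube_pod_tolerations directly from the pod spec, the duplicated series suggest the node.ocs.openshift.io/storage toleration is present twice on the osd-prepare pods themselves. A quick way to confirm (a sketch, using one of the pod names from the output above):

$ oc get pod rook-ceph-osd-prepare-ocs-deviceset-2-data-0vv7ps-jnwpg -n openshift-storage -o json | jq '.spec.tolerations'

If the toleration appears twice in that output, the duplication would originate in whatever creates the prepare pods (rook/ocs-operator) rather than in kube-state-metrics itself.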
This issue was first reported in the following BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2322896
which was later moved to Jira:
https://issues.redhat.com/browse/DFBUGS-642
Since it was verified that MCG no longer produces duplicate metrics (per my comment in the BZ), it was decided to open a new bug against a different ODF component.
The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):
---------------------------------------------------------------------------------------
Reproduced on IBM Cloud
The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):
---------------------------------------------------------------------------------------
Reproduced using Internal deployment type
The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):
---------------------------------------------------------------------------------------
OCP: 4.18.0-0.nightly-2024-11-09-194014
ODF: 4.18.0-47
ceph: 19.2.0-47.el9cp (123a317ae596caa7f6d087fc76fffb6a736e0b5f) squid (stable)
rook: v4.18.0-0.c455c6812cde09cfa9bc5bb03f3454db0208e263
Is there any workaround available to the best of your knowledge?
---------------------------------------------------------------------------------------
No
Can this issue be reproduced? If so, please provide the hit rate
---------------------------------------------------------------------------------------
Yes - so far I have reproduced it on both IBM Cloud and vSphere platforms, on both ODF 4.17 and 4.18
Steps to Reproduce:
---------------------------------------------------------------------------------------
1. See the description above (a condensed sketch follows below).
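A condensed sketch of the check, based on the description (not a full deployment procedure):

# 1. Deploy ODF in internal mode on OCP 4.17/4.18 and wait for the
#    rook-ceph-osd-prepare-ocs-deviceset-* pods in openshift-storage:
oc get pods -n openshift-storage | grep rook-ceph-osd-prepare
# 2. Check whether the alert fires (reusing $TOKEN and $ROUTE from the description):
curl -k -H "Authorization: Bearer $TOKEN" https://$ROUTE/api/v1/alerts | jq '.data.alerts[].labels.alertname' | grep -i duplicate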
The exact date and time when the issue was observed, including timezone details:
---------------------------------------------------------------------------------------
November 12th, 2024, 13:34
Logs collected and log location:
---------------------------------------------------------------------------------------
must-gather logs: https://drive.google.com/drive/folders/1IWscbqRNfWd-jhaPmxptkIXvA2wiZnIo?usp=drive_link
is related to: DFBUGS-642 [2304076] duplicate metrics being produced (Closed)