Data Foundation Bugs / DFBUGS-839

rook-ceph-osd-prepare-ocs-deviceset pods produce duplicate metrics


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: odf-4.18, odf-4.17
    • Component/s: ceph-monitoring
    • Architecture: x86_64
    • Severity: Critical

      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

      ---------------------------------------------------------------------------------------

      On OCP clusters with ODF, the PrometheusDuplicateTimestamps alert is firing:

      $ TOKEN=$(oc whoami --show-token)
      $ ROUTE=$(oc get routes -n openshift-monitoring -o jsonpath='{.items[?(@.metadata.name=="prometheus-k8s")].spec.host}')
      $ curl -k -H "Authorization: Bearer $TOKEN" -X GET https://$ROUTE/api/v1/alerts | jq '.data.alerts[].labels.alertname' | grep -i duplicate
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100  5229    0  5229    0     0   8107      0 --:--:-- --:--:-- --:--:--  8119
      "PrometheusDuplicateTimestamps"
      "PrometheusDuplicateTimestamps" 

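      The PrometheusDuplicateTimestamps alert fires on Prometheus's prometheus_target_scrapes_sample_duplicate_timestamp_total counter, so the offending scrape targets can also be located with an instant query. A minimal sketch, reusing the $TOKEN and $ROUTE variables from above (-s just silences curl's progress meter):

      # Query for targets currently producing duplicate-timestamp samples
      $ curl -sk -H "Authorization: Bearer $TOKEN" \
          --data-urlencode 'query=rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0' \
          "https://$ROUTE/api/v1/query" | jq '.data.result[].metric'
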
      By enabling debug logs and examining the Prometheus pods, we found that the source of the duplicates is the rook-ceph-osd-prepare-ocs-deviceset pods in the openshift-storage namespace:

      # Enable debug logs
      oc patch configmap cluster-monitoring-config -n openshift-monitoring --type merge -p '{"data":{"config.yaml":"prometheusK8s:\n  logLevel: debug\n"}}' 
      
      # Confirm the change
      oc logs prometheus-k8s-0 -n openshift-monitoring | grep level=debug &>/dev/null ; echo $?
      0
      
      oc logs prometheus-k8s-0 -n openshift-monitoring | grep "Duplicate sample for timestamp"
      ts=2024-11-12T13:19:52.455Z caller=scrape.go:1859 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kube-state-metrics/0 target=https://10.128.2.13:8443/metrics msg="Duplicate sample for timestamp" series="kube_pod_tolerations{namespace=\"openshift-storage\",pod=\"rook-ceph-osd-prepare-ocs-deviceset-2-data-0vv7ps-jnwpg\",uid=\"7f3adda0-e147-4196-a239-5376721bc0b8\",key=\"node.ocs.openshift.io/storage\",operator=\"Equal\",value=\"true\",effect=\"NoSchedule\"}"
      ts=2024-11-12T13:19:52.456Z caller=scrape.go:1859 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kube-state-metrics/0 target=https://10.128.2.13:8443/metrics msg="Duplicate sample for timestamp" series="kube_pod_tolerations{namespace=\"openshift-storage\",pod=\"rook-ceph-osd-prepare-ocs-deviceset-1-data-0ssgtq-5ghjs\",uid=\"5f3ae9ca-3e91-4de2-b0d3-4906b6f30694\",key=\"node.ocs.openshift.io/storage\",operator=\"Equal\",value=\"true\",effect=\"NoSchedule\"}"
      ts=2024-11-12T13:19:52.460Z caller=scrape.go:1859 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/kube-state-metrics/0 target=https://10.128.2.13:8443/metrics msg="Duplicate sample for timestamp" series="kube_pod_tolerations{namespace=\"openshift-storage\",pod=\"rook-ceph-osd-prepare-ocs-deviceset-0-data-0h7k8r-mpf2g\",uid=\"46165a5e-7318-4625-94ca-35714fec65c5\",key=\"node.ocs.openshift.io/storage\",operator=\"Equal\",value=\"true\",effect=\"NoSchedule\"}"
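
      To see at a glance which scrape pools produce the duplicates, the debug messages can be aggregated; a small sketch, assuming debug logging is still enabled as above:

      # Count duplicate-sample debug messages per scrape pool
      oc logs prometheus-k8s-0 -n openshift-monitoring \
        | grep "Duplicate sample for timestamp" \
        | grep -o 'scrape_pool=[^ ]*' | sort | uniq -c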

      Curling the target (10.128.2.13:8443/metrics) and searching for the series confirms the existence of the duplicates:

      $ oc exec prometheus-k8s-1 -n openshift-monitoring -- curl -k -H "Authorization: Bearer $TOKEN" "https://10.128.2.13:8443/metrics" |  grep kube_pod_tolerations | grep rook-ceph-osd-prepare-ocs-devicese
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
        0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-2-data-0vv7ps-jnwpg",uid="7f3adda0-e147-4196-a239-5376721bc0b8",key="node.ocs.openshift.io/storage",operator="Equal",value="true",effect="NoSchedule"} 1
      kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-2-data-0vv7ps-jnwpg",uid="7f3adda0-e147-4196-a239-5376721bc0b8",key="node.ocs.openshift.io/storage",operator="Equal",value="true",effect="NoSchedule"} 1
      kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-2-data-0vv7ps-jnwpg",uid="7f3adda0-e147-4196-a239-5376721bc0b8",key="node.kubernetes.io/not-ready",operator="Exists",effect="NoExecute",toleration_seconds="300"} 1
      kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-2-data-0vv7ps-jnwpg",uid="7f3adda0-e147-4196-a239-5376721bc0b8",key="node.kubernetes.io/unreachable",operator="Exists",effect="NoExecute",toleration_seconds="300"} 1
      kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-1-data-0ssgtq-5ghjs",uid="5f3ae9ca-3e91-4de2-b0d3-4906b6f30694",key="node.ocs.openshift.io/storage",operator="Equal",value="true",effect="NoSchedule"} 1
      kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-1-data-0ssgtq-5ghjs",uid="5f3ae9ca-3e91-4de2-b0d3-4906b6f30694",key="node.ocs.openshift.io/storage",operator="Equal",value="true",effect="NoSchedule"} 1
      kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-1-data-0ssgtq-5ghjs",uid="5f3ae9ca-3e91-4de2-b0d3-4906b6f30694",key="node.kubernetes.io/not-ready",operator="Exists",effect="NoExecute",toleration_seconds="300"} 1
      kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-1-data-0ssgtq-5ghjs",uid="5f3ae9ca-3e91-4de2-b0d3-4906b6f30694",key="node.kubernetes.io/unreachable",operator="Exists",effect="NoExecute",toleration_seconds="300"} 1
      kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-0-data-0h7k8r-mpf2g",uid="46165a5e-7318-4625-94ca-35714fec65c5",key="node.ocs.openshift.io/storage",operator="Equal",value="true",effect="NoSchedule"} 1
      kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-0-data-0h7k8r-mpf2g",uid="46165a5e-7318-4625-94ca-35714fec65c5",key="node.ocs.openshift.io/storage",operator="Equal",value="true",effect="NoSchedule"} 1
      kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-0-data-0h7k8r-mpf2g",uid="46165a5e-7318-4625-94ca-35714fec65c5",key="node.kubernetes.io/not-ready",operator="Exists",effect="NoExecute",toleration_seconds="300"} 1
      kube_pod_tolerations{namespace="openshift-storage",pod="rook-ceph-osd-prepare-ocs-deviceset-0-data-0h7k8r-mpf2g",uid="46165a5e-7318-4625-94ca-35714fec65c5",key="node.kubernetes.io/unreachable",operator="Exists",effect="NoExecute",toleration_seconds="300"} 1 
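
      The duplicates can also be confirmed mechanically: piping the same scrape through sort and uniq -d prints only the series lines that occur more than once. A sketch, where 10.128.2.13 is the kube-state-metrics endpoint from the logs above:

      # Print only metric lines that appear more than once on the endpoint
      $ oc exec prometheus-k8s-1 -n openshift-monitoring -- curl -sk -H "Authorization: Bearer $TOKEN" \
          "https://10.128.2.13:8443/metrics" | grep '^kube_pod_tolerations' | sort | uniq -d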


      This issue was first reported in the following BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2322896

      It was later moved to Jira:

      https://issues.redhat.com/browse/DFBUGS-642

      Since it was verified that MCG no longer produces duplicate metrics (per my comment in the BZ), it was decided to open a new bug against a different ODF component.
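
      Since kube-state-metrics emits one kube_pod_tolerations series per toleration in the pod spec, a plausible follow-up check (not part of the original report) is whether the prepare pods themselves list the same toleration twice; the pod name below comes from the logs above and will differ per cluster:

      # Inspect the tolerations on one of the offending prepare pods
      $ oc get pod rook-ceph-osd-prepare-ocs-deviceset-2-data-0vv7ps-jnwpg -n openshift-storage -o json \
          | jq '.spec.tolerations'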

       

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc.; please clarify if it is a platform-agnostic deployment) and install method (IPI/UPI):

      ---------------------------------------------------------------------------------------

      Reproduced on IBM Cloud

       

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      ---------------------------------------------------------------------------------------

      Reproduced using Internal deployment type

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

      ---------------------------------------------------------------------------------------

      OCP: 4.18.0-0.nightly-2024-11-09-194014
      ODF: 4.18.0-47
      ceph: 19.2.0-47.el9cp (123a317ae596caa7f6d087fc76fffb6a736e0b5f) squid (stable)
      rook: v4.18.0-0.c455c6812cde09cfa9bc5bb03f3454db0208e263
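
      For reference, the OCP and ODF versions above can be gathered with standard commands (a sketch; the Ceph and Rook build details come from the ODF pods themselves):

      # OCP version
      $ oc get clusterversion version -o jsonpath='{.status.desired.version}'
      # ODF operator versions
      $ oc get csv -n openshift-storage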

       

      Is there any workaround available to the best of your knowledge?

      ---------------------------------------------------------------------------------------

      No

       

      Can this issue be reproduced? If so, please provide the hit rate

      ---------------------------------------------------------------------------------------

      Yes - so far I have reproduced it on both IBM Cloud and vSphere, on both ODF 4.17 and 4.18

       

      Steps to Reproduce:

      ---------------------------------------------------------------------------------------

      1. See the description above

       

      The exact date and time when the issue was observed, including timezone details:

      ---------------------------------------------------------------------------------------

      November 12th, 2024, 13:34

       

      Logs collected and log location:

      ---------------------------------------------------------------------------------------

      must-gather logs: https://drive.google.com/drive/folders/1IWscbqRNfWd-jhaPmxptkIXvA2wiZnIo?usp=drive_link
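
      If the logs need to be re-collected, must-gather for ODF is typically taken with the operator's dedicated image; the image path below is an assumption and varies by release:

      # NOTE: image path/tag is an assumption - check the ODF docs for the exact one for your release
      $ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.18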

       

      Assignee: Divyansh Kamboj (dkamboj@redhat.com)
      Reporter: Sagi Hirshfeld (rh-ee-shirshfe)
      Votes: 0
      Watchers: 16