  1. OpenShift Bugs
  2. OCPBUGS-48747

openshift-state-metrics generates duplicate "openshift_route_status" metrics


    • Bug
    • Resolution: Unresolved
    • Major
    • 4.20.0
    • 4.16.z
    • Monitoring
    • None
    • Quality / Stability / Reliability
    • False
    • None
    • Important
    • None
    • None
    • None
    • MON Sprint 270, MON Sprint 271
    • 2
    • Done
    • Release Note Not Required
    • None
    • None
    • None

      Description of problem:

      Customer upgraded from OpenShift Container Platform 4.15 to OpenShift Container Platform 4.16.28 and is now seeing "PrometheusDuplicateTimestamp" alerts.
      
      Specifically, Prometheus logs the following warning for the scrape pool of the openshift-state-metrics ServiceMonitor:
      
      ts=2025-01-21T00:10:33.052Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/openshift-state-metrics/0 target=https://10.125.8.88:8443/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=16
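
For triage across many such warnings, it can help to pull the scrape pool, target, and drop count out of each line. The following is a minimal sketch, assuming the Prometheus log uses logfmt-style `key=value` pairs as in the line above; `parse_logfmt` is a hypothetical helper written for this illustration, not part of Prometheus.

```python
import shlex


def parse_logfmt(line):
    """Parse a logfmt-style line into a dict (later keys overwrite earlier ones)."""
    fields = {}
    # shlex.split keeps quoted values such as component="scrape manager" together.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


# The warning quoted in this report, unchanged.
LOG_LINE = (
    'ts=2025-01-21T00:10:33.052Z caller=scrape.go:1738 level=warn '
    'component="scrape manager" '
    'scrape_pool=serviceMonitor/openshift-monitoring/openshift-state-metrics/0 '
    'target=https://10.125.8.88:8443/metrics '
    'msg="Error on ingesting samples with different value but same timestamp" '
    'num_dropped=16'
)

fields = parse_logfmt(LOG_LINE)
print(fields["scrape_pool"], fields["target"], int(fields["num_dropped"]))
```

Summing `num_dropped` per `scrape_pool` over a longer log would show whether openshift-state-metrics is the only affected target.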
      
      Querying the endpoint manually shows that the "openshift_route_status" metric is reported multiple times for the same label set. In the output below every duplicate carries the same value, but the Prometheus warning above ("different value but same timestamp") suggests that the duplicates sometimes differ in value:
      
      ~~~
      $ curl --cacert /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt --key /etc/prometheus/secrets/metrics-client-certs/tls.key --cert /etc/prometheus/secrets/metrics-client-certs/tls.crt -k https://10.128.2.7:8443/metrics | sort | uniq -c | sort -n 
      
      [..]
            2 openshift_route_status{namespace="argoant",route="florida-gateway-florida-gateway",status="True",type="Admitted",host="blue.example.com",router_name="default"} 1
            2 openshift_route_status{namespace="blue-staging",route="node",status="True",type="Admitted",host="node-blue-staging.example.com",router_name="default"} 1
            2 openshift_route_status{namespace="blue-staging",route="noding",status="True",type="Admitted",host="noding-blue-staging.example.com",router_name="default"} 1
            2 openshift_route_status{namespace="devspaces-operator",route="devworkspace-che-test-2d0cf27f",status="True",type="Admitted",host="devworkspace-che-test-2d0cf27f-devspaces-operator.example.com",router_name="default"} 1
            2 openshift_route_status{namespace="devspaces-operator",route="devworkspace-che-test-33ace4c4",status="True",type="Admitted",host="devworkspace-che-test-33ace4c4-devspaces-operator.example.com",router_name="default"} 1
      [..]
      ~~~
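
The `sort | uniq -c` pipeline above only counts byte-identical lines, so it would miss duplicates of the same series that carry different values. The following is a minimal sketch of a check that groups samples by series (metric name plus label set) and reports whether the duplicates agree. It assumes the `/metrics` output has been saved as text; `find_duplicate_series` is a hypothetical helper, the parsing is deliberately simplified, and the `hypothetical_metric` lines are invented to illustrate the conflicting case, which the report mentions but does not show.

```python
from collections import defaultdict


def find_duplicate_series(exposition_text):
    """Return {series: [values, ...]} for every series that appears more than once."""
    values_by_series = defaultdict(list)
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE/comment lines
        closing_brace = line.rfind("}")
        if closing_brace != -1:
            # Labelled sample: the series ends at the last closing brace.
            series = line[: closing_brace + 1]
            rest = line[closing_brace + 1 :].split()
        else:
            # Unlabelled sample: "name value [timestamp]".
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if rest:
            values_by_series[series].append(rest[0])
    return {s: v for s, v in values_by_series.items() if len(v) > 1}


# Two lines are copied from the output above; the hypothetical_metric lines are invented.
SAMPLE = """\
# comment lines are skipped
openshift_route_status{namespace="blue-staging",route="node",status="True",type="Admitted",host="node-blue-staging.example.com",router_name="default"} 1
openshift_route_status{namespace="blue-staging",route="node",status="True",type="Admitted",host="node-blue-staging.example.com",router_name="default"} 1
openshift_route_status{namespace="blue-staging",route="noding",status="True",type="Admitted",host="noding-blue-staging.example.com",router_name="default"} 1
hypothetical_metric{route="a"} 0
hypothetical_metric{route="a"} 1
"""

duplicates = find_duplicate_series(SAMPLE)
for series, values in duplicates.items():
    kind = "conflicting values" if len(set(values)) > 1 else "identical values"
    print(f"{len(values)}x ({kind}): {series}")
```

Only the duplicates with conflicting values should correspond to the "different value but same timestamp" warning, so this split may help narrow down which Routes trigger the alert.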

      Version-Release number of selected component (if applicable):

      OpenShift Container Platform 4.16.28

      How reproducible:

      In the customer's environment

      Steps to Reproduce:

          1. Upgrade to OpenShift Container Platform 4.16.28
          2. Have multiple Routes in the cluster   

      Actual results:

      The PrometheusDuplicateTimestamp alert fires because "openshift-state-metrics" exposes duplicate "openshift_route_status" metrics

      Expected results:

      No metrics are duplicated by "openshift-state-metrics"

      Additional info:

      - Prometheus logs available in Support Case 04037739
      - ServiceMonitor output available in Support Case 04037739
      - Full openshift-state-metrics "/metrics" output available in Support Case 04037739

              prasriva@redhat.com Pranshu Srivastava
              rhn-support-skrenger Simon Krenger
              None
              None
              Junqi Zhao
              Eliska Romanova