OpenShift Monitoring / MON-4145

Runbook PrometheusDuplicateTimestamps commands


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • None
    • openshift-4.16.z, openshift-4.17.z
    • Prometheus
    • None
    • Quality / Stability / Reliability
    • False
    • False
    • NEW
    • NEW
    • Moderate

      ISSUE 1

      The runbook "PrometheusDuplicateTimestamps" [0] contains the command:

      $ oc -n $NAMESPACE logs -l 'app.kubernetes.io/name=prometheus' | \
      grep 'Error on ingesting samples with different value but same timestamp.*' \
      | sort | uniq -c | sort -n

      And the command:

      $ oc -n $NAMESPACE logs -l 'app.kubernetes.io/name=prometheus' | \
      grep 'Duplicate sample for timestamp.*' | sort | uniq -c | sort -n

      These commands return nothing. Consider a cluster where the problem is present:

      $ oc logs prometheus-k8s-0 -n openshift-monitoring | grep -c 'Error on ingesting samples with different value but same timestamp.*' 
      198
      $ oc logs prometheus-k8s-1 -n openshift-monitoring | grep -c 'Error on ingesting samples with different value but same timestamp.*' 
      200
      

      Now run the runbook command, dropping everything after the "grep" and adding the "-c" option to the grep:

      $ NAMESPACE="openshift-monitoring" 
      $ oc -n $NAMESPACE logs -l 'app.kubernetes.io/name=prometheus' | grep -c 'Error on ingesting samples with different value but same timestamp.*' 
      0 

      The runbook command returns 0 because "oc logs" with a label selector (-l) returns only the last 10 lines per pod by default. With two Prometheus pods that is always 20 lines in total, which may or may not contain the error:

      $ oc -n openshift-monitoring logs -l app.kubernetes.io/name=prometheus |wc -l 
      20
      $ oc -n openshift-monitoring logs prometheus-k8s-0 |wc -l 
      19019
      $ oc -n openshift-monitoring logs prometheus-k8s-1 |wc -l 
      19047
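
One possible way to keep the label selector and still read the complete logs is the --tail flag: when a selector is given, "oc logs" defaults to the last 10 lines per pod, and --tail=-1 requests all lines. A minimal sketch, where prom_logs_all is a hypothetical helper name and the fallback to the openshift-monitoring namespace is an assumption:

```shell
# Hypothetical helper: fetch the full logs of every pod matched by the label.
# Assumes NAMESPACE is set; falls back to openshift-monitoring otherwise.
prom_logs_all() {
  # --tail=-1 overrides the 10-lines-per-pod default that applies with -l.
  oc -n "${NAMESPACE:-openshift-monitoring}" logs \
    -l 'app.kubernetes.io/name=prometheus' --tail=-1
}

# Example use (not executed here): count matching lines across all matched pods.
# prom_logs_all | grep -c 'Error on ingesting samples with different value but same timestamp'
```

Output from all matched pods is merged into a single stream, so per-pod counts as shown above would still need one "oc logs" call per pod.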

      ISSUE 2

      The same two commands end with "| sort | uniq -c | sort -n":

      $ oc -n $NAMESPACE logs -l 'app.kubernetes.io/name=prometheus' | \
      grep 'Error on ingesting samples with different value but same timestamp.*' \
      | sort | uniq -c | sort -n
      
      
      $ oc -n $NAMESPACE logs -l 'app.kubernetes.io/name=prometheus' | \
      grep 'Duplicate sample for timestamp.*' | sort | uniq -c | sort -n

      Each log entry starts with its timestamp, as shown below, so every entry is unique. "| sort | uniq -c | sort -n" therefore never aggregates anything (every count is 1) and only consumes computational resources:

            1 ts=2025-01-24T16:08:41.846Z caller=scrape.go:1783 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/openshift-state-metrics/0 target=https://10.128.2.9:8443/metrics msg="Duplicate sample for timestamp" series="openshift_group_user_account{group=\"cluster-admins\",user=\"admin\"}"
            1 ts=2025-01-24T16:10:41.846Z caller=scrape.go:1783 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/openshift-state-metrics/0 target=https://10.128.2.9:8443/metrics msg="Duplicate sample for timestamp" series="openshift_group_user_account{group=\"cluster-admins\",user=\"admin\"}"
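
The effect of the timestamp on "uniq -c" can be reproduced without a cluster. The sketch below uses two sample lines modelled on the entries above (fields shortened for readability) and a hypothetical strip_ts helper that removes the leading "ts=..." field by pattern rather than by column position:

```shell
# Hypothetical helper: drop the leading "ts=<timestamp> " field so that
# otherwise identical messages compare equal.
strip_ts() {
  sed 's/^ts=[^ ]* //'
}

# Two sample lines that differ only in their timestamp (shortened, illustrative).
sample_logs() {
  cat <<'EOF'
ts=2025-01-24T16:08:41.846Z caller=scrape.go:1783 level=debug msg="Duplicate sample for timestamp"
ts=2025-01-24T16:10:41.846Z caller=scrape.go:1783 level=debug msg="Duplicate sample for timestamp"
EOF
}

# With timestamps every line is unique, so each line gets a count of 1.
sample_logs | sort | uniq -c | sort -n

# Without timestamps the two entries collapse into one line with a count of 2.
sample_logs | strip_ts | sort | uniq -c | sort -n
```

Matching on the "ts=" prefix avoids depending on the timestamp always being 28 characters wide, which a fixed "cut -c29-" assumes.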

       

      Suggestion

      A command like the following could be used instead:

      $ pods=$(oc -n $NAMESPACE get pods -l 'app.kubernetes.io/name=prometheus' -o jsonpath='{.items[*].metadata.name}')
      
      $ for pod in $pods; do oc -n $NAMESPACE logs $pod; done | cut -c29- | sort | uniq -c | sort -n
          212 caller=scrape.go:1783 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/openshift-state-metrics/0 target=https://10.128.2.9:8443/metrics msg="Duplicate sample for timestamp" series="openshift_group_user_account{group=\"cluster-admins\",user=\"admin\"}"
      
      $ for pod in $pods; do oc -n $NAMESPACE logs $pod; done | \
      grep 'Error on ingesting samples with different value but same timestamp.*' | cut -c29- | sort | uniq -c | sort -n
          433 caller=scrape.go:1744 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/openshift-state-metrics/0 target=https://10.128.2.9:8443/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=1
      
      

      [0] https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/PrometheusDuplicateTimestamps.md

              Unassigned Unassigned
              rhn-support-ocasalsa Oscar Casal Sanchez