Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: openshift-4.16.z, openshift-4.17.z
Component/s: Prometheus
Labels: None

Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Ready: False
Docs QE Status: NEW
QE Status: NEW
Severity: Moderate

ISSUE 1.

The runbook "PrometheusDuplicateTimestamps" [0] contains the command:

$ oc -n $NAMESPACE logs -l 'app.kubernetes.io/name=prometheus' | \
grep 'Error on ingesting samples with different value but same timestamp.*' \
| sort | uniq -c | sort -n

And the command:

$ oc -n $NAMESPACE logs -l 'app.kubernetes.io/name=prometheus' | \
grep 'Duplicate sample for timestamp.*' | sort | uniq -c | sort -n

These commands don't return anything. Let's review a cluster where the problem is present:

$ oc logs prometheus-k8s-0 -n openshift-monitoring | grep -c 'Error on ingesting samples with different value but same timestamp.*' 
198
$ oc logs prometheus-k8s-1 -n openshift-monitoring | grep -c 'Error on ingesting samples with different value but same timestamp.*' 
200

Now run the runbook command with everything after the "grep" removed and the "-c" option added to grep:

$ NAMESPACE="openshift-monitoring" 
$ oc -n $NAMESPACE logs -l 'app.kubernetes.io/name=prometheus' | grep -c 'Error on ingesting samples with different value but same timestamp.*' 
0

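The runbook command returns 0 because "oc logs" with a label selector (-l) prints only the last 10 lines per pod by default. The command below therefore always returns 20 lines for the two Prometheus pods, and those lines may or may not contain the error: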

$ oc -n openshift-monitoring logs -l app.kubernetes.io/name=prometheus |wc -l 
20
$ oc -n openshift-monitoring logs prometheus-k8s-0 |wc -l 
19019
$ oc -n openshift-monitoring logs prometheus-k8s-1 |wc -l 
19047
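
One possible way to keep the label selector and still read the full logs is to pass "--tail=-1", which overrides the 10-lines-per-pod default that applies when -l is used. The sketch below is only an illustration: the function name count_ingest_errors is hypothetical, and it assumes the same label and message text as the runbook.

```shell
# Count matching log lines across every pod selected by the label.
# --tail=-1 requests all log lines; with -l the default is 10 lines per pod.
# count_ingest_errors is a hypothetical helper name, not part of the runbook.
count_ingest_errors() {
  oc -n "$1" logs --tail=-1 -l 'app.kubernetes.io/name=prometheus' |
    grep -c 'Error on ingesting samples with different value but same timestamp' || true
}

# Usage: count_ingest_errors openshift-monitoring
```

The "|| true" only keeps the function from returning a failure status when grep finds no match; grep still prints 0 in that case.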

ISSUE 2.

Both commands cited above end with "| sort | uniq -c | sort -n":

$ oc -n $NAMESPACE logs -l 'app.kubernetes.io/name=prometheus' | \
grep 'Error on ingesting samples with different value but same timestamp.*' \
| sort | uniq -c | sort -n


$ oc -n $NAMESPACE logs -l 'app.kubernetes.io/name=prometheus' | \
grep 'Duplicate sample for timestamp.*' | sort | uniq -c | sort -n

Because every log entry starts with its own timestamp, as shown below, every line is unique. "uniq -c" therefore reports a count of 1 for each line, and "| sort | uniq -c | sort -n" only consumes computational resources:

        1 ts=2025-01-24T16:08:41.846Z caller=scrape.go:1783 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/openshift-state-metrics/0 target=https://10.128.2.9:8443/metrics msg="Duplicate sample for timestamp" series="openshift_group_user_account{group=\"cluster-admins\",user=\"admin\"}"
      1 ts=2025-01-24T16:10:41.846Z caller=scrape.go:1783 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/openshift-state-metrics/0 target=https://10.128.2.9:8443/metrics msg="Duplicate sample for timestamp" series="openshift_group_user_account{group=\"cluster-admins\",user=\"admin\"}"
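
The effect can be reproduced without a cluster. The sketch below feeds two shortened sample entries (abbreviated from the output above, differing only in their timestamp) through the pipeline, once as-is and once with the leading "ts=..." field removed. The helper names strip_ts and sample_logs are hypothetical, and sed is used here instead of a fixed column cut only as an illustration that does not depend on the timestamp width.

```shell
# Remove the leading "ts=<timestamp> " field so identical messages compare equal.
strip_ts() {
  sed 's/^ts=[^ ]* //'
}

# Two sample entries that differ only in their timestamp.
sample_logs() {
  cat <<'EOF'
ts=2025-01-24T16:08:41.846Z caller=scrape.go:1783 level=debug msg="Duplicate sample for timestamp"
ts=2025-01-24T16:10:41.846Z caller=scrape.go:1783 level=debug msg="Duplicate sample for timestamp"
EOF
}

# With timestamps: two distinct lines, each counted once.
sample_logs | sort | uniq -c | sort -n

# Without timestamps: the two entries collapse into one line with a count of 2.
sample_logs | strip_ts | sort | uniq -c | sort -n
```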

Suggestion

A command like the following could be used instead:

$ pods=$(oc -n "$NAMESPACE" get pods -l 'app.kubernetes.io/name=prometheus' -o jsonpath='{.items[*].metadata.name}')

$ for pod in $(echo $pods); do oc -n $NAMESPACE logs $pod; done | cut -c29- | sort | uniq -c | sort -n
    212 caller=scrape.go:1783 level=debug component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/openshift-state-metrics/0 target=https://10.128.2.9:8443/metrics msg="Duplicate sample for timestamp" series="openshift_group_user_account{group=\"cluster-admins\",user=\"admin\"}"

$ for pod in $(echo $pods); do oc -n $NAMESPACE logs $pod; done  | \
grep 'Error on ingesting samples with different value but same timestamp.*' | cut -c29- | sort | uniq -c | sort -n
    433 caller=scrape.go:1744 level=warn component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/openshift-state-metrics/0 target=https://10.128.2.9:8443/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=1

[0] https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/PrometheusDuplicateTimestamps.md

links to

[KCS] How to troubleshoot PrometheusDuplicateTimestamps alerts in RHOCP 4
