OpenShift Bugs / OCPBUGS-36918

`PrometheusRemoteStorageFailures` alert failed to trigger


    • Bug
    • Resolution: Done-Errata
    • Undefined
    • 4.16.z
    • 4.16
    • Monitoring
    • Moderate
    • No
    • MON Sprint 256, MON Sprint 257
    • 2
    • False
       * Previously, the `PrometheusRemoteWriteBehind` alert was only triggered after Prometheus sent data to the `remote-write` endpoint on at least one occasion. With this release, the alert now also triggers if a connection could never be established with the endpoint, such as when an error exists with the endpoint URL from the time you added it to the `remote-write` endpoint configuration. (link:https://issues.redhat.com/browse/OCPBUGS-36918[*OCPBUGS-36918*])
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-35483. The following is the description of the original issue:

      Description of problem:

       

      Version-Release number of selected component (if applicable):

      4.16.0-0.nightly-2024-06-13-084629

      How reproducible:

      100%

      Steps to Reproduce:

      1. Apply the following ConfigMap:
      *****
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: cluster-monitoring-config
        namespace: openshift-monitoring
      data:
        config.yaml: |
          prometheusK8s:
            remoteWrite:
              - url: "http://invalid-remote-storage.example.com:9090/api/v1/write"
                queue_config:
                  max_retries: 1
      *****
      
      2. Check the Prometheus logs:
      % oc logs -c prometheus prometheus-k8s-0 -n openshift-monitoring
      ...
      ts=2024-06-14T01:28:01.804Z caller=dedupe.go:112 component=remote level=warn remote_name=5ca657 url=http://invalid-remote-storage.example.com:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://invalid-remote-storage.example.com:9090/api/v1/write\": dial tcp: lookup invalid-remote-storage.example.com on 172.30.0.10:53: no such host"
      
      3. After 15 minutes, query the alert and the failure metric:
      % oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="PrometheusRemoteStorageFailures"}' | jq
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100   145  100    78  100    67    928    797 --:--:-- --:--:-- --:--:--  1726
      {
        "status": "success",
        "data": {
          "resultType": "vector",
          "result": [],
          "analysis": {}
        }
      }
      
      % oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=prometheus_remote_storage_failures_total' | jq
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100   124  100    78  100    46   1040    613 --:--:-- --:--:-- --:--:--  1653
      {
        "status": "success",
        "data": {
          "resultType": "vector",
          "result": [],
          "analysis": {}
        }
      }
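
Both queries above return `"status": "success"` with an empty `result` list, which means the query ran but matched no series. A minimal sketch of that check, assuming only the response shape shown in the output above (the helper name `has_series` is hypothetical):

```python
import json


def has_series(response_text: str) -> bool:
    """Return True if an instant-query response contains at least one series."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        # The query itself failed; this is different from "no data".
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    return len(body["data"]["result"]) > 0


# Response body copied from the output in step 3.
empty_response = (
    '{"status": "success", "data": '
    '{"resultType": "vector", "result": [], "analysis": {}}}'
)

print(has_series(empty_response))  # False
```

An empty vector for `ALERTS{alertname="PrometheusRemoteStorageFailures"}` means the alert is neither pending nor firing at query time.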
      

      Actual results:

      The alert did not trigger.

      Expected results:

      The alert triggers, and both the alert and the metrics are visible.

      Additional info:

      The following metrics show `No datapoints found.`:
      prometheus_remote_storage_failures_total
      prometheus_remote_storage_samples_dropped_total
      prometheus_remote_storage_retries_total
      The value of `prometheus_remote_storage_samples_failed_total` is 0.
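
The observations above suggest why a ratio-based alert can stay silent when the endpoint was never reachable. The sketch below is an illustration, not the shipped rule: it assumes the alert compares the percentage of failed samples, failed / (failed + succeeded), against a threshold of 1, and it models PromQL instant vectors as plain dictionaries. The types and function names are hypothetical; only the `remote_name` value comes from the log line in step 2.

```python
import math
from typing import Dict

# An instant vector modelled as {label-set string: value}.
# Hypothetical helper type for illustration; this is not Prometheus code.
Vector = Dict[str, float]


def failure_percent(failed: Vector, succeeded: Vector) -> Vector:
    """Percent of failed samples per series, PromQL-style.

    Binary operators only produce output for label sets present on both
    sides, so a series missing from either input is missing from the result.
    """
    out: Vector = {}
    for labels in failed.keys() & succeeded.keys():
        total = failed[labels] + succeeded[labels]
        # Mirror float semantics: 0/0 is NaN rather than an error.
        out[labels] = math.nan if total == 0 else 100 * failed[labels] / total
    return out


def firing(failed: Vector, succeeded: Vector, threshold: float = 1.0) -> Vector:
    """Keep only series above the threshold; NaN never compares greater."""
    percent = failure_percent(failed, succeeded)
    return {k: v for k, v in percent.items() if v > threshold}


queue = 'remote_name="5ca657"'

# No series exported at all: there is nothing to evaluate.
print(firing({}, {}))                       # {}
# Failed rate is 0 and nothing succeeded either: 0/0 is NaN, never > 1.
print(firing({queue: 0.0}, {queue: 0.0}))   # {}
# Endpoint worked before and now fails 5% of samples: condition holds.
print(firing({queue: 5.0}, {queue: 95.0}))  # {'remote_name="5ca657"': 5.0}
```

Under these assumptions, an endpoint that has never accepted a sample yields either no series or a 0/0 ratio, and neither can exceed the threshold.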

              rh-ee-amrini Ayoub Mrini
              openshift-crt-jira-prow OpenShift Prow Bot
              Tai Gao
              Eliska Romanova