OpenShift Bugs / OCPBUGS-17506

Make ValidatePrometheus status more accurate and its logs clearer

    • Bug
    • Resolution: Done-Errata
    • Normal
    • None
    • 4.14
    • Monitoring
    • None
    • Moderate
    • No
    • MON Sprint 256, MON Sprint 257
    • 2
    • False
    • NA
    • Release Note Not Required
    • In Progress

      Description of problem:

      While debugging https://docs.google.com/document/d/10kcIQPsn2H_mz7dJx3lbZR2HivjnC_FAnlt2adc53TY/edit#heading=h.egy1agkrq2v1, we came across this log line:
      
      2023-07-31T16:51:50.240749863Z W0731 16:51:50.240586       1 tasks.go:72] task 3 of 15: Updating Prometheus-k8s failed: [unavailable (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline, degraded (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline]
      
      After some searching, we understood that this log line means that ValidatePrometheus timed out while waiting for Prometheus to become ready.
      
      The 

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      See https://redhat-internal.slack.com/archives/C02BQCCFZPX/p1690892059971129?thread_ts=1690873617.023399&cid=C02BQCCFZPX for how to make the function time out.

      Actual results:

       

      Expected results:

      - Clearer logs.
      
      - Some of the information we currently log would make more sense as part of the returned error, for example: https://github.com/openshift/cluster-monitoring-operator/blob/af831de434ce13b3edc0260a468064e0f3200044/pkg/client/client.go#L890
      
      - Make prefixes such as "unavailable (unknown):" clearer, as we cannot understand what they mean without referring to the code.

      Additional info:

      - Do the same for the other functions that wait for other components, if they use the same wait mechanism (PollUntilContextTimeout...)
      
      - See https://redhat-internal.slack.com/archives/C02BQCCFZPX/p1690873617023399 for more details.
      
      - See https://redhat-internal.slack.com/archives/C0VMT03S5/p1691069196066359?thread_ts=1690827144.818209&cid=C0VMT03S5 for the Slack discussion.
      
      

              spasquie@redhat.com Simon Pasquier
              rh-ee-amrini Ayoub Mrini
              Junqi Zhao Junqi Zhao