OpenShift Bugs / OCPBUGS-25676

monitoring ClusterOperator should better handle timeouts


Details

    • Bug
    • Resolution: Unresolved
    • Normal
    • 4.16.0
    • 4.15
    • Monitoring
    • None
    • Moderate
    • No
    • MON Sprint 246, MON Sprint 247
    • 2
    • False

    Description

      Description of problem:

       The monitoring operator may be down or disabled, and the components it manages may be unavailable or degraded.
      On a quick check, I noticed the following error:

      oc get co -o json | jq -r '.items[].status.conditions[]? | select(.type == "Degraded" and .status == "True")'
      
          {
            "lastTransitionTime": "2023-12-19T10:25:24Z",
            "message": "syncing Thanos Querier trusted CA bundle ConfigMap failed: reconciling trusted CA bundle ConfigMap failed: updating ConfigMap object failed: Timeout: request did not complete within requested timeout - context deadline exceeded, syncing Thanos Querier trusted CA bundle ConfigMap failed: deleting old trusted CA bundle configmaps failed: error listing configmaps in namespace openshift-monitoring with label selector monitoring.openshift.io/name=alertmanager,monitoring.openshift.io/hash!=2ua4n9ob5qr8o: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps), syncing Prometheus trusted CA bundle ConfigMap failed: deleting old trusted CA bundle configmaps failed: error listing configmaps in namespace openshift-monitoring with label selector monitoring.openshift.io/name=prometheus,monitoring.openshift.io/hash!=2ua4n9ob5qr8o: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)",
            "reason": "MultipleTasksFailed",
            "status": "True",
            "type": "Degraded"
          } 

      i.e. updating ConfigMap object failed: Timeout: request did not complete within requested timeout - context deadline exceeded

      I ran oc get co again and everything looked fine, so the failure appears to have been transient. It seems this timeout condition could be handled better to avoid alerting SRE.
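One way to tell this kind of transient failure apart from a persistent one is to match the condition message against the timeout strings quoted above. The following is only a minimal sketch: the function name is_timeout_message is hypothetical, and the pattern list is an assumption taken from the single Degraded message in this report, so it is not exhaustive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if the message looks like a transient
# API timeout, 1 otherwise. The patterns are copied from the Degraded
# message quoted in this report; the list is an assumption, not exhaustive.
is_timeout_message() {
  case "$1" in
    *"context deadline exceeded"*) return 0 ;;
    *"request did not complete within requested timeout"*) return 0 ;;
    *"unable to return a response in the time allotted"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage with a fragment of the message from this report.
msg='updating ConfigMap object failed: Timeout: request did not complete within requested timeout - context deadline exceeded'
if is_timeout_message "$msg"; then
  echo "transient timeout"
else
  echo "other failure"
fi
```

String matching on error text is fragile, so this is only suitable for ad hoc triage from the command line, not as a substitute for typed error handling inside the operator.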

      Actual results:

      The operator reports Degraded=True.

      Expected results:

      The operator retries the operation.
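To illustrate the expected behaviour, here is a rough shell sketch of a bounded retry with exponential backoff. It is not how the monitoring operator is implemented; the function name retry_with_backoff, its arguments, and the doubling policy are all assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run a command up to max_attempts times, doubling the
# delay (in whole seconds) after each failure. Names and policy are assumptions.
retry_with_backoff() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  while true; do
    # Run the command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is used.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example (not executed here; requires a logged-in oc client):
#   retry_with_backoff 5 1 oc get configmaps -n openshift-monitoring \
#     -l monitoring.openshift.io/name=prometheus
```

With a policy along these lines, the failure would only be surfaced (for example, as Degraded) after the attempt budget is exhausted, rather than on the first timed-out request.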

          People

            rh-ee-amrini Ayoub Mrini
            todabasi.openshift Tomas Dabasinskas
            Tai Gao Tai Gao