OpenShift Bugs / OCPBUGS-610

Better message in the CMO degraded/unavailable conditions when alertmanager can't be scheduled


Details

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Normal
    • Affects Version/s: 4.12
    • Component/s: Monitoring
    • Severity: Moderate

    Description

      Description of problem:

      CMO should report a clearer message in its Degraded/Unavailable conditions when the Alertmanager pods can't be scheduled.
      

      Version-Release number of selected component (if applicable):

      4.12.0-0.nightly-2022-08-24-053339
      

      How reproducible:

      always
      
      

      Steps to Reproduce:

      1. Configure Alertmanager with an invalid volume claim template (for instance, an unknown storage class):
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: cluster-monitoring-config
        namespace: openshift-monitoring
      data:
        config.yaml: |
          alertmanagerMain:
            volumeClaimTemplate:
              metadata:
                name: monitorpvc
              spec:
                storageClassName: foo
                volumeMode: Filesystem
                resources:
                  requests:
                    storage: 1Gi
      2. Wait for CMO to go Degraded (example commands below).
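      A minimal way to drive this, assuming the ConfigMap above is saved locally as cluster-monitoring-config.yaml (the file name is only illustrative):

      % oc apply -f cluster-monitoring-config.yaml
      % oc get clusteroperator monitoring --watch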
      

      Actual results:

      % oc -n openshift-monitoring describe pod alertmanager-main-0 |tail -n 10
      QoS Class:                   Burstable
      Node-Selectors:              kubernetes.io/os=linux
      Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
      Events:
        Type     Reason            Age   From               Message
        ----     ------            ----  ----               -------
        Warning  FailedScheduling  13m   default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
        Warning  FailedScheduling  13m   default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
      % oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
      [
        {
          "lastTransitionTime": "2022-08-25T08:51:55Z",
          "reason": "AsExpected",
          "status": "True",
          "type": "Available"
        },
        {
          "lastTransitionTime": "2022-08-25T08:51:55Z",
          "message": "waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas",
          "reason": "UpdatingAlertmanagerFailed",
          "status": "True",
          "type": "Degraded"
        }
      ]
      
      

      Expected results:

      CMO should surface a better explanation of why the pods aren't in the desired state, and the Available condition should be False.
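
      Purely as an illustration of the desired behavior (not actual CMO output), the Available condition could fold the scheduler event into its message, for example:

      [
        {
          "message": "waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas: pod alertmanager-main-0 cannot be scheduled: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.",
          "reason": "UpdatingAlertmanagerFailed",
          "status": "False",
          "type": "Available"
        }
      ]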
      
      

      Additional info:

      The fix for the bug below makes the monitoring ClusterOperator reflect the status of the Prometheus pods and the Prometheus Operator; however, it still fails to reflect the status of the Alertmanager pods:
      https://bugzilla.redhat.com/show_bug.cgi?id=2043518
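
      For reference, the unbound PVC blocking scheduling can be confirmed with something like the following (exact output will vary):

      % oc -n openshift-monitoring get pvc
      % oc -n openshift-monitoring get pods | grep alertmanager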
      

      People

        Assignee: Sunil Thaha (sthaha@redhat.com)
        Reporter: Hongyan Li (hongyli@redhat.com)
