OpenShift Bugs / OCPBUGS-610

Better message in the CMO degraded/unavailable conditions when alertmanager can't be scheduled


Details

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Normal
    • Affects Version/s: 4.12
    • Component/s: Monitoring
    • Severity: Moderate

    Description

      Description of problem:

      CMO should report a clearer message in its Degraded/Unavailable conditions when the Alertmanager pods can't be scheduled.
      

      Version-Release number of selected component (if applicable):

      4.12.0-0.nightly-2022-08-24-053339
      

      How reproducible:

      always
      
      

      Steps to Reproduce:

      1. Configure Alertmanager with an invalid volume claim template (for instance, an unknown storage class):
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: cluster-monitoring-config
        namespace: openshift-monitoring
      data:
        config.yaml: |
          alertmanagerMain:
            volumeClaimTemplate:
              metadata:
                name: monitorpvc
              spec:
                storageClassName: foo
                volumeMode: Filesystem
                resources:
                  requests:
                    storage: 1Gi
      2. Wait for CMO to go Degraded (example commands below).
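      A minimal way to drive this, assuming the ConfigMap above is saved locally as cluster-monitoring-config.yaml (the file name is only illustrative):

      % oc apply -f cluster-monitoring-config.yaml
      % oc get clusteroperator monitoring --watch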
      

      Actual results:

      % oc -n openshift-monitoring describe pod alertmanager-main-0 |tail -n 10
      QoS Class:                   Burstable
      Node-Selectors:              kubernetes.io/os=linux
      Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
      Events:
        Type     Reason            Age   From               Message
        ----     ------            ----  ----               -------
        Warning  FailedScheduling  13m   default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
        Warning  FailedScheduling  13m   default-scheduler  0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
      % oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded" or .type=="Available"))'
      [
        {
          "lastTransitionTime": "2022-08-25T08:51:55Z",
          "reason": "AsExpected",
          "status": "True",
          "type": "Available"
        },
        {
          "lastTransitionTime": "2022-08-25T08:51:55Z",
          "message": "waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas",
          "reason": "UpdatingAlertmanagerFailed",
          "status": "True",
          "type": "Degraded"
        }
      ]
      
      

      Expected results:

      CMO should surface a better explanation of why the pods aren't in the desired state, and the Available condition should be False.
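
      Purely as an illustration of the desired behavior (not actual CMO output), the Available condition could fold the scheduler event into its message, for example:

      [
        {
          "message": "waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas: pod alertmanager-main-0 cannot be scheduled: 0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.",
          "reason": "UpdatingAlertmanagerFailed",
          "status": "False",
          "type": "Available"
        }
      ]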
      
      

      Additional info:

      The fix for the bug below makes the monitoring ClusterOperator reflect the status of the Prometheus pods and the Prometheus Operator; however, it still fails to reflect the status of the Alertmanager pods:
      https://bugzilla.redhat.com/show_bug.cgi?id=2043518
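
      For reference, the unbound PVC blocking scheduling can be confirmed with something like the following (exact output will vary):

      % oc -n openshift-monitoring get pvc
      % oc -n openshift-monitoring get pods | grep alertmanager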
      

      People

        Assignee: Sunil Thaha (sthaha@redhat.com)
        Reporter: Hongyan Li (hongyli@redhat.com)
