Loading...

XML

Word

Printable

Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: 4.16.0
Affects Version/s: 4.15
Component/s: Monitoring
Labels:
None

Severity:
Moderate
Regression:
No
Sprint:
MON Sprint 246, MON Sprint 247
sprint_count:
2
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Release Note Text:
N/A
Release Note Type:
Release Note Not Required
Release Note Status:
In Progress
Target Version:

4.16.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

The monitoring operator may be down or disabled, and the components it manages may be unavailable or degraded.
Upon quick check I've noticed an error:

oc get co -o json | jq -r '.items[].status | select (.conditions) '.conditions | jq -r '.[] | select( (.type == "Degraded") and (.status == "True") )'

    {
      "lastTransitionTime": "2023-12-19T10:25:24Z",
      "message": "syncing Thanos Querier trusted CA bundle ConfigMap failed: reconciling trusted CA bundle ConfigMap failed: updating ConfigMap object failed: Timeout: request did not complete within requested timeout - context deadline exceeded, syncing Thanos Querier trusted CA bundle ConfigMap failed: deleting old trusted CA bundle configmaps failed: error listing configmaps in namespace openshift-monitoring with label selector monitoring.openshift.io/name=alertmanager,monitoring.openshift.io/hash!=2ua4n9ob5qr8o: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps), syncing Prometheus trusted CA bundle ConfigMap failed: deleting old trusted CA bundle configmaps failed: error listing configmaps in namespace openshift-monitoring with label selector monitoring.openshift.io/name=prometheus,monitoring.openshift.io/hash!=2ua4n9ob5qr8o: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps)",
      "reason": "MultipleTasksFailed",
      "status": "True",
      "type": "Degraded"
    }

i.e. updating ConfigMap object failed: Timeout: request did not complete within requested timeout - context deadline exceeded

I ran oc get co again and everything looked fine, it seems this timeout condition could be handled better to avoid alerting SRE.

Actual results:

operator degraded

Expected results:

operator retries operation

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

must-gather.tar.gz
1.23 MB
2023/12/19 2:26 PM

links to

openshift/cluster-monitoring-operator#2219: OCPBUGS-25676: fix(tasks): adjust 'trusted CA bundle ConfigMap' related logs for alertmanagers.

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update

Assignee:: Ayoub Mrini

Reporter:: Tomas Dabasinskas

QA Contact:: Tai Gao

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Created:: 2023/12/19 10:50 AM

Updated:: 2024/06/27 11:28 AM

Resolved:: 2024/06/27 11:28 AM

Details

Description

Description of problem:

Actual results:

Expected results:

Attachments

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates