OpenShift Bugs / OCPBUGS-35892

monitoring ClusterOperator should not blip Available=Unknown on client rate limiter

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Normal
    • Affects Version: 4.17
    • Component: Monitoring
    • Severity: Moderate
    • Sprint: MON Sprint 256, MON Sprint 257
    • Story Points: 2
    • Release Note Type: Release Note Not Required
    • Status: In Progress

      Description of problem

      I bumped into monitoring going Available!=True in 4.17 CI, and saw it linking the shipped-in-4.15 OCPBUGS-23745 exception. I've cloned this bug to report the issues we're still seeing. All of the 4.17 hits over the past 96 hours have Available going Unknown on UpdatingPrometheusFailed and mentioning client rate limiter.

      $ curl -s 'https://search.dptools.openshift.org/search?maxAge=96h&type=junit&context=0&name=periodic-.*4.17&search=clusteroperator/monitoring+condition/Available.*status/%5BUF%5D' | jq -r 'to_entries[].value | to_entries[].value[].context[]'
      Jun 18 16:55:17.636 - 2266s E clusteroperator/monitoring condition/Available reason/Unknown status/False Unknown (exception: We are not worried about Available=False or Degraded=True blips for stable-system tests yet.)
      Jun 17 12:17:22.197 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
      Jun 17 12:13:51.618 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
      Jun 17 12:20:39.844 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
      Jun 17 23:45:45.828 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
      Jun 18 18:41:31.933 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
      Jun 20 02:56:45.777 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
      

      Version-Release number of selected component

      I'm only looking at dev/4.17, for any forward-looking improvements.

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=96h&type=junit&search=clusteroperator/monitoring+should+not+change+condition/Available' | grep '^periodic-.*4[.]17.*failures match' | sort
      periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 25 runs, 20% failed, 20% of failures match = 4% impact
      periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade (all) - 120 runs, 28% failed, 18% of failures match = 5% impact
      

      How reproducible

      5% impact in periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade is the current largest impact percentage on the most-frequently-run matching job.
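
      For a rough sanity check on that figure, assuming CI Search's impact number is simply the failure rate times the match rate, the arithmetic works out:

      $ awk 'BEGIN { runs=120; failed=0.28*runs; matching=0.18*failed; printf "%.0f failed runs, ~%.0f matching, ~%.0f%% impact\n", failed, matching, 100*matching/runs }'
      34 failed runs, ~6 matching, ~5% impact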

      Steps to Reproduce

      Run periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade or another job with a combination of high-ish impact percentage and high run counts, watching the monitoring ClusterOperator's Available condition.
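
      Outside the CI tooling, a minimal way to watch for these blips by hand (a rough sketch, not what the origin monitor does) is to poll the condition during the upgrade:

      $ while sleep 30; do
          oc get clusteroperator monitoring -o jsonpath='{range .status.conditions[?(@.type=="Available")]}{.lastTransitionTime} {.status} {.reason}: {.message}{"\n"}{end}'
        done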

      Actual results

      Blips of Available=Unknown that resolve more quickly than a responding admin could be expected to show up.

      Expected results

      Only going Available!=True when it seems reasonable to summon an emergency admin response.

      Additional context

      The matching results from my CI Search query above are these runs:

      $ curl -s 'https://search.dptools.openshift.org/search?maxAge=96h&type=junit&context=0&name=periodic-.*4.17&search=clusteroperator/monitoring+condition/Available.*status/%5BUF%5D' | jq -r 'keys[]'
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-upgrade-aws-ovn-arm64/1803081477055844352
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645325958090752
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645333503643648
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645338536808448
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802819236817539072
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1803104183897821184
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1803592572216545280
      

      Comments

            W. Trevor King added a comment -

            Coming back around to this, and running my original query from the Description again, but this time for 4.18 instead of 4.17:

            $ curl -s 'https://search.dptools.openshift.org/search?maxAge=96h&type=junit&context=0&name=periodic-.*4.18&search=clusteroperator/monitoring+condition/Available.*status/%5BUF%5D' | jq -r 'to_entries[] | .key + "\n" + (.value | to_entries[].value[].context | join("\n"))' | grep -v 'expected availability transition during upgrade'
            https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-metal-ovn-single-node-rt-upgrade-test/1874258998442594304
            Jan 01 03:32:40.820 - 913s  E clusteroperator/monitoring condition/Available reason/MultipleTasksFailed status/False UpdatingAlertmanager: reconciling Alertmanager Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io alertmanager-main), UpdatingUserWorkloadThanosRuler: deleting Thanos Ruler Route failed: the server is currently unable to handle the request (delete routes.route.openshift.io thanos-ruler), UpdatingThanosQuerier: reconciling Thanos Querier Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-querier), UpdatingUserWorkloadPrometheus: deleting UserWorkload federate Route failed: the server is currently unable to handle the request (delete routes.route.openshift.io federate), UpdatingPrometheus: reconciling Prometheus API Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s)
            https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-metal-ovn-single-node-rt-upgrade-test/1874214154449981440
            https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-upgrade-from-stable-4.18-e2e-aws-upgrade-ovn-single-node/1873572334493765632
            https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-upgrade-from-stable-4.18-e2e-metal-ovn-single-node-rt-upgrade/1874214348725948416
            

            So it's only single-node, and it looks like there's an exception in place in the 4.19 origin suite covering "expected availability transition during upgrade" on single-node going forward. Thanks!

            Ayoub Mrini added a comment -

            trking, I'll go ahead and close this ticket based on my latest update. However, I know you won't hesitate to reopen it or create a new one if there's anything else we can improve on.

            Ayoub Mrini added a comment - - edited

            FYI trking
            A fix for https://issues.redhat.com/browse/OCPBUGS-17506 was recently merged (for 4.17; we aren't considering backporting it for now).
            We should now see more accurate error messages instead of "client rate limiter Wait returned an error: context deadline exceeded",
            and more accurate statuses (Available/Degraded) will be set.

            W. Trevor King added a comment -

            I'm happy to take another look at CI Search results after OCPBUGS-17506 has been fixed for a while, once we get to that point, so we can measure instead of guessing.
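
            For example, narrowing the Description's query to the specific message (the exact search regex here is just illustrative) should show whether the client-rate-limiter flavor dries up once builds with the fix are in CI:

            $ curl -s 'https://search.dptools.openshift.org/search?maxAge=96h&type=junit&context=0&name=periodic-.*4.17&search=clusteroperator/monitoring+condition/Available.*client+rate+limiter' | jq -r 'keys[]'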

            Ayoub Mrini added a comment - - edited

            > But during a healthy roll like this, I don't think we should even be Degraded=True. E.g. if at t0 Prom-0 is down but Prom-1 is up, and at t5m Prom-0 is up but Prom-1 is down, and at t10m Prom-0 is up but Prom-1 is still down, and at t15m both Prom are happy again, that's just a healthy roll, working as expected, and not unexpected degradation or some quality-of-service violation we'd expect to improve on.

            I was assuming that at t15m, only one of the two Prom Pods was still available; that's why I talked about Available=True Degraded=True.
            Of course, if both of them are ready, CMO should yield Available=True Degraded=False.
            CMO should already know how to distinguish between these two cases, see https://github.com/openshift/cluster-monitoring-operator/blob/5341b80d4868fb1a330b14c8beff4c3ebba01b69/pkg/client/client.go#L918-L967
            But as discussed in https://issues.redhat.com/browse/OCPBUGS-17506, that util's retrying mechanism kind of swallows the more relevant findings and we should fix that.
            I think fixing OCPBUGS-17506 should also fix OCPBUGS-35892.
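
            As a rough CLI approximation of the one-ready-versus-both-ready distinction above (not the CMO code path itself), the Prometheus StatefulSet's ready count tells the cases apart:

            $ oc -n openshift-monitoring get statefulset prometheus-k8s -o jsonpath='{.status.readyReplicas}/{.spec.replicas}{"\n"}'

            2/2 would be Available=True Degraded=False, 1/2 during a roll would be Available=True Degraded=True, and only 0/2 would justify Available=False.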

            W. Trevor King added a comment -

            > UpdatingPrometheus should have returned Degraded=True,Available=True after the 3*5minutes retries...

            But during a healthy roll like this, I don't think we should even be Degraded=True. E.g. if at t0 Prom-0 is down but Prom-1 is up, and at t5m Prom-0 is up but Prom-1 is down, and at t10m Prom-0 is up but Prom-1 is still down, and at t15m both Prom are happy again, that's just a healthy roll, working as expected, and not unexpected degradation or some quality-of-service violation we'd expect to improve on.

            I have no problems with you working on OCPBUGS-17506 first, but I think the ask here is to either teach the operator a clearer picture of what a healthy compute-reboot looks like, or to make the naive duration window so long that more healthy compute-reboots fit inside it (with an "is there at least one healthy replica?" check tied to Available to cover the cases where you're actually not available, despite the Degraded long-naive-grace-period still cooking).

            Ayoub Mrini added a comment -

            > So from 12:01:59 through 12:18:30 we didn't know how UpdatingPrometheus was going, but only 12:17:22 through 12:18:30 has that reported as Available=Unknown? If I'm reading that right, I agree, we probably don't want to blindly extend the grace period further.

            Yes, during the 3*5minutes retries, the CMO status should stay Available=True Degraded=False.

            > So maybe the ask for this bug is to expand the UpdatingPrometheus logic to be more nuanced about how it considers Prometheus availability? Because one member rolling in minutes as its host node drains and reboots, followed by the other member doing the same, sounds like it should be Available=True Degraded=False the whole time, as the system happily rides that healthy-churn wave. The current operator logic seems to expect the happy-state will include a period of stability every ~10m, and that seems like an overly-optimistic assumption during MachineConfigPool rolls or other healthy churn.

            Yes, that's what I was trying to say with the following part of my last comment:

            > That being said, I agree that "client rate limiter" isn't the clearest and most helpful log and that Degraded=Unknown,Available=Unknown isn't the best/appropriate status in such cases, as I'm sure that during those tests, 1 out of the 2 Prometheus Pods was always available, it would indeed be more appropriate to report Degraded=True,Available=True. This is due to how we retry/poll, I started a discussion around this here https://issues.redhat.com/browse/OCPBUGS-17506?focusedId=25064499&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25064499
            

            UpdatingPrometheus should have returned Degraded=True,Available=True after the 3*5minutes retries, but given our retrying utils and how they're configured, we end up with the unfortunate Degraded=Unknown,Available=Unknown; I explained that in more detail here https://issues.redhat.com/browse/OCPBUGS-17506?focusedId=25064499&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25064499

            If you agree with me, I think we can close this in favor of https://issues.redhat.com/browse/OCPBUGS-17506

            W. Trevor King added a comment -

            As an aside, the Targeted node ci-op-j87ydxij-0e0a9-tjf7n-worker-c-lrtr6 to %!s(*string=0xc0021446c8) message in my previous comment may have been addressed here.

            W. Trevor King added a comment -

            That run has:

            : [bz-Monitoring] clusteroperator/monitoring should not change condition/Available
            Run #0: Failed    1h54m55s
            {  0 unexpected clusteroperator state transitions during e2e test run, as desired.
            3 unwelcome but acceptable clusteroperator state transitions during e2e test run.  These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail:
            
            Jun 17 12:17:22.197 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
            Jun 17 12:17:22.197 - 68s   E clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
            Jun 17 12:18:30.986 W clusteroperator/monitoring condition/Available reason/RollOutDone status/True Successfully rolled out the stack. (exception: Available=True is the happy case)
            

            Trying to square that 68s of Available=Unknown with 3 tries at 5m intervals:

            $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645325958090752/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-588cfbcf76-kxmtr_cluster-monitoring-operator.log | grep 'task 2 of 16' | grep -B8 12:18:30
            I0617 12:01:55.722279       1 tasks.go:76] ran task 2 of 16: UpdatingPrometheus
            I0617 12:01:59.099350       1 tasks.go:70] running task 2 of 16: UpdatingPrometheus
            W0617 12:07:04.776247       1 tasks.go:73] task 2 of 16: UpdatingPrometheus failed: [unavailable (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline, degraded (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline]
            I0617 12:07:08.052354       1 tasks.go:70] running task 2 of 16: UpdatingPrometheus
            W0617 12:12:13.524429       1 tasks.go:73] task 2 of 16: UpdatingPrometheus failed: [unavailable (unknown): client rate limiter Wait returned an error: context deadline exceeded, degraded (unknown): client rate limiter Wait returned an error: context deadline exceeded]
            I0617 12:12:16.743621       1 tasks.go:70] running task 2 of 16: UpdatingPrometheus
            W0617 12:17:22.163548       1 tasks.go:73] task 2 of 16: UpdatingPrometheus failed: [unavailable (unknown): client rate limiter Wait returned an error: context deadline exceeded, degraded (unknown): client rate limiter Wait returned an error: context deadline exceeded]
            I0617 12:17:25.431595       1 tasks.go:70] running task 2 of 16: UpdatingPrometheus
            I0617 12:18:30.907658       1 tasks.go:76] ran task 2 of 16: UpdatingPrometheus
            

            So from 12:01:59 through 12:18:30 we didn't know how UpdatingPrometheus was going, but only 12:17:22 through 12:18:30 has that reported as Available=Unknown? If I'm reading that right, I agree, we probably don't want to blindly extend the grace period further.

            But here it looks like the issue was a healthy roll:

            $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645325958090752/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-monitoring" and (.involvedObject.name | startswith("prometheus-"))) | (.firstTimestamp // .eventTime // .metadate.creationTimestamp) + " " + (.involvedObject | .kind + " " + .name) + " " + .reason + ": " + .message] | sort[]' | grep -v prometheus-operator | grep 'Stopping container prometheus\|Unhealthy\|Scheduled\|Started container prometheus' | grep 2024-06-17T12
            2024-06-17T12:02:04Z Pod prometheus-k8s-1 Killing: Stopping container prometheus
            2024-06-17T12:02:07Z Pod prometheus-k8s-1 Unhealthy: Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of d20acbd8e7a46b39fe1de8a02ef60429fa834b2a070625eddaf570b07f8b16c0 is running failed: container process not found
            2024-06-17T12:09:40.640115Z Pod prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ci-op-j87ydxij-0e0a9-tjf7n-worker-b-6hnmr
            2024-06-17T12:10:56Z Pod prometheus-k8s-1 Started: Started container prometheus
            2024-06-17T12:11:58Z Pod prometheus-k8s-0 Killing: Stopping container prometheus
            2024-06-17T12:12:00Z Pod prometheus-k8s-0 Unhealthy: Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 9979245dfa2f422a3cbfd366b53e939a463e805f822af0429c8bbfd582ea91f1 is running failed: container process not found
            2024-06-17T12:17:45Z Pod prometheus-k8s-0 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ci-op-j87ydxij-0e0a9-tjf7n-worker-c-lrtr6
            2024-06-17T12:17:55Z Pod prometheus-k8s-0 Started: Started container prometheus
            

            as the machine-config operator worked through the compute MachineConfigPool:

            $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645325958090752/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-machine-config-operator" and .involvedObject.kind == "MachineConfigPool" and .involvedObject.name == "worker") | .firstTimestamp + " " + .reason + ": " + .message] | sort[]' | tail -n4
            2024-06-17T11:54:27Z RenderedConfigGenerated: rendered-worker-2f57e208991f934c42f4ae2b90157db9 successfully generated (release version: 4.17.0-0.nightly-2024-06-17-100744, controller version: 58d31f062a03680185f1c44f63decb77c7f1fabc)
            2024-06-17T11:54:34Z SetDesiredConfig: Targeted node ci-op-j87ydxij-0e0a9-tjf7n-worker-a-m4nhj to %!s(*string=0xc001823748)
            2024-06-17T12:01:57Z SetDesiredConfig: Targeted node ci-op-j87ydxij-0e0a9-tjf7n-worker-b-6hnmr to %!s(*string=0xc00293fa08)
            2024-06-17T12:09:50Z SetDesiredConfig: Targeted node ci-op-j87ydxij-0e0a9-tjf7n-worker-c-lrtr6 to %!s(*string=0xc0021446c8)
            

            So maybe the ask for this bug is to expand the UpdatingPrometheus logic to be more nuanced about how it considers Prometheus availability? Because one member rolling in minutes as its host node drains and reboots, followed by the other member doing the same, sounds like it should be Available=True Degraded=False the whole time, as the system happily rides that healthy-churn wave. The current operator logic seems to expect the happy-state will include a period of stability every ~10m, and that seems like an overly-optimistic assumption during MachineConfigPool rolls or other healthy churn.
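
            For reference, rough arithmetic on those SetDesiredConfig timestamps (GNU date assumed) puts the worker nodes only ~7-8 minutes apart, so the stack never gets a long quiet window during the MachineConfigPool roll:

            $ for t in 11:54:34 12:01:57 12:09:50; do date -u -d "2024-06-17T${t}Z" +%s; done | awk 'NR>1 {printf "%.1f minutes since previous node\n", ($1-prev)/60} {prev=$1}'
            7.4 minutes since previous node
            7.9 minutes since previous node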

            Ayoub Mrini added a comment - - edited

            Thanks for the report Trevor,

            Taking a look at https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645325958090752
            And at CMO's logs https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645325958090752/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-588cfbcf76-kxmtr_cluster-monitoring-operator.log

            I see:

            ...
            I0617 12:01:59.099350       1 tasks.go:70] running task 2 of 16: UpdatingPrometheus
            ...
            W0617 12:07:04.776247       1 tasks.go:73] task 2 of 16: UpdatingPrometheus failed: [unavailable (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline, degraded (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline]
            I0617 12:07:04.776391       1 operator.go:887] 1 reconciliation(s) failed, 2 more attempt(s) will be made before reporting failures.
            E0617 12:07:04.776407       1 operator.go:709] Syncing "openshift-monitoring/cluster-monitoring-config" failed
            E0617 12:07:04.776435       1 operator.go:710] sync "openshift-monitoring/cluster-monitoring-config" failed: cluster monitoring update failed (reason: UpdatingPrometheusFailed)
            I0617 12:07:04.833211       1 operator.go:842] Updating ClusterOperator status to InProgress.
            ...
            W0617 12:12:13.524429       1 tasks.go:73] task 2 of 16: UpdatingPrometheus failed: [unavailable (unknown): client rate limiter Wait returned an error: context deadline exceeded, degraded (unknown): client rate limiter Wait returned an error: context deadline exceeded]
            I0617 12:12:13.524548       1 operator.go:887] 2 reconciliation(s) failed, 1 more attempt(s) will be made before reporting failures.
            E0617 12:12:13.524565       1 operator.go:709] Syncing "openshift-monitoring/cluster-monitoring-config" failed
            E0617 12:12:13.524580       1 operator.go:710] sync "openshift-monitoring/cluster-monitoring-config" failed: cluster monitoring update failed (reason: UpdatingPrometheusFailed)
            I0617 12:12:13.561518       1 operator.go:842] Updating ClusterOperator status to InProgress.
            ...
            W0617 12:17:22.163548       1 tasks.go:73] task 2 of 16: UpdatingPrometheus failed: [unavailable (unknown): client rate limiter Wait returned an error: context deadline exceeded, degraded (unknown): client rate limiter Wait returned an error: context deadline exceeded]
            I0617 12:17:22.163618       1 operator.go:890] 3 reconciliations failed in a row, the threshold of 3 attempts has been reached, failures will be reported.
            E0617 12:17:22.181721       1 operator.go:709] Syncing "openshift-monitoring/cluster-monitoring-config" failed
            E0617 12:17:22.181754       1 operator.go:710] sync "openshift-monitoring/cluster-monitoring-config" failed: cluster monitoring update failed (reason: UpdatingPrometheusFailed)
            ...
            

            As expected, CMO tried to sync Prometheus 3 successive times before reporting failures, waiting 5 minutes each time.

            I don't see any unjustified or unexpected status changes. Maybe I'm missing something.
            And I don't think we should make CMO more lenient (wait for more than 3*5 minutes before reporting a failure); do you think we should?

            That being said, I agree that "client rate limiter" isn't the clearest and most helpful log and that Degraded=Unknown,Available=Unknown isn't the best/appropriate status in such cases, as I'm sure that during those tests, 1 out of the 2 Prometheus Pods was always available, it would indeed be more appropriate to report Degraded=True,Available=True. This is due to how we retry/poll, I started a discussion around this here https://issues.redhat.com/browse/OCPBUGS-17506?focusedId=25064499&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25064499

            "Only going Available!=True when it seems reasonable to summon an emergency admin response."
            I didn't know that Available=Unknown requires an "emergency admin response"; I thought only an explicit Available=False did.

              Assignee: Ayoub Mrini (rh-ee-amrini)
              Reporter: W. Trevor King (trking)
              Other participants: Junqi Zhao, Brian Burt
              Votes: 0
              Watchers: 3