OpenShift Bugs / OCPBUGS-35892

monitoring ClusterOperator should not blip Available=Unknown on client rate limiter

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Normal
    • Affects Version: 4.17
    • Component: Monitoring
    • Severity: Moderate
    • Sprint: MON Sprint 256, MON Sprint 257
    • Story Points: 2
    • Release Note Type: Release Note Not Required
    • Status: In Progress

      Description of problem

      I bumped into monitoring going Available!=True in 4.17 CI, and saw it linking the shipped-in-4.15 OCPBUGS-23745 exception. I've cloned this bug to report the issues we're still seeing. All of the 4.17 hits over the past 96 hours have Available going Unknown on UpdatingPrometheusFailed and mentioning client rate limiter.

      $ curl -s 'https://search.dptools.openshift.org/search?maxAge=96h&type=junit&context=0&name=periodic-.*4.17&search=clusteroperator/monitoring+condition/Available.*status/%5BUF%5D' | jq -r 'to_entries[].value | to_entries[].value[].context[]'
      Jun 18 16:55:17.636 - 2266s E clusteroperator/monitoring condition/Available reason/Unknown status/False Unknown (exception: We are not worried about Available=False or Degraded=True blips for stable-system tests yet.)
      Jun 17 12:17:22.197 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
      Jun 17 12:13:51.618 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
      Jun 17 12:20:39.844 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
      Jun 17 23:45:45.828 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
      Jun 18 18:41:31.933 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
      Jun 20 02:56:45.777 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
      

      Version-Release number of selected component

      I'm only looking at dev/4.17, for any forward-looking improvements.

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?maxAge=96h&type=junit&search=clusteroperator/monitoring+should+not+change+condition/Available' | grep '^periodic-.*4[.]17.*failures match' | sort
      periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 25 runs, 20% failed, 20% of failures match = 4% impact
      periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade (all) - 120 runs, 28% failed, 18% of failures match = 5% impact
      

      How reproducible

      5% impact in periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade is the current largest impact percentage on the most-frequently-run matching job.
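
      For a rough sanity check on that figure, assuming CI Search's impact number is simply the failure rate times the match rate, the arithmetic works out:

      $ awk 'BEGIN { runs=120; failed=0.28*runs; matching=0.18*failed; printf "%.0f failed runs, ~%.0f matching, ~%.0f%% impact\n", failed, matching, 100*matching/runs }'
      34 failed runs, ~6 matching, ~5% impact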

      Steps to Reproduce

      Run periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade or another job with a combination of high-ish impact percentage and high run counts, watching the monitoring ClusterOperator's Available condition.
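
      Outside the CI tooling, a minimal way to watch for these blips by hand (a rough sketch, not what the origin monitor does) is to poll the condition during the upgrade:

      $ while sleep 30; do
          oc get clusteroperator monitoring -o jsonpath='{range .status.conditions[?(@.type=="Available")]}{.lastTransitionTime} {.status} {.reason}: {.message}{"\n"}{end}'
        done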

      Actual results

      Blips of Available=Unknown that resolve more quickly than a responding admin could be expected to show up.

      Expected results

      Only going Available!=True when it seems reasonable to summon an emergency admin response.

      Additional context

      The matching results from my CI Search query above are these runs:

      $ curl -s 'https://search.dptools.openshift.org/search?maxAge=96h&type=junit&context=0&name=periodic-.*4.17&search=clusteroperator/monitoring+condition/Available.*status/%5BUF%5D' | jq -r 'keys[]'
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.17-ocp-e2e-upgrade-aws-ovn-arm64/1803081477055844352
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645325958090752
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645333503643648
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645338536808448
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802819236817539072
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1803104183897821184
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1803592572216545280
      

      Comments

            W. Trevor King added a comment -

            Coming back around to this, and running my original query from the Description again, but this time for 4.18 instead of 4.17:

            $ curl -s 'https://search.dptools.openshift.org/search?maxAge=96h&type=junit&context=0&name=periodic-.*4.18&search=clusteroperator/monitoring+condition/Available.*status/%5BUF%5D' | jq -r 'to_entries[] | .key + "\n" + (.value | to_entries[].value[].context | join("\n"))' | grep -v 'expected availability transition during upgrade'
            https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-metal-ovn-single-node-rt-upgrade-test/1874258998442594304
            Jan 01 03:32:40.820 - 913s  E clusteroperator/monitoring condition/Available reason/MultipleTasksFailed status/False UpdatingAlertmanager: reconciling Alertmanager Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io alertmanager-main), UpdatingUserWorkloadThanosRuler: deleting Thanos Ruler Route failed: the server is currently unable to handle the request (delete routes.route.openshift.io thanos-ruler), UpdatingThanosQuerier: reconciling Thanos Querier Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-querier), UpdatingUserWorkloadPrometheus: deleting UserWorkload federate Route failed: the server is currently unable to handle the request (delete routes.route.openshift.io federate), UpdatingPrometheus: reconciling Prometheus API Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s)
            https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-metal-ovn-single-node-rt-upgrade-test/1874214154449981440
            https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-upgrade-from-stable-4.18-e2e-aws-upgrade-ovn-single-node/1873572334493765632
            https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-upgrade-from-stable-4.18-e2e-metal-ovn-single-node-rt-upgrade/1874214348725948416
            

            So it's only single-node, and it looks like there's an exception in place in the 4.19 origin suite covering "expected availability transition during upgrade" on single-node going forward. Thanks!

            Ayoub Mrini added a comment -

            trking, I'll go ahead and close this ticket based on my latest update. However, I know you won't hesitate to reopen it or create a new one if there's anything else we can improve on.

            Ayoub Mrini added a comment - - edited

            FYI trking
            A fix for https://issues.redhat.com/browse/OCPBUGS-17506 was recently merged (for 4.17; we aren't considering backporting it for now).
            We should now see more accurate error messages instead of "client rate limiter Wait returned an error: context deadline exceeded",
            and more accurate statuses (Available/Degraded) will be set.

            W. Trevor King added a comment -

            I'm happy to take another look at CI Search results after OCPBUGS-17506 has been fixed for a while, once we get to that point, so we can measure instead of guessing.
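
            For example, narrowing the Description's query to the specific message (the exact search regex here is just illustrative) should show whether the client-rate-limiter flavor dries up once builds with the fix are in CI:

            $ curl -s 'https://search.dptools.openshift.org/search?maxAge=96h&type=junit&context=0&name=periodic-.*4.17&search=clusteroperator/monitoring+condition/Available.*client+rate+limiter' | jq -r 'keys[]'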

            Ayoub Mrini added a comment - - edited

            > But during a healthy roll like this, I don't think we should even be Degraded=True. E.g. if at t0 Prom-0 is down but Prom-1 is up, and at t5m Prom-0 is up but Prom-1 is down, and at t10m Prom-0 is up but Prom-1 is still down, and at t15m both Prom are happy again, that's just a healthy roll, working as expected, and not unexpected degradation or some quality-of-service violation we'd expect to improve on.

            I was assuming that at t15m, only one of the two Prom Pods was still available; that's why I talked about Available=True Degraded=True.
            Of course, if both of them are ready, CMO should yield Available=True Degraded=False.
            CMO should already know how to distinguish between these two cases, see https://github.com/openshift/cluster-monitoring-operator/blob/5341b80d4868fb1a330b14c8beff4c3ebba01b69/pkg/client/client.go#L918-L967
            But as discussed in https://issues.redhat.com/browse/OCPBUGS-17506, that util's retrying mechanism kind of swallows the more relevant findings and we should fix that.
            I think fixing OCPBUGS-17506 should also fix OCPBUGS-35892.
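
            As a rough CLI approximation of the one-ready-versus-both-ready distinction above (not the CMO code path itself), the Prometheus StatefulSet's ready count tells the cases apart:

            $ oc -n openshift-monitoring get statefulset prometheus-k8s -o jsonpath='{.status.readyReplicas}/{.spec.replicas}{"\n"}'

            2/2 would be Available=True Degraded=False, 1/2 during a roll would be Available=True Degraded=True, and only 0/2 would justify Available=False.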

            W. Trevor King added a comment -

            > UpdatingPrometheus should have returned Degraded=True,Available=True after the 3*5minutes retries...

            But during a healthy roll like this, I don't think we should even be Degraded=True. E.g. if at t0 Prom-0 is down but Prom-1 is up, and at t5m Prom-0 is up but Prom-1 is down, and at t10m Prom-0 is up but Prom-1 is still down, and at t15m both Prom are happy again, that's just a healthy roll, working as expected, and not unexpected degradation or some quality-of-service violation we'd expect to improve on.

            I have no problems with you working on OCPBUGS-17506 first, but I think the ask here is to either teach the operator a clearer picture of what a healthy compute-reboot looks like, or to make the naive duration window so long that more healthy compute-reboots fit inside it (with an "is there at least one healthy replica?" check tied to Available to cover the cases where you're actually not available, despite the Degraded long-naive-grace-period still cooking).

            Ayoub Mrini added a comment -

            > So from 12:01:59 through 12:18:30 we didn't know how UpdatingPrometheus was going, but only 12:17:22 through 12:18:30 has that reported as Available=Unknown? If I'm reading that right, I agree, we probably don't want to blindly extend the grace period further.

            Yes, during the 3*5minutes retries, the CMO status should stay Available=True Degraded=False.

            > So maybe the ask for this bug is to expand the UpdatingPrometheus logic to be more nuanced about how it considers Prometheus availability? Because one member rolling in minutes as its host node drains and reboots, followed by the other member doing the same, sounds like it should be Available=True Degraded=False the whole time, as the system happily rides that healthy-churn wave. The current operator logic seems to expect the happy-state will include a period of stability every ~10m, and that seems like an overly-optimistic assumption during MachineConfigPool rolls or other healthy churn.

            Yes, that's what I was trying to say with the following part of my last comment:

            > That being said, I agree that "client rate limiter" isn't the clearest and most helpful log and that Degraded=Unknown,Available=Unknown isn't the best/appropriate status in such cases, as I'm sure that during those tests, 1 out of the 2 Prometheus Pods was always available, it would indeed be more appropriate to report Degraded=True,Available=True. This is due to how we retry/poll, I started a discussion around this here https://issues.redhat.com/browse/OCPBUGS-17506?focusedId=25064499&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25064499
            

            UpdatingPrometheus should have returned Degraded=True,Available=True after the 3*5minutes retries, but given our retrying utils and how they're configured, we end up with the unfortunate Degraded=Unknown,Available=Unknown; I explained that in more detail here https://issues.redhat.com/browse/OCPBUGS-17506?focusedId=25064499&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25064499

            If you agree with me, I think we can close this in favor of https://issues.redhat.com/browse/OCPBUGS-17506

            W. Trevor King added a comment -

            As an aside, the Targeted node ci-op-j87ydxij-0e0a9-tjf7n-worker-c-lrtr6 to %!s(*string=0xc0021446c8) message in my previous comment may have been addressed here.

            W. Trevor King added a comment -

            That run has:

            : [bz-Monitoring] clusteroperator/monitoring should not change condition/Available
            Run #0: Failed    1h54m55s
            {  0 unexpected clusteroperator state transitions during e2e test run, as desired.
            3 unwelcome but acceptable clusteroperator state transitions during e2e test run.  These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail:
            
            Jun 17 12:17:22.197 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
            Jun 17 12:17:22.197 - 68s   E clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded (exception: https://issues.redhat.com/browse/OCPBUGS-23745)
            Jun 17 12:18:30.986 W clusteroperator/monitoring condition/Available reason/RollOutDone status/True Successfully rolled out the stack. (exception: Available=True is the happy case)
            

            Trying to square that 68s of Available=Unknown with 3 tries at 5m intervals:

            $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645325958090752/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-588cfbcf76-kxmtr_cluster-monitoring-operator.log | grep 'task 2 of 16' | grep -B8 12:18:30
            I0617 12:01:55.722279       1 tasks.go:76] ran task 2 of 16: UpdatingPrometheus
            I0617 12:01:59.099350       1 tasks.go:70] running task 2 of 16: UpdatingPrometheus
            W0617 12:07:04.776247       1 tasks.go:73] task 2 of 16: UpdatingPrometheus failed: [unavailable (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline, degraded (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline]
            I0617 12:07:08.052354       1 tasks.go:70] running task 2 of 16: UpdatingPrometheus
            W0617 12:12:13.524429       1 tasks.go:73] task 2 of 16: UpdatingPrometheus failed: [unavailable (unknown): client rate limiter Wait returned an error: context deadline exceeded, degraded (unknown): client rate limiter Wait returned an error: context deadline exceeded]
            I0617 12:12:16.743621       1 tasks.go:70] running task 2 of 16: UpdatingPrometheus
            W0617 12:17:22.163548       1 tasks.go:73] task 2 of 16: UpdatingPrometheus failed: [unavailable (unknown): client rate limiter Wait returned an error: context deadline exceeded, degraded (unknown): client rate limiter Wait returned an error: context deadline exceeded]
            I0617 12:17:25.431595       1 tasks.go:70] running task 2 of 16: UpdatingPrometheus
            I0617 12:18:30.907658       1 tasks.go:76] ran task 2 of 16: UpdatingPrometheus
            

            So from 12:01:59 through 12:18:30 we didn't know how UpdatingPrometheus was going, but only 12:17:22 through 12:18:30 has that reported as Available=Unknown? If I'm reading that right, I agree, we probably don't want to blindly extend the grace period further.

            But here it looks like the issue was a healthy roll:

            $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645325958090752/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-monitoring" and (.involvedObject.name | startswith("prometheus-"))) | (.firstTimestamp // .eventTime // .metadate.creationTimestamp) + " " + (.involvedObject | .kind + " " + .name) + " " + .reason + ": " + .message] | sort[]' | grep -v prometheus-operator | grep 'Stopping container prometheus\|Unhealthy\|Scheduled\|Started container prometheus' | grep 2024-06-17T12
            2024-06-17T12:02:04Z Pod prometheus-k8s-1 Killing: Stopping container prometheus
            2024-06-17T12:02:07Z Pod prometheus-k8s-1 Unhealthy: Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of d20acbd8e7a46b39fe1de8a02ef60429fa834b2a070625eddaf570b07f8b16c0 is running failed: container process not found
            2024-06-17T12:09:40.640115Z Pod prometheus-k8s-1 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ci-op-j87ydxij-0e0a9-tjf7n-worker-b-6hnmr
            2024-06-17T12:10:56Z Pod prometheus-k8s-1 Started: Started container prometheus
            2024-06-17T12:11:58Z Pod prometheus-k8s-0 Killing: Stopping container prometheus
            2024-06-17T12:12:00Z Pod prometheus-k8s-0 Unhealthy: Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 9979245dfa2f422a3cbfd366b53e939a463e805f822af0429c8bbfd582ea91f1 is running failed: container process not found
            2024-06-17T12:17:45Z Pod prometheus-k8s-0 Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-0 to ci-op-j87ydxij-0e0a9-tjf7n-worker-c-lrtr6
            2024-06-17T12:17:55Z Pod prometheus-k8s-0 Started: Started container prometheus
            

            as the machine-config operator worked through the compute MachineConfigPool:

            $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645325958090752/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-machine-config-operator" and .involvedObject.kind == "MachineConfigPool" and .involvedObject.name == "worker") | .firstTimestamp + " " + .reason + ": " + .message] | sort[]' | tail -n4
            2024-06-17T11:54:27Z RenderedConfigGenerated: rendered-worker-2f57e208991f934c42f4ae2b90157db9 successfully generated (release version: 4.17.0-0.nightly-2024-06-17-100744, controller version: 58d31f062a03680185f1c44f63decb77c7f1fabc)
            2024-06-17T11:54:34Z SetDesiredConfig: Targeted node ci-op-j87ydxij-0e0a9-tjf7n-worker-a-m4nhj to %!s(*string=0xc001823748)
            2024-06-17T12:01:57Z SetDesiredConfig: Targeted node ci-op-j87ydxij-0e0a9-tjf7n-worker-b-6hnmr to %!s(*string=0xc00293fa08)
            2024-06-17T12:09:50Z SetDesiredConfig: Targeted node ci-op-j87ydxij-0e0a9-tjf7n-worker-c-lrtr6 to %!s(*string=0xc0021446c8)
            

            So maybe the ask for this bug is to expand the UpdatingPrometheus logic to be more nuanced about how it considers Prometheus availability? Because one member rolling in minutes as its host node drains and reboots, followed by the other member doing the same, sounds like it should be Available=True Degraded=False the whole time, as the system happily rides that healthy-churn wave. The current operator logic seems to expect the happy-state will include a period of stability every ~10m, and that seems like an overly-optimistic assumption during MachineConfigPool rolls or other healthy churn.
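
            For reference, rough arithmetic on those SetDesiredConfig timestamps (GNU date assumed) puts the worker nodes only ~7-8 minutes apart, so the stack never gets a long quiet window during the MachineConfigPool roll:

            $ for t in 11:54:34 12:01:57 12:09:50; do date -u -d "2024-06-17T${t}Z" +%s; done | awk 'NR>1 {printf "%.1f minutes since previous node\n", ($1-prev)/60} {prev=$1}'
            7.4 minutes since previous node
            7.9 minutes since previous node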

            Ayoub Mrini added a comment - - edited

            Thanks for the report Trevor,

            Taking a look at https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645325958090752
            And at CMO's logs https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-gcp-ovn-rt-upgrade/1802645325958090752/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods/openshift-monitoring_cluster-monitoring-operator-588cfbcf76-kxmtr_cluster-monitoring-operator.log

            I see:

            ...
            I0617 12:01:59.099350       1 tasks.go:70] running task 2 of 16: UpdatingPrometheus
            ...
            W0617 12:07:04.776247       1 tasks.go:73] task 2 of 16: UpdatingPrometheus failed: [unavailable (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline, degraded (unknown): client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline]
            I0617 12:07:04.776391       1 operator.go:887] 1 reconciliation(s) failed, 2 more attempt(s) will be made before reporting failures.
            E0617 12:07:04.776407       1 operator.go:709] Syncing "openshift-monitoring/cluster-monitoring-config" failed
            E0617 12:07:04.776435       1 operator.go:710] sync "openshift-monitoring/cluster-monitoring-config" failed: cluster monitoring update failed (reason: UpdatingPrometheusFailed)
            I0617 12:07:04.833211       1 operator.go:842] Updating ClusterOperator status to InProgress.
            ...
            W0617 12:12:13.524429       1 tasks.go:73] task 2 of 16: UpdatingPrometheus failed: [unavailable (unknown): client rate limiter Wait returned an error: context deadline exceeded, degraded (unknown): client rate limiter Wait returned an error: context deadline exceeded]
            I0617 12:12:13.524548       1 operator.go:887] 2 reconciliation(s) failed, 1 more attempt(s) will be made before reporting failures.
            E0617 12:12:13.524565       1 operator.go:709] Syncing "openshift-monitoring/cluster-monitoring-config" failed
            E0617 12:12:13.524580       1 operator.go:710] sync "openshift-monitoring/cluster-monitoring-config" failed: cluster monitoring update failed (reason: UpdatingPrometheusFailed)
            I0617 12:12:13.561518       1 operator.go:842] Updating ClusterOperator status to InProgress.
            ...
            W0617 12:17:22.163548       1 tasks.go:73] task 2 of 16: UpdatingPrometheus failed: [unavailable (unknown): client rate limiter Wait returned an error: context deadline exceeded, degraded (unknown): client rate limiter Wait returned an error: context deadline exceeded]
            I0617 12:17:22.163618       1 operator.go:890] 3 reconciliations failed in a row, the threshold of 3 attempts has been reached, failures will be reported.
            E0617 12:17:22.181721       1 operator.go:709] Syncing "openshift-monitoring/cluster-monitoring-config" failed
            E0617 12:17:22.181754       1 operator.go:710] sync "openshift-monitoring/cluster-monitoring-config" failed: cluster monitoring update failed (reason: UpdatingPrometheusFailed)
            ...
            

            As expected, CMO tried to sync Prometheus 3 successive times before reporting failures, waiting 5 minutes each time.

            I don't see any unjustified or unexpected status changes. Maybe I'm missing something.
            And I don't think we should make CMO more lenient (wait for more than 3*5 minutes before reporting a failure); do you think we should?

            That being said, I agree that "client rate limiter" isn't the clearest and most helpful log and that Degraded=Unknown,Available=Unknown isn't the best/appropriate status in such cases, as I'm sure that during those tests, 1 out of the 2 Prometheus Pods was always available, it would indeed be more appropriate to report Degraded=True,Available=True. This is due to how we retry/poll, I started a discussion around this here https://issues.redhat.com/browse/OCPBUGS-17506?focusedId=25064499&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25064499

            "Only going Available!=True when it seems reasonable to summon an emergency admin response."
            I didn't know that Available=Unknown requires an "emergency admin response"; I thought only an explicit Available=False did.

              Assignee: Ayoub Mrini (rh-ee-amrini)
              Reporter: W. Trevor King (trking)
              Other participants: Junqi Zhao, Brian Burt
              Votes: 0
              Watchers: 3