OpenShift Bugs / OCPBUGS-59962

GCP Upgrades Failing Due to Monitoring Operator Degraded


    • Quality / Stability / Reliability
    • MON Sprint 274
    • Done
    • Bug Fix
      Before this update, in multi-zone clusters with only a single worker node per zone, if the Monitoring Operator's Prometheus pods were scheduled to nodes that rebooted back-to-back and each node took longer than 15 minutes to return to service, the Monitoring Operator might have become degraded. With this release, the timeout has been extended to 20 minutes, which prevents the Monitoring Operator from entering a degraded state on common cluster topologies. Clusters where the two nodes hosting Prometheus pods reboot back-to-back and each take more than 20 minutes to return to service might still report a degraded state until the second node and its Prometheus pod are back to normal. (link:https://issues.redhat.com/browse/OCPBUGS-59962[OCPBUGS-59962])

      This is a clone of issue OCPBUGS-59932. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-57215. The following is the description of the original issue:

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      [sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]

      Significant regression detected.
      Fisher's Exact probability of a regression: 100.00%.
      Test pass rate dropped from 100.00% to 87.20%.
      Regression is triaged and believed fixed as of 2025-06-06T16:00:00Z.

      Sample (being evaluated) Release: 4.19
      Start Time: 2025-06-02T00:00:00Z
      End Time: 2025-06-09T12:00:00Z
      Success Rate: 87.20%
      Successes: 218
      Failures: 32
      Flakes: 0

      Base (historical) Release: 4.15
      Start Time: 2024-01-29T00:00:00Z
      End Time: 2024-02-28T00:00:00Z
      Success Rate: 100.00%
      Successes: 625
      Failures: 0
      Flakes: 0

      View the test details report for additional context.

      This test unfortunately suffered a major outage late last week, but it has failed an alarming number of times since then with:

      {  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:200]: cluster is reporting a failing condition: Cluster operator monitoring is degraded
      Ginkgo exit error 1: exit with code 1}
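
      For anyone triaging one of these runs, the Degraded condition and its message can be read directly from the ClusterOperator, alongside the Prometheus pods it is waiting on. A minimal sketch using standard oc commands against the default openshift-monitoring namespace (nothing cluster-specific is assumed):

      # Show the monitoring ClusterOperator's Degraded condition, including its reason and message.
      oc get clusteroperator monitoring -o jsonpath='{.status.conditions[?(@.type=="Degraded")]}{"\n"}'

      # List the Prometheus pods, their readiness, and the nodes they are scheduled to.
      oc -n openshift-monitoring get pods -o wide | grep prometheus-k8s

      # Confirm whether those nodes have come back Ready after their reboots.
      oc get nodes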
      

      Sample job runs are those in the report linked above since around Jun 7th. There appear to be about 6.

      Example from yesterday:

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade/1931532175954415616

      Intervals show:

      source/OperatorDegraded display/true condition/Degraded reason/UpdatingPrometheusFailed status/True UpdatingPrometheus: Prometheus "openshift-monitoring/k8s": SomePodsNotReady: shard 0: pod prometheus-k8s-1: 0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) had volume node affinity conflict, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling. [2m49s]
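
      The "volume node affinity conflict" above is the usual signature of zonal storage on GCP: when Prometheus runs with PVC-backed persistent storage, each replica's volume is pinned to a single zone, so with only one worker per zone the pod has nowhere else to be scheduled while that node reboots. A rough way to confirm this on a live cluster (a sketch only; <pv-name> is a placeholder, and it assumes persistent storage is configured for Prometheus):

      # Find the PVCs backing the Prometheus replicas and the PVs they are bound to.
      oc -n openshift-monitoring get pvc

      # Inspect the zone a volume is pinned to via its node affinity.
      oc get pv <pv-name> -o jsonpath='{.spec.nodeAffinity}{"\n"}'

      # Compare against the zones of the worker nodes.
      oc get nodes -L topology.kubernetes.io/zone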
      

              Jan Fajerski (jfajersk@redhat.com)
              Devan Goodwin (rhn-engineering-dgoodwin)
              Junqi Zhao