OpenShift Bugs / OCPBUGS-60017

GCP Upgrades Failing Due to Monitoring Operator Degraded


    • Quality / Stability / Reliability
    • MON Sprint 274
    • In Progress
    • Bug Fix
      Before this release, in multi-zone clusters with a single worker per zone, the Monitoring Operator could become degraded if the two nodes running its Prometheus pods rebooted sequentially and each took longer than 15 minutes to recover. With this release, the timeout is extended to 20 minutes, reducing the likelihood of the Monitoring Operator entering a degraded state on common cluster topologies. (link:https://issues.redhat.com/browse/OCPBUGS-60017[OCPBUGS-60017])
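
      The change is essentially a longer tolerance window: the operator waits up to 20 minutes (previously 15) for its Prometheus pods to come back before reporting Degraded. The Go sketch below illustrates that pattern with client-go; it is not the Cluster Monitoring Operator's actual code, and the polling interval and resource lookup are assumptions.

      // Illustrative sketch only (not CMO code): wait up to 20 minutes,
      // previously 15, for the prometheus-k8s StatefulSet to report all
      // replicas ready before treating monitoring as Degraded.
      package main

      import (
          "context"
          "fmt"
          "time"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/util/wait"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/rest"
      )

      const prometheusReadyTimeout = 20 * time.Minute // extended from 15m

      func waitForPrometheusReady(ctx context.Context, client kubernetes.Interface) error {
          return wait.PollUntilContextTimeout(ctx, 15*time.Second, prometheusReadyTimeout, true,
              func(ctx context.Context) (bool, error) {
                  sts, err := client.AppsV1().StatefulSets("openshift-monitoring").
                      Get(ctx, "prometheus-k8s", metav1.GetOptions{})
                  if err != nil {
                      return false, nil // transient API error: keep polling
                  }
                  desired := int32(2)
                  if sts.Spec.Replicas != nil {
                      desired = *sts.Spec.Replicas
                  }
                  return sts.Status.ReadyReplicas == desired, nil
              })
      }

      func main() {
          cfg, err := rest.InClusterConfig()
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)
          if err := waitForPrometheusReady(context.Background(), client); err != nil {
              fmt.Println("monitoring would report Degraded:", err)
          }
      }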
    • None
    • None
    • None
    • None

      This is a clone of issue OCPBUGS-59962. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-59932. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-57215. The following is the description of the original issue:

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      [sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]

      Significant regression detected.
      Fisher's Exact probability of a regression: 100.00%.
      Test pass rate dropped from 100.00% to 87.20%.
      Regression is triaged and believed fixed as of 2025-06-06T16:00:00Z.

      Sample (being evaluated) Release: 4.19
      Start Time: 2025-06-02T00:00:00Z
      End Time: 2025-06-09T12:00:00Z
      Success Rate: 87.20%
      Successes: 218
      Failures: 32
      Flakes: 0

      Base (historical) Release: 4.15
      Start Time: 2024-01-29T00:00:00Z
      End Time: 2024-02-28T00:00:00Z
      Success Rate: 100.00%
      Successes: 625
      Failures: 0
      Flakes: 0
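
      The 100.00% figure follows from those counts: zero failures in 625 base runs versus 32 failures in 250 sample runs. The sketch below shows one way to reproduce a number like that with a one-sided Fisher's exact (hypergeometric) test; the exact statistic Component Readiness computes may differ.

      // Rough sketch: one-sided Fisher's exact test on the 2x2 table of
      // pass/fail counts above (base 625/0 vs. sample 218/32).
      package main

      import (
          "fmt"
          "math"
      )

      // lnChoose returns ln(n choose k) via the log-gamma function.
      func lnChoose(n, k float64) float64 {
          a, _ := math.Lgamma(n + 1)
          b, _ := math.Lgamma(k + 1)
          c, _ := math.Lgamma(n - k + 1)
          return a - b - c
      }

      // upperTail returns P(X >= sampleFail) for a hypergeometric X: the chance
      // of seeing at least this many sample failures if pass rates were equal.
      func upperTail(basePass, baseFail, samplePass, sampleFail int) float64 {
          N := float64(basePass + baseFail + samplePass + sampleFail) // all runs
          K := float64(baseFail + sampleFail)                         // all failures
          n := float64(samplePass + sampleFail)                       // sample runs
          p := 0.0
          for k := float64(sampleFail); k <= math.Min(K, n); k++ {
              p += math.Exp(lnChoose(K, k) + lnChoose(N-K, n-k) - lnChoose(N, n))
          }
          return p
      }

      func main() {
          p := upperTail(625, 0, 218, 32)
          fmt.Printf("p-value %.3g, regression probability ~%.2f%%\n", p, (1-p)*100)
      }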

      View the test details report for additional context.

      This test unfortunately suffered a major outage late last week, but has failed an alarming number of times since with:

      {  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:200]: cluster is reporting a failing condition: Cluster operator monitoring is degraded
      Ginkgo exit error 1: exit with code 1}
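
      The failure is the late-upgrade check that no ClusterOperator reports Degraded=True. A minimal sketch of that kind of check against the monitoring ClusterOperator, using openshift/client-go, is below; it is illustrative and not the origin test's actual implementation.

      // Illustrative sketch (not origin's code): report whether the
      // "monitoring" ClusterOperator currently has Degraded=True.
      package main

      import (
          "context"
          "fmt"

          configv1 "github.com/openshift/api/config/v1"
          configclient "github.com/openshift/client-go/config/clientset/versioned"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          client := configclient.NewForConfigOrDie(cfg)

          co, err := client.ConfigV1().ClusterOperators().Get(context.Background(), "monitoring", metav1.GetOptions{})
          if err != nil {
              panic(err)
          }
          for _, cond := range co.Status.Conditions {
              if cond.Type == configv1.OperatorDegraded && cond.Status == configv1.ConditionTrue {
                  fmt.Printf("monitoring Degraded: %s: %s\n", cond.Reason, cond.Message)
              }
          }
      }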
      

      Sample job runs are those in the report linked above since around Jun 7th. There appear to be about 6.

      Example from yesterday:

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade/1931532175954415616

      Intervals show:

      source/OperatorDegraded display/true condition/Degraded reason/UpdatingPrometheusFailed status/True UpdatingPrometheus: Prometheus "openshift-monitoring/k8s": SomePodsNotReady: shard 0: pod prometheus-k8s-1: 0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) had volume node affinity conflict, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling. [2m49s]
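
      This is consistent with the topology in the release note above: the pod's zonal storage pins prometheus-k8s-1 to the single worker in its zone, so while that node is cordoned and rebooted for the upgrade the remaining nodes are either in other zones (volume node affinity conflict) or control-plane nodes (untolerated taint), and the pod stays pending until its node returns. A small client-go sketch for pulling those FailedScheduling events (illustrative only):

      // Illustrative sketch: list FailedScheduling events for prometheus-k8s-1
      // to see why the scheduler could not place it during the upgrade.
      package main

      import (
          "context"
          "fmt"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)

          events, err := client.CoreV1().Events("openshift-monitoring").List(context.Background(),
              metav1.ListOptions{FieldSelector: "involvedObject.name=prometheus-k8s-1,reason=FailedScheduling"})
          if err != nil {
              panic(err)
          }
          for _, ev := range events.Items {
              fmt.Printf("%s: %s\n", ev.LastTimestamp.Format("15:04:05"), ev.Message)
          }
      }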
      

              jfajersk@redhat.com Jan Fajerski
              rhn-engineering-dgoodwin Devan Goodwin
              Junqi Zhao